Data Analysis
Computing summary statistics and aggregates over the survey data
Concepts
GROUP BYcollects all rows that share the same value in the specified column into a group- aggregate functions like
AVG,COUNT,SUM, andMAXcompute a single value per group rather than per row
- aggregate functions like
COUNT(*)counts all rows in the group (including those with NULL values)COUNT(column)counts only rows where that column is not NULL- an important distinction for data with missing values
- Integer division in SQLite:
3 / 2returns1, not1.5- to get a decimal result, cast at least one operand:
CAST(3 AS REAL) / 2
- to get a decimal result, cast at least one operand:
- Define the question before writing code
- writing out what you want to know in plain language before asking the LLM to generate SQL produces better, more verifiable queries
- When to use Python instead of SQL: if the computation requires libraries
(NumPy, scipy), complex iteration, or would be clearer as a Python function
- otherwise prefer SQL: the database is faster and keeps the logic close to the data
What researchers actually want
- Mean diameter by site, mutation rate per survey year, correlation between site elevation and shell size
- Defining the questions before writing code
Aggregation queries in SQL
GROUP BY,AVG,COUNT, andMAX- Writing and running these queries directly in
sqlite3 - Reading the results as Python dicts
A summary statistics page
- A new route that runs several queries
- Rendering the results as an HTML table with a brief interpretation
Computed columns
- Deriving mutation rate as a percentage
- Doing arithmetic in SQL versus doing it in Python
- When to pull data out and use Polars or NumPy instead
Check for Understanding
Write a SQL query (without running it) that computes the average shell diameter grouped by site name.
SELECT site, AVG(diameter) AS mean_diameter
FROM snails
GROUP BY site
ORDER BY site;
Each row in the result contains a site name and the mean diameter of all measurements
at that site. ORDER BY site makes the result deterministic.
What is the difference between COUNT(*) and COUNT(diameter)?
COUNT(*) counts every row in the group, regardless of whether any column is NULL.
COUNT(diameter) counts only rows where the diameter column is not NULL. If some
measurements have a missing diameter, COUNT(*) will be larger than COUNT(diameter),
which matters for computing rates or percentages.
If you compute a mutation rate as mutated_count / total_count, what result do you get in SQLite if both values are integers, and how do you fix it?
SQLite performs integer division, so 3 / 10 returns 0 rather than 0.3. To get
a decimal result, cast one value: CAST(mutated_count AS REAL) / total_count or
multiply by 1.0: mutated_count * 1.0 / total_count.
When would you pull data into Python and use Polars or NumPy instead of doing the computation in SQL?
When the computation: requires a library function (e.g., scipy.stats.pearsonr for
correlations, numpy.percentile for percentiles); involves complex control flow that
would be awkward in SQL; or when you need to combine data from multiple queries in
non-trivial ways. For simple aggregates (mean, count, sum, max), SQL is almost always
the better choice because the database is faster and the query documents the intent.
Exercises
Add a trend query
Ask the LLM to write a query that computes the average shell diameter grouped by survey year. Before accepting the output, write out in plain English what the query should do. Run the query and display the results as a simple HTML table on the summary page.
Compare SQL and Python
Implement the same aggregation—mean diameter by site—twice: once as a SQL query with
GROUP BY and once by fetching all rows and computing means in Python using the
statistics module. Compare the results (they should be identical), compare the code
length and readability, and note any differences in what each approach makes easy
or hard.
Document the queries
For each query on the summary statistics page, write a one-sentence description of what it computes and a one-sentence description of why a researcher would care about that number. Ask the LLM to write the same descriptions and compare them to yours. Note where the LLM's framing differs from a researcher's perspective.
Add confidence intervals
Ask the LLM to add a 95% confidence interval for mean diameter by site to the summary table. Before accepting the output, look up how a confidence interval for a mean is computed. Verify that the LLM's formula is statistically correct; if it is not, identify the error and explain what would need to change.