Data Analysis

Computing summary statistics and aggregates over the survey data

Concepts

GROUP BY collects all rows that share the same value in the specified column into a group
- aggregate functions like AVG, COUNT, SUM, and MAX compute a single value per group rather than per row
COUNT(*) counts all rows in the group (including those with NULL values)
- COUNT(column) counts only rows where that column is not NULL
- an important distinction for data with missing values
Integer division in SQLite: 3 / 2 returns 1, not 1.5
- to get a decimal result, cast at least one operand: CAST(3 AS REAL) / 2
Define the question before writing code
- writing out what you want to know in plain language before asking the LLM to generate SQL produces better, more verifiable queries
When to use Python instead of SQL: if the computation requires libraries (NumPy, scipy), complex iteration, or would be clearer as a Python function
- otherwise prefer SQL: the database is faster and keeps the logic close to the data

What researchers actually want

Mean diameter by site, mutation rate per survey year, correlation between site elevation and shell size
Defining the questions before writing code

Aggregation queries in SQL

GROUP BY, AVG, COUNT, and MAX
Writing and running these queries directly in sqlite3
Reading the results as Python dicts

A summary statistics page

A new route that runs several queries
Rendering the results as an HTML table with a brief interpretation

Computed columns

Deriving mutation rate as a percentage
Doing arithmetic in SQL versus doing it in Python
When to pull data out and use Polars or NumPy instead

Check for Understanding

Write a SQL query (without running it) that computes the average shell diameter grouped by site name.

SELECT site, AVG(diameter) AS mean_diameter
FROM snails
GROUP BY site
ORDER BY site;

Each row in the result contains a site name and the mean diameter of all measurements at that site. ORDER BY site makes the result deterministic.

What is the difference between COUNT(*) and COUNT(diameter)?

COUNT(*) counts every row in the group, regardless of whether any column is NULL. COUNT(diameter) counts only rows where the diameter column is not NULL. If some measurements have a missing diameter, COUNT(*) will be larger than COUNT(diameter), which matters for computing rates or percentages.

If you compute a mutation rate as mutated_count / total_count, what result do you get in SQLite if both values are integers, and how do you fix it?

SQLite performs integer division, so 3 / 10 returns 0 rather than 0.3. To get a decimal result, cast one value: CAST(mutated_count AS REAL) / total_count or multiply by 1.0: mutated_count * 1.0 / total_count.

When would you pull data into Python and use Polars or NumPy instead of doing the computation in SQL?

When the computation: requires a library function (e.g., scipy.stats.pearsonr for correlations, numpy.percentile for percentiles); involves complex control flow that would be awkward in SQL; or when you need to combine data from multiple queries in non-trivial ways. For simple aggregates (mean, count, sum, max), SQL is almost always the better choice because the database is faster and the query documents the intent.

Exercises

Add a trend query

Ask the LLM to write a query that computes the average shell diameter grouped by survey year. Before accepting the output, write out in plain English what the query should do. Run the query and display the results as a simple HTML table on the summary page.

Compare SQL and Python

Implement the same aggregation—mean diameter by site—twice: once as a SQL query with GROUP BY and once by fetching all rows and computing means in Python using the statistics module. Compare the results (they should be identical), compare the code length and readability, and note any differences in what each approach makes easy or hard.

Document the queries

For each query on the summary statistics page, write a one-sentence description of what it computes and a one-sentence description of why a researcher would care about that number. Ask the LLM to write the same descriptions and compare them to yours. Note where the LLM's framing differs from a researcher's perspective.

Add confidence intervals

Ask the LLM to add a 95% confidence interval for mean diameter by site to the summary table. Before accepting the output, look up how a confidence interval for a mean is computed. Verify that the LLM's formula is statistically correct; if it is not, identify the error and explain what would need to change.