Data Analysis

Computing summary statistics and aggregates over the survey data

Concepts

What researchers actually want

Aggregation queries in SQL

A summary statistics page

Computed columns

Check for Understanding

Write a SQL query (without running it) that computes the average shell diameter grouped by site name.
SELECT site, AVG(diameter) AS mean_diameter
FROM snails
GROUP BY site
ORDER BY site;

Each row in the result contains a site name and the mean diameter of all measurements at that site. ORDER BY site makes the result deterministic.

What is the difference between COUNT(*) and COUNT(diameter)?

COUNT(*) counts every row in the group, regardless of whether any column is NULL. COUNT(diameter) counts only rows where the diameter column is not NULL. If some measurements have a missing diameter, COUNT(*) will be larger than COUNT(diameter), which matters for computing rates or percentages.

If you compute a mutation rate as mutated_count / total_count, what result do you get in SQLite if both values are integers, and how do you fix it?

SQLite performs integer division, so 3 / 10 returns 0 rather than 0.3. To get a decimal result, cast one value: CAST(mutated_count AS REAL) / total_count or multiply by 1.0: mutated_count * 1.0 / total_count.

When would you pull data into Python and use Polars or NumPy instead of doing the computation in SQL?

When the computation: requires a library function (e.g., scipy.stats.pearsonr for correlations, numpy.percentile for percentiles); involves complex control flow that would be awkward in SQL; or when you need to combine data from multiple queries in non-trivial ways. For simple aggregates (mean, count, sum, max), SQL is almost always the better choice because the database is faster and the query documents the intent.

Exercises

Add a trend query

Ask the LLM to write a query that computes the average shell diameter grouped by survey year. Before accepting the output, write out in plain English what the query should do. Run the query and display the results as a simple HTML table on the summary page.

Compare SQL and Python

Implement the same aggregation—mean diameter by site—twice: once as a SQL query with GROUP BY and once by fetching all rows and computing means in Python using the statistics module. Compare the results (they should be identical), compare the code length and readability, and note any differences in what each approach makes easy or hard.

Document the queries

For each query on the summary statistics page, write a one-sentence description of what it computes and a one-sentence description of why a researcher would care about that number. Ask the LLM to write the same descriptions and compare them to yours. Note where the LLM's framing differs from a researcher's perspective.

Add confidence intervals

Ask the LLM to add a 95% confidence interval for mean diameter by site to the summary table. Before accepting the output, look up how a confidence interval for a mean is computed. Verify that the LLM's formula is statistically correct; if it is not, identify the error and explain what would need to change.