Aggregating and Grouping
Terms defined: aggregation, aggregation function, group
See also "Essential SQL Techniques"
select sum(body_mass_g) as total_mass
from penguins;
| total_mass |
|------------|
| 1437000.0 |
- Aggregation combines many values to produce one
sumis an aggregation function- Combines corresponding values from multiple rows
Common Aggregation Functions
select
max(bill_length_mm) as longest_bill,
min(flipper_length_mm) as shortest_flipper,
avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;
| longest_bill | shortest_flipper | weird_ratio |
|--------------|------------------|------------------|
| 59.6 | 172.0 | 2.56087082530644 |
- This actually shouldn't work: can't calculate maximum or average if any values are null
- SQL does the useful thing instead of the right one
Exercise
What is the average body mass of penguins that weigh more than 3000.0 grams?
Counting
select
count(*) as count_star,
count(sex) as count_specific,
count(distinct sex) as count_distinct
from penguins;
| count_star | count_specific | count_distinct |
|------------|----------------|----------------|
| 344 | 333 | 2 |
count(*)counts rowscount(column)counts non-null entries in columncount(distinct column)counts distinct non-null entries
Exercise
How many different body masses are in the penguins dataset?
Grouping
select avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| average_mass_g |
|------------------|
| 4005.55555555556 |
| 3862.27272727273 |
| 4545.68452380952 |
- Put rows in groups based on distinct combinations of values in columns specified with
group by - Then perform aggregation separately for each group
- But which is which?
Behavior of Unaggregated Columns
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|------------------|
| | 4005.55555555556 |
| FEMALE | 3862.27272727273 |
| MALE | 4545.68452380952 |
- All rows in each group have the same value for
sex, so no need to aggregate
Arbitrary Choice in Aggregation
select
sex,
body_mass_g
from penguins
group by sex;
| sex | body_mass_g |
|--------|-------------|
| | |
| FEMALE | 3800.0 |
| MALE | 3750.0 |
- If we don't specify how to aggregate a column,
SQLite chooses any arbitrary value from the group
- All penguins in each group have the same sex because we grouped by that, so we get the right answer
- The body mass values are in the data but unpredictable
- A common mistake
- Other database managers don't do this
- E.g., PostgreSQL complains that column must be used in an aggregation function
Exercise
Explain why the output of the previous query has a blank line before the rows for female and male penguins.
Write a query that shows each distinct body mass in the penguin dataset and the number of penguins that weigh that much.
Filtering Aggregated Values
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|------------------|
| | 4005.55555555556 |
| MALE | 4545.68452380952 |
- Using
having conditioninstead ofwhere conditionfor aggregates
Readable Output
select
sex,
round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|----------------|
| | 4005.6 |
| MALE | 4545.7 |
- Use
round(value, decimals)to round off a number
Filtering Aggregate Inputs
select
sex,
round(
avg(body_mass_g) filter (where body_mass_g < 4000.0),
1
) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|----------------|
| | 3362.5 |
| FEMALE | 3417.3 |
| MALE | 3729.6 |
filter (where condition)applies to inputs
Exercise
Write a query that uses filter to calculate the average body masses
of heavy penguins (those over 4500 grams)
and light penguins (those under 3500 grams)
simultaneously.
Is it possible to do this using where instead of filter?