The Querynomicon

An Introduction to SQL for Weary Data Scientists

cover art by Danielle Navarro

Upon first encountering SQL after two decades of Fortran, C, Java, and Python, I thought I had stumbled into hell. I quickly realized that was optimistic: after all, hell has rules.

I have since realized that SQL does too, and that they are no more confusing or contradictory than those of most other programming languages. They only appear so because it draws on a tradition unfamiliar to those of us raised with derivatives of C. To quote Terry Pratchett, it is not mad, just differently sane.

Welcome, then, to a world in which the strange will become familiar, and the familiar, strange. Welcome, thrice welcome, to SQL.

"The Querynomicon" is a Third Bit production.

What This Is

Scope

Setup

Background Concepts

box and arrow concept map of major concepts related to databases
Figure 1: overview of major concepts

Connecting to Database

src/connect_penguins.sh

sqlite3 db/penguins.db

To disconnect from an interactive database session, type Control-D or .quit on a line of its own. You may need to type a semi-colon ; to close any unfinished query before SQLite will recognize your attempt to escape.

1: Selecting Constant

src/select_1.sql

select 1;

out/select_1.out

1

2: Selecting All Values from Table

src/select_star.sql

select * from little_penguins;

out/select_star.out

Gentoo|Biscoe|51.3|14.2|218.0|5300.0|MALE
Adelie|Dream|35.7|18.0|202.0|3550.0|FEMALE
Adelie|Torgersen|36.6|17.8|185.0|3700.0|FEMALE
Chinstrap|Dream|55.8|19.8|207.0|4000.0|MALE
Adelie|Dream|38.1|18.6|190.0|3700.0|FEMALE
Adelie|Dream|36.2|17.3|187.0|3300.0|FEMALE
Adelie|Dream|39.5|17.8|188.0|3300.0|FEMALE
Gentoo|Biscoe|42.6|13.7|213.0|4950.0|FEMALE
Gentoo|Biscoe|52.1|17.0|230.0|5550.0|MALE
Adelie|Torgersen|36.7|18.8|187.0|3800.0|FEMALE

Administrative Commands

src/admin_commands.sql

.headers on
.mode markdown
select * from little_penguins;

out/admin_commands.out

|  species  |  island   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |  sex   |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| Gentoo    | Biscoe    | 51.3           | 14.2          | 218.0             | 5300.0      | MALE   |
| Adelie    | Dream     | 35.7           | 18.0          | 202.0             | 3550.0      | FEMALE |
| Adelie    | Torgersen | 36.6           | 17.8          | 185.0             | 3700.0      | FEMALE |
| Chinstrap | Dream     | 55.8           | 19.8          | 207.0             | 4000.0      | MALE   |
| Adelie    | Dream     | 38.1           | 18.6          | 190.0             | 3700.0      | FEMALE |
| Adelie    | Dream     | 36.2           | 17.3          | 187.0             | 3300.0      | FEMALE |
| Adelie    | Dream     | 39.5           | 17.8          | 188.0             | 3300.0      | FEMALE |
| Gentoo    | Biscoe    | 42.6           | 13.7          | 213.0             | 4950.0      | FEMALE |
| Gentoo    | Biscoe    | 52.1           | 17.0          | 230.0             | 5550.0      | MALE   |
| Adelie    | Torgersen | 36.7           | 18.8          | 187.0             | 3800.0      | FEMALE |

3: Specifying Columns

src/specify_columns.sql

select
    species,
    island,
    sex
from little_penguins;

out/specify_columns.out

|  species  |  island   |  sex   |
|-----------|-----------|--------|
| Gentoo    | Biscoe    | MALE   |
| Adelie    | Dream     | FEMALE |
| Adelie    | Torgersen | FEMALE |
| Chinstrap | Dream     | MALE   |
| Adelie    | Dream     | FEMALE |
| Adelie    | Dream     | FEMALE |
| Adelie    | Dream     | FEMALE |
| Gentoo    | Biscoe    | FEMALE |
| Gentoo    | Biscoe    | MALE   |
| Adelie    | Torgersen | FEMALE |

4: Sorting

src/sort.sql

select
    species,
    sex,
    island
from little_penguins
order by island asc, sex desc;

out/sort.out

|  species  |  sex   |  island   |
|-----------|--------|-----------|
| Gentoo    | MALE   | Biscoe    |
| Gentoo    | MALE   | Biscoe    |
| Gentoo    | FEMALE | Biscoe    |
| Chinstrap | MALE   | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Torgersen |
| Adelie    | FEMALE | Torgersen |

Practice

Exercise 1: Write a SQL query to select the sex and body mass columns from the little_penguins in that order, sorted such that the largest body mass appears first.

5: Limiting Output

src/limit.sql

select
    species,
    sex,
    island
from penguins
order by species, sex, island
limit 10;

out/limit.out

| species |  sex   |  island   |
|---------|--------|-----------|
| Adelie  |        | Dream     |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |

6: Paging Output

src/page.sql

select
    species,
    sex,
    island
from penguins
order by species, sex, island
limit 10 offset 3;

out/page.out

| species |  sex   |  island   |
|---------|--------|-----------|
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |

7: Removing Duplicates

src/distinct.sql

select distinct
    species,
    sex,
    island
from penguins;

out/distinct.out

|  species  |  sex   |  island   |
|-----------|--------|-----------|
| Adelie    | MALE   | Torgersen |
| Adelie    | FEMALE | Torgersen |
| Adelie    |        | Torgersen |
| Adelie    | FEMALE | Biscoe    |
| Adelie    | MALE   | Biscoe    |
| Adelie    | FEMALE | Dream     |
| Adelie    | MALE   | Dream     |
| Adelie    |        | Dream     |
| Chinstrap | FEMALE | Dream     |
| Chinstrap | MALE   | Dream     |
| Gentoo    | FEMALE | Biscoe    |
| Gentoo    | MALE   | Biscoe    |
| Gentoo    |        | Biscoe    |

Practice

Exercise 2: Write a SQL query to select the islands and species from rows 50 to 60 inclusive of the penguins table. Your result should have 11 rows.

Exercise 3: Modify your query to select distinct combinations of island and species from the same rows and compare the result to what you got in part 1.

8: Filtering Results

src/filter.sql

select distinct
    species,
    sex,
    island
from penguins
where island = 'Biscoe';

out/filter.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Adelie  | MALE   | Biscoe |
| Gentoo  | FEMALE | Biscoe |
| Gentoo  | MALE   | Biscoe |
| Gentoo  |        | Biscoe |

Practice

Exercise 4: Write a query to select the body masses from penguins that are less than 3000.0 grams.

Exercise 5: Write another query to select the species and sex of penguins that weight less than 3000.0 grams. This shows that the columns displayed and those used in filtering are independent of each other.

9: Filtering with More Complex Conditions

src/filter_and.sql

select distinct
    species,
    sex,
    island
from penguins
where island = 'Biscoe' and sex != 'MALE';

out/filter_and.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Gentoo  | FEMALE | Biscoe |

Practice

Exercise 6: Use the not operator to select penguins that are not Gentoos.

Exercise 7: SQL’s or is an inclusive or: it succeeds if either or both conditions are true. SQL does not provide a specific operator for exclusive or, which is true if either but not both conditions are true, but the same effect can be achieved using and, or, and not. Write a query to select penguins that are female or on Torgersen Island but not both.

10: Doing Calculations

src/calculations.sql

select
    flipper_length_mm / 10.0,
    body_mass_g / 1000.0
from penguins
limit 3;

out/calculations.out

| flipper_length_mm / 10.0 | body_mass_g / 1000.0 |
|--------------------------|----------------------|
| 18.1                     | 3.75                 |
| 18.6                     | 3.8                  |
| 19.5                     | 3.25                 |

11: Renaming Columns

src/rename_columns.sql

select
    flipper_length_mm / 10.0 as flipper_cm,
    body_mass_g / 1000.0 as weight_kg,
    island as where_found
from penguins
limit 3;

out/rename_columns.out

| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1       | 3.75      | Torgersen   |
| 18.6       | 3.8       | Torgersen   |
| 19.5       | 3.25      | Torgersen   |

Practice

Exercise 8: Write a single query that calculates and returns:

  1. A column called what_where that has the species and island of each penguin separated by a single space.
  2. A column called bill_ratio that has the ratio of bill length to bill depth.

You can use the || operator to concatenate text to solve part 1, or look at the documentation for SQLite’s format() function.

Check Understanding

box and arrow diagram of concepts related to selection
Figure 2: selection

12: Calculating with Missing Values

src/show_missing_values.sql

select
    flipper_length_mm / 10.0 as flipper_cm,
    body_mass_g / 1000.0 as weight_kg,
    island as where_found
from penguins
limit 5;

out/show_missing_values.out

| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1       | 3.75      | Torgersen   |
| 18.6       | 3.8       | Torgersen   |
| 19.5       | 3.25      | Torgersen   |
|            |           | Torgersen   |
| 19.3       | 3.45      | Torgersen   |

Practice

Exercise 9: Use SQLite’s .nullvalue command to change the printed representation of null to the string null and then re-run the previous query. When will displaying null as null be easier to understand? When might it be misleading?

13: Null Equality

src/filter.sql

select distinct
    species,
    sex,
    island
from penguins
where island = 'Biscoe';

out/filter.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Adelie  | MALE   | Biscoe |
| Gentoo  | FEMALE | Biscoe |
| Gentoo  | MALE   | Biscoe |
| Gentoo  |        | Biscoe |

src/null_equality.sql

select distinct
    species,
    sex,
    island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';

out/null_equality.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Gentoo  | FEMALE | Biscoe |

14: Null Inequality

src/null_inequality.sql

select distinct
    species,
    sex,
    island
from penguins
where island = 'Biscoe' and sex != 'FEMALE';

out/null_inequality.out

| species | sex  | island |
|---------|------|--------|
| Adelie  | MALE | Biscoe |
| Gentoo  | MALE | Biscoe |

15: Ternary Logic

src/ternary_logic.sql

select null = null;

out/ternary_logic.out

| null = null |
|-------------|
|             |
equality
X Y null
X true false null
Y false true null
null null null null

16: Handling Null Safely

src/safe_null_equality.sql

select
    species,
    sex,
    island
from penguins
where sex is null;

out/safe_null_equality.out

| species | sex |  island   |
|---------|-----|-----------|
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Dream     |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |

Practice

Exercise 10: Write a query to find penguins whose body mass is known but whose sex is not.

Exercise 11: Write another query to find penguins whose sex is known but whose body mass is not.

Check Understanding

box and arrow diagram of concepts related to null values in SQL
Figure 3: null

17: Aggregating

src/simple_sum.sql

select sum(body_mass_g) as total_mass
from penguins;

out/simple_sum.out

| total_mass |
|------------|
| 1437000.0  |

18: Common Aggregation Functions

src/common_aggregations.sql

select
    max(bill_length_mm) as longest_bill,
    min(flipper_length_mm) as shortest_flipper,
    avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;

out/common_aggregations.out

| longest_bill | shortest_flipper |   weird_ratio    |
|--------------|------------------|------------------|
| 59.6         | 172.0            | 2.56087082530644 |

Practice

Exercise 12: What is the average body mass of penguins that weight more than 3000.0 grams?

19: Counting

src/count_behavior.sql

select
    count(*) as count_star,
    count(sex) as count_specific,
    count(distinct sex) as count_distinct
from penguins;

out/count_behavior.out

| count_star | count_specific | count_distinct |
|------------|----------------|----------------|
| 344        | 333            | 2              |

Practice

Exercise 13: How many different body masses are in the penguins dataset?

20: Grouping

src/simple_group.sql

select avg(body_mass_g) as average_mass_g
from penguins
group by sex;

out/simple_group.out

|  average_mass_g  |
|------------------|
| 4005.55555555556 |
| 3862.27272727273 |
| 4545.68452380952 |

21: Behavior of Unaggregated Columns

src/unaggregated_columns.sql

select
    sex,
    avg(body_mass_g) as average_mass_g
from penguins
group by sex;

out/unaggregated_columns.out

|  sex   |  average_mass_g  |
|--------|------------------|
|        | 4005.55555555556 |
| FEMALE | 3862.27272727273 |
| MALE   | 4545.68452380952 |

22: Arbitrary Choice in Aggregation

src/arbitrary_in_aggregation.sql

select
    sex,
    body_mass_g                   
from penguins
group by sex;

out/arbitrary_in_aggregation.out

|  sex   | body_mass_g |
|--------|-------------|
|        |             |
| FEMALE | 3800.0      |
| MALE   | 3750.0      |

Practice

Exercise 14: Explain why the output of the previous query has a blank line before the rows for female and male penguins.

Exercise 15: Write a query that shows each distinct body mass in the penguin dataset and the number of penguins that weigh that much.

23: Filtering Aggregated Values

src/filter_aggregation.sql

select
    sex,
    avg(body_mass_g) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;

out/filter_aggregation.out

| sex  |  average_mass_g  |
|------|------------------|
|      | 4005.55555555556 |
| MALE | 4545.68452380952 |

24: Readable Output

src/readable_aggregation.sql

select
    sex,
    round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;

out/readable_aggregation.out

| sex  | average_mass_g |
|------|----------------|
|      | 4005.6         |
| MALE | 4545.7         |

25: Filtering Aggregate Inputs

src/filter_aggregate_inputs.sql

select
    sex,
    round(
        avg(body_mass_g) filter (where body_mass_g < 4000.0),
        1
    ) as average_mass_g
from penguins
group by sex;

out/filter_aggregate_inputs.out

|  sex   | average_mass_g |
|--------|----------------|
|        | 3362.5         |
| FEMALE | 3417.3         |
| MALE   | 3729.6         |

Practice

Exercise 16: Write a query that uses filter to calculate the average body masses of heavy penguins (those over 4500 grams) and light penguins (those under 3500 grams) simultaneously. Is it possible to do this using where instead of filter?

Check Understanding

box and arrow diagram of concepts related to aggregation in SQL
Figure 4: aggregation

Creating In-memory Database

src/in_memory_db.sh

sqlite3 :memory:

26: Creating Tables

src/create_work_job.sql

create table job (
    name text not null,
    billable real not null
);
create table work (
    person text not null,
    job text not null
);

27: Inserting Data

src/populate_work_job.sql

insert into job values
('calibrate', 1.5),
('clean', 0.5);
insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');

out/insert_values.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
| clean     | 0.5      |

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |
| tay    | complain  |

Following Along

Practice

Exercise 17: Using an in-memory database, define a table called notes with two text columns author and note and then add three or four rows. Use a query to check that the notes have been stored and that you can (for example) select by author name.

Exercise 18: What happens if you try to insert too many or too few values into notes? What happens if you insert a number instead of a string into the note field?

28: Updating Rows

src/update_work_job.sql

update work
set person = 'tae'
where person = 'tay';

out/update_rows.out

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |
| tae    | complain  |

29: Deleting Rows

src/delete_rows.sql

delete from work
where person = 'tae';

select * from work;

out/delete_rows.out

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |

Practice

Exercise 19: What happens if you try to delete rows that don’t exist (e.g., all entries in work that refer to juna)?

30: Backing Up

src/backing_up.sql

create table backup (
    person text not null,
    job text not null
);

insert into backup
select
    person,
    job
from work
where person = 'tae';

delete from work
where person = 'tae';

select * from backup;

out/backing_up.out

| person |   job    |
|--------|----------|
| tae    | complain |

Practice

Exercise 20: Saving and restoring data as text:

  1. Re-create the notes table in an in-memory database and then use SQLite’s .output and .dump commands to save the database to a file called notes.sql. Inspect the contents of this file: how has your data been stored?

  2. Start a fresh SQLite session and load notes.sql using the .read command. Inspect the database using .schema and select *: is everything as you expected?

Exercise 21: Saving and restoring data in binary format:

  1. Re-create the notes table in an in-memory database once again and use SQLite’s .backup command to save it to a file called notes.db. Inspect this file using od -c notes.db or a text editor that can handle binary data: how has your data been stored?

  2. Start a fresh SQLite session and load notes.db using the .restore command. Inspect the database using .schema and select *: is everything as you expected?

Check Understanding

box and arrow diagram of concepts relatd to defining and modifying data
Figure 5: data definition and modification

31: Combining Information

src/cross_join.sql

select *
from work cross join job;

out/cross_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | calibrate | clean     | 0.5      |
| mik    | clean     | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| mik    | complain  | calibrate | 1.5      |
| mik    | complain  | clean     | 0.5      |
| po     | clean     | calibrate | 1.5      |
| po     | clean     | clean     | 0.5      |
| po     | complain  | calibrate | 1.5      |
| po     | complain  | clean     | 0.5      |
| tay    | complain  | calibrate | 1.5      |
| tay    | complain  | clean     | 0.5      |

32: Inner Join

src/inner_join.sql

select *
from work inner join job
    on work.job = job.name;

out/inner_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| po     | clean     | clean     | 0.5      |

Practice

Exercise 22: Re-run the query shown above using where job = name instead of the full table.name notation. Is the shortened form easier or harder to read and more or less likely to cause errors?

33: Aggregating Joined Data

src/aggregate_join.sql

select
    work.person,
    sum(job.billable) as pay
from work inner join job
    on work.job = job.name
group by work.person;

out/aggregate_join.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |

34: Left Join

src/left_join.sql

select *
from work left join job
    on work.job = job.name;

out/left_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| mik    | complain  |           |          |
| po     | clean     | clean     | 0.5      |
| po     | complain  |           |          |
| tay    | complain  |           |          |

35: Aggregating Left Joins

src/aggregate_left_join.sql

select
    work.person,
    sum(job.billable) as pay
from work left join job
    on work.job = job.name
group by work.person;

out/aggregate_left_join.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |
| tay    |     |

36: Coalescing Values

src/coalesce.sql

select
    work.person,
    coalesce(sum(job.billable), 0.0) as pay
from work left join job
    on work.job = job.name
group by work.person;

out/coalesce.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |
| tay    | 0.0 |

Full Outer Join

src/full_outer_join.sql

create table size (
    s text not null
);
insert into size values ('light'), ('heavy');

create table weight (
    w text not null
);

select * from size full outer join weight;

out/full_outer_join.out

|   s   | w |
|-------|---|
| light |   |
| heavy |   |

Practice

Exercise 23: Find the least time each person spent on any job. Your output should show that mik and po each spent 0.5 hours on some job. Can you find a way to show the name of the job as well using the SQL you have seen so far?

Check Understanding

box and arrow diagram of concepts related to joining tables
Figure 6: join

37: Negating Incorrectly

src/negate_incorrectly.sql

select distinct person
from work
where job != 'calibrate';

out/negate_incorrectly.out

| person |
|--------|
| mik    |
| po     |
| tay    |

38: Set Membership

src/set_membership.sql

select *
from work
where person not in ('mik', 'tay');

out/set_membership.out

| person |   job    |
|--------|----------|
| po     | clean    |
| po     | complain |

39: Subqueries

src/subquery_set.sql

select distinct person
from work
where person not in (
    select distinct person
    from work
    where job = 'calibrate'
);

out/subquery_set.out

| person |
|--------|
| po     |
| tay    |

Defining a Primary Key

src/primary_key.sql

create table lab_equipment (
    size real not null,
    color text not null,
    num integer not null,
    primary key (size, color)
);

insert into lab_equipment values
(1.5, 'blue', 2),
(1.5, 'green', 1),
(2.5, 'blue', 1);

select * from lab_equipment;

insert into lab_equipment values
(1.5, 'green', 2);

out/primary_key.out

| size | color | num |
|------|-------|-----|
| 1.5  | blue  | 2   |
| 1.5  | green | 1   |
| 2.5  | blue  | 1   |
Runtime error near line 17: UNIQUE constraint failed: lab_equipment.size, lab_equipment.color (19)

Practice

Exercise 24: Does the penguins table have a primary key? If so, what is it? What about the work and job tables?

40: Autoincrementing and Primary Keys

src/autoincrement.sql

create table person (
    ident integer primary key autoincrement,
    name text not null
);
insert into person values
(null, 'mik'),
(null, 'po'),
(null, 'tay');
select * from person;
insert into person values (1, 'prevented');

out/autoincrement.out

| ident | name |
|-------|------|
| 1     | mik  |
| 2     | po   |
| 3     | tay  |
Runtime error near line 12: UNIQUE constraint failed: person.ident (19)

Internal Tables

src/sequence_table.sql

select * from sqlite_sequence;

out/sequence_table.out

|  name  | seq |
|--------|-----|
| person | 3   |

Practice

Exercise 25: Are you able to modify the values stored in sqlite_sequence? In particular, are you able to reset the values so that the same sequence numbers are generated again?

41: Altering Tables

src/alter_tables.sql

alter table job
add ident integer not null default -1;

update job
set ident = 1
where name = 'calibrate';

update job
set ident = 2
where name = 'clean';

select * from job;

out/alter_tables.out

|   name    | billable | ident |
|-----------|----------|-------|
| calibrate | 1.5      | 1     |
| clean     | 0.5      | 2     |

M-to-N Relationships

42: Creating New Tables from Old

src/insert_select.sql

create table new_work (
    person_id integer not null,
    job_id integer not null,
    foreign key (person_id) references person (ident),
    foreign key (job_id) references job (ident)
);

insert into new_work
select
    person.ident as person_id,
    job.ident as job_id
from
    (person inner join work on person.name = work.person)
    inner join job on job.name = work.job;
select * from new_work;

out/insert_select.out

| person_id | job_id |
|-----------|--------|
| 1         | 1      |
| 1         | 2      |
| 2         | 2      |

43: Removing Tables

src/drop_table.sql

drop table work;
alter table new_work rename to work;

out/drop_table.out

CREATE TABLE job (
    ident integer primary key autoincrement,
    name text not null,
    billable real not null
);
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person (
    ident integer primary key autoincrement,
    name text not null
);
CREATE TABLE IF NOT EXISTS "work" (
    person_id integer not null,
    job_id integer not null,
    foreign key(person_id) references person(ident),
    foreign key(job_id) references job(ident)
);

Practice

Exercise 26: Reorganize the penguins database:

  1. Make a copy of the penguins.db file so that your changes won’t affect the original.

  2. Write a SQL script that reorganizes the data into three tables: one for each island.

  3. Why is organizing data like this a bad idea?

Exercise 27: Tools like Sqitch can manage changes to database schemas and data so that they can be saved in version control and rolled back if they are unsuccessful. Translate the changes made by the scripts above into Sqitch. Note: this exercise may take an hour or more.

44: Comparing Individual Values to Aggregates

src/compare_individual_aggregate.sql

select body_mass_g
from penguins
where
    body_mass_g > (
        select avg(body_mass_g)
        from penguins
    )
limit 5;

out/compare_individual_aggregate.out

| body_mass_g |
|-------------|
| 4675.0      |
| 4250.0      |
| 4400.0      |
| 4500.0      |
| 4650.0      |

Practice

Exercise 28: Use a subquery to find the number of penguins that weigh the same as the lightest penguin.

45: Comparing Individual Values to Aggregates Within Groups

src/compare_within_groups.sql

select
    penguins.species,
    penguins.body_mass_g,
    round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins inner join (
    select
        species,
        avg(body_mass_g) as avg_mass_g
    from penguins
    group by species
) as averaged
    on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;

out/compare_within_groups.out

| species | body_mass_g | avg_mass_g |
|---------|-------------|------------|
| Adelie  | 3750.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
| Adelie  | 4675.0      | 3700.7     |
| Adelie  | 4250.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |

Practice

Exercise 29: Use a subquery to find the number of penguins that weigh the same as the lightest penguin of the same sex and species.

46: Common Table Expressions

src/common_table_expressions.sql

with grouped as (
    select
        species,
        avg(body_mass_g) as avg_mass_g
    from penguins
    group by species
)

select
    penguins.species,
    penguins.body_mass_g,
    round(grouped.avg_mass_g, 1) as avg_mass_g
from penguins inner join grouped
where penguins.body_mass_g > grouped.avg_mass_g
limit 5;

out/common_table_expressions.out

| species | body_mass_g | avg_mass_g |
|---------|-------------|------------|
| Adelie  | 3750.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
| Adelie  | 4675.0      | 3700.7     |
| Adelie  | 4250.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |

Explaining Query Plans

src/explain_query_plan.sql

explain query plan
select
    species,
    avg(body_mass_g)
from penguins
group by species;

out/explain_query_plan.out

QUERY PLAN
|--SCAN penguins
`--USE TEMP B-TREE FOR GROUP BY

Practice

Exercise 30: Use a CTE to find the number of penguins that weigh the same as the lightest penguin of the same sex and species.

47: Enumerating Rows

src/rowid.sql

select
    rowid,
    species,
    island
from penguins
limit 5;

out/rowid.out

| rowid | species |  island   |
|-------|---------|-----------|
| 1     | Adelie  | Torgersen |
| 2     | Adelie  | Torgersen |
| 3     | Adelie  | Torgersen |
| 4     | Adelie  | Torgersen |
| 5     | Adelie  | Torgersen |

Practice

Exercise 31: To explore how row IDs behave:

  1. Suppose that you create a new table, add three rows, delete those rows, and add the same values again. Do you expect the row IDs of the final rows to be 1–3 or 4–6?

  2. Using an in-memory database, perform the steps in part 1. Was the result what you expected?

48: Conditionals

src/if_else.sql

with sized_penguins as (
    select
        species,
        iif(
            body_mass_g < 3500,
            'small',
            'large'
        ) as size
    from penguins
    where body_mass_g is not null
)

select
    species,
    size,
    count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/if_else.out

|  species  | size  | num |
|-----------|-------|-----|
| Adelie    | small | 54  |
| Adelie    | large | 97  |
| Chinstrap | small | 17  |
| Chinstrap | large | 51  |
| Gentoo    | large | 123 |

Practice

Exercise 32: How does the result of the previous query change if the check for null body mass is removed? Why is the result without that check misleading?

Exercise 33: What does each of the expressions shown below produce? Which ones do you think actually attempt to divide by zero?

  1. iif(0, 123, 1/0)
  2. iif(1, 123, 1/0)
  3. iif(0, 1/0, 123)
  4. iif(1, 1/0, 123)

49: Selecting a Case

src/case_when.sql

with sized_penguins as (
    select
        species,
        case
            when body_mass_g < 3500 then 'small'
            when body_mass_g < 5000 then 'medium'
            else 'large'
        end as size
    from penguins
    where body_mass_g is not null
)

select
    species,
    size,
    count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/case_when.out

|  species  |  size  | num |
|-----------|--------|-----|
| Adelie    | small  | 54  |
| Adelie    | medium | 97  |
| Chinstrap | small  | 17  |
| Chinstrap | medium | 51  |
| Gentoo    | medium | 56  |
| Gentoo    | large  | 67  |

Practice

Exercise 34: Modify the query above so that the outputs are "penguin is small" and "penguin is large" by concatenating the string "penguin is " to the entire case rather than to the individual when branches. (This exercise shows that case/when is an expression rather than a statement.)

50: Checking a Range

src/check_range.sql

with sized_penguins as (
    select
        species,
        case
            when body_mass_g between 3500 and 5000 then 'normal'
            else 'abnormal'
        end as size
    from penguins
    where body_mass_g is not null
)

select
    species,
    size,
    count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/check_range.out

|  species  |   size   | num |
|-----------|----------|-----|
| Adelie    | abnormal | 54  |
| Adelie    | normal   | 97  |
| Chinstrap | abnormal | 17  |
| Chinstrap | normal   | 51  |
| Gentoo    | abnormal | 61  |
| Gentoo    | normal   | 62  |

Practice

Exercise 35: The expression val between 'A' and 'Z' is true if val is 'M' (upper case) but false if val is 'm' (lower case). Rewrite the expression using SQLite’s built-in scalar functions so that it is true in both cases.

name purpose
substr Get substring given starting point and length
trim Remove characters from beginning and end of string
ltrim Remove characters from beginning of string
rtrim Remove characters from end of string
length Length of string
replace Replace occurrences of substring with another string
upper Return upper-case version of string
lower Return lower-case version of string
instr Find location of first occurrence of substring (returns 0 if not found)

Yet Another Database

table-level diagram of assay database showing primary and foreign key relationships
Figure 7: assay database table diagram
entity-relationship diagram showing logical structure of assay database
Figure 8: assay ER diagram

src/assay_staff.sql

select * from staff;

out/assay_staff.out

| ident | personal |  family   | dept | age |
|-------|----------|-----------|------|-----|
| 1     | Kartik   | Gupta     |      | 46  |
| 2     | Divit    | Dhaliwal  | hist | 34  |
| 3     | Indrans  | Sridhar   | mb   | 47  |
| 4     | Pranay   | Khanna    | mb   | 51  |
| 5     | Riaan    | Dua       |      | 23  |
| 6     | Vedika   | Rout      | hist | 45  |
| 7     | Abram    | Chokshi   | gen  | 23  |
| 8     | Romil    | Kapoor    | hist | 38  |
| 9     | Ishaan   | Ramaswamy | mb   | 35  |
| 10    | Nitya    | Lal       | gen  | 52  |

Practice

Exercise 36: Draw a table diagram and an ER diagram to represent the following database:

51: Pattern Matching

src/like_glob.sql

select
    personal,
    family
from staff
where personal like '%ya%';

out/like_glob.out

| personal | family |
|----------|--------|
| Nitya    | Lal    |

Practice

Exercise 37: Rewrite the pattern-matching query shown above using glob.

52: Selecting First and Last Rows

src/union_all.sql

select * from (
    select * from (select * from experiment order by started asc limit 5)
    union all
    select * from (select * from experiment order by started desc limit 5)
)
order by started asc;

out/union_all.out

| ident |    kind     |  started   |   ended    |
|-------|-------------|------------|------------|
| 17    | trial       | 2023-01-29 | 2023-01-30 |
| 35    | calibration | 2023-01-30 | 2023-01-30 |
| 36    | trial       | 2023-02-02 | 2023-02-03 |
| 25    | trial       | 2023-02-12 | 2023-02-14 |
| 2     | calibration | 2023-02-14 | 2023-02-14 |
| 40    | calibration | 2024-01-21 | 2024-01-21 |
| 12    | trial       | 2024-01-26 | 2024-01-28 |
| 44    | trial       | 2024-01-27 | 2024-01-29 |
| 34    | trial       | 2024-02-01 | 2024-02-02 |
| 14    | calibration | 2024-02-03 | 2024-02-03 |

Practice

Exercise 38: Write a query whose result includes two rows for each Adelie penguin in the penguins database. How can you check that your query is working correctly?

53: Intersection

src/intersect.sql

select
    personal,
    family,
    dept,
    age
from staff
where dept = 'mb'
intersect
select
    personal,
    family,
    dept,
    age from staff
where age < 50;

out/intersect.out

| personal |  family   | dept | age |
|----------|-----------|------|-----|
| Indrans  | Sridhar   | mb   | 47  |
| Ishaan   | Ramaswamy | mb   | 35  |

Practice

Exercise 39: Use intersect to find all Adelie penguins that weigh more than 4000 grams. How can you check that your query is working correctly?

Exercise 40: Use explain query plan to compare the intersect-based query you just wrote with one that uses where. Which query looks like it will be more efficient? Why do you believe this?

54: Exclusion

src/except.sql

select
    personal,
    family,
    dept,
    age
from staff
where dept = 'mb'
except
    select
        personal,
        family,
        dept,
        age from staff
    where age < 50;

out/except.out

| personal | family | dept | age |
|----------|--------|------|-----|
| Pranay   | Khanna | mb   | 51  |

Practice

Exercise 41: Use exclude to find all Gentoo penguins that aren’t male. How can you check that your query is working correctly?

55: Random Numbers and Why Not

src/random_numbers.sql

with decorated as (
    select random() as rand,
    personal || ' ' || family as name
    from staff
)

select
    rand,
    abs(rand) % 10 as selector,
    name
from decorated
where selector < 5;

out/random_numbers.out

|         rand         | selector |      name      |
|----------------------|----------|----------------|
| 1175447513360584797  | 6        | Divit Dhaliwal |
| -1544776661445066577 | 5        | Pranay Khanna  |
| -9091755230464281578 | 1        | Riaan Dua      |
| -3436533101074958267 | 6        | Romil Kapoor   |
| -1361395956632904964 | 7        | Nitya Lal      |

Practice

Exercise 42: Write a query that:

56: Creating an Index

src/create_use_index.sql

explain query plan
select filename
from plate
where filename like '%07%';

create index plate_file on plate(filename);

explain query plan
select filename
from plate
where filename like '%07%';

out/create_use_index.out

QUERY PLAN
`--SCAN plate USING COVERING INDEX sqlite_autoindex_plate_1
QUERY PLAN
`--SCAN plate USING COVERING INDEX plate_file

57: Generating Sequences

src/generate_sequence.sql

select value from generate_series(1, 5);

out/generate_sequence.out

| value |
|-------|
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |

58: Generating Sequences Based on Data

src/data_range_sequence.sql

create table temp (
    num integer not null
);
insert into temp values (1), (5);
select value from generate_series (
    (select min(num) from temp),
    (select max(num) from temp)
);

out/data_range_sequence.out

| value |
|-------|
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |

59: Generating Sequences of Dates

src/date_sequence.sql

select date((select julianday(min(started)) from experiment) + value) as some_day
from (
    select value from generate_series(
        (select 0),
        (select julianday(max(started)) - julianday(min(started)) from experiment)
    )
)
limit 5;

out/date_sequence.out

|  some_day  |
|------------|
| 2023-01-29 |
| 2023-01-30 |
| 2023-01-31 |
| 2023-02-01 |
| 2023-02-02 |

60: Counting Experiments Started per Day Without Gaps

src/experiments_per_day.sql

with
-- complete sequence of days with 0 as placeholder for number of experiments
all_days as (
    select
        date((select julianday(min(started)) from experiment) + value) as some_day,
        0 as zeroes
    from (
        select value from generate_series(
            (select 0),
            (select count(*) - 1 from experiment)
        )
    )
),

-- sequence of actual days with actual number of experiments started
actual_days as (
    select
        started,
        count(started) as num_exp
    from experiment
    group by started
)

-- combined by joining on day and taking actual number (if available) or zero
select
    all_days.some_day as day,
    coalesce(actual_days.num_exp, all_days.zeroes) as num_exp
from
    all_days left join actual_days on all_days.some_day = actual_days.started
limit 5;

out/experiments_per_day.out

|    day     | num_exp |
|------------|---------|
| 2023-01-29 | 1       |
| 2023-01-30 | 1       |
| 2023-01-31 | 0       |
| 2023-02-01 | 0       |
| 2023-02-02 | 1       |

Practice

Exercise 43: What does the expression date('now', 'start of month', '+1 month', '-1 day') produce? (You may find the documentation on SQLite’s date and time functions helpful.)

61: Self Join

src/self_join.sql

with person as (
    select
        ident,
        personal || ' ' || family as name
    from staff
)

select
    left_person.name,
    right_person.name
from person as left_person inner join person as right_person
limit 10;

out/self_join.out

|     name     |       name       |
|--------------|------------------|
| Kartik Gupta | Kartik Gupta     |
| Kartik Gupta | Divit Dhaliwal   |
| Kartik Gupta | Indrans Sridhar  |
| Kartik Gupta | Pranay Khanna    |
| Kartik Gupta | Riaan Dua        |
| Kartik Gupta | Vedika Rout      |
| Kartik Gupta | Abram Chokshi    |
| Kartik Gupta | Romil Kapoor     |
| Kartik Gupta | Ishaan Ramaswamy |
| Kartik Gupta | Nitya Lal        |

62: Generating Unique Pairs

src/unique_pairs.sql

with person as (
    select
        ident,
        personal || ' ' || family as name
    from staff
)

select
    left_person.name,
    right_person.name
from person as left_person inner join person as right_person
on left_person.ident < right_person.ident
where left_person.ident <= 4 and right_person.ident <= 4;

out/unique_pairs.out

|      name       |      name       |
|-----------------|-----------------|
| Kartik Gupta    | Divit Dhaliwal  |
| Kartik Gupta    | Indrans Sridhar |
| Kartik Gupta    | Pranay Khanna   |
| Divit Dhaliwal  | Indrans Sridhar |
| Divit Dhaliwal  | Pranay Khanna   |
| Indrans Sridhar | Pranay Khanna   |

63: Filtering Pairs

src/filter_pairs.sql

with
person as (
    select
        ident,
        personal || ' ' || family as name
    from staff
),

together as (
    select
        left_perf.staff as left_staff,
        right_perf.staff as right_staff
    from performed as left_perf inner join performed as right_perf
        on left_perf.experiment = right_perf.experiment
    where left_staff < right_staff
)

select
    left_person.name as person_1,
    right_person.name as person_2
from person as left_person inner join person as right_person join together
    on left_person.ident = left_staff and right_person.ident = right_staff;

out/filter_pairs.out

|    person_1     |     person_2     |
|-----------------|------------------|
| Kartik Gupta    | Vedika Rout      |
| Pranay Khanna   | Vedika Rout      |
| Indrans Sridhar | Romil Kapoor     |
| Abram Chokshi   | Ishaan Ramaswamy |
| Pranay Khanna   | Vedika Rout      |
| Kartik Gupta    | Abram Chokshi    |
| Abram Chokshi   | Romil Kapoor     |
| Kartik Gupta    | Divit Dhaliwal   |
| Divit Dhaliwal  | Abram Chokshi    |
| Pranay Khanna   | Ishaan Ramaswamy |
| Indrans Sridhar | Romil Kapoor     |
| Kartik Gupta    | Ishaan Ramaswamy |
| Kartik Gupta    | Nitya Lal        |
| Kartik Gupta    | Abram Chokshi    |
| Pranay Khanna   | Romil Kapoor     |

64: Existence and Correlated Subqueries

src/correlated_subquery.sql

select
    name,
    building
from department
where
    exists (
        select 1
        from staff
        where dept = department.ident
    )
order by name;

out/correlated_subquery.out

|       name        |     building     |
|-------------------|------------------|
| Genetics          | Chesson          |
| Histology         | Fashet Extension |
| Molecular Biology | Chesson          |

65: Nonexistence

src/nonexistence.sql

select
    name,
    building
from department
where
    not exists (
        select 1
        from staff
        where dept = department.ident
    )
order by name;

out/nonexistence.out

|     name      | building |
|---------------|----------|
| Endocrinology | TGVH     |

Practice

Exercise 44: Can you rewrite the previous query using exclude? If so, is your new query easier to understand? If the query cannot be rewritten, why not?

Avoiding Correlated Subqueries

src/avoid_correlated_subqueries.sql

select distinct
    department.name as name,
    department.building as building
from department inner join staff
    on department.ident = staff.dept
order by name;

out/avoid_correlated_subqueries.out

|       name        |     building     |
|-------------------|------------------|
| Genetics          | Chesson          |
| Histology         | Fashet Extension |
| Molecular Biology | Chesson          |

66: Lead and Lag

src/lead_lag.sql

with ym_num as (
    select
        strftime('%Y-%m', started) as ym,
        count(*) as num
    from experiment
    group by ym
)

select
    ym,
    lag(num) over (order by ym) as prev_num,
    num,
    lead(num) over (order by ym) as next_num
from ym_num
order by ym;

out/lead_lag.out

|   ym    | prev_num | num | next_num |
|---------|----------|-----|----------|
| 2023-01 |          | 2   | 5        |
| 2023-02 | 2        | 5   | 5        |
| 2023-03 | 5        | 5   | 1        |
| 2023-04 | 5        | 1   | 6        |
| 2023-05 | 1        | 6   | 5        |
| 2023-06 | 6        | 5   | 3        |
| 2023-07 | 5        | 3   | 2        |
| 2023-08 | 3        | 2   | 4        |
| 2023-09 | 2        | 4   | 6        |
| 2023-10 | 4        | 6   | 4        |
| 2023-12 | 6        | 4   | 5        |
| 2024-01 | 4        | 5   | 2        |
| 2024-02 | 5        | 2   |          |

Boundaries

67: Windowing Functions

src/window_functions.sql

with ym_num as (
    select
        strftime('%Y-%m', started) as ym,
        count(*) as num
    from experiment
    group by ym
)

select
    ym,
    num,
    sum(num) over (order by ym) as num_done,
    (sum(num) over (order by ym) * 1.00) / (select sum(num) from ym_num) as completed_progress, 
    cume_dist() over (order by ym) as linear_progress
from ym_num
order by ym;

out/window_functions.out

|   ym    | num | num_done | completed_progress |  linear_progress   |
|---------|-----|----------|--------------------|--------------------|
| 2023-01 | 2   | 2        | 0.04               | 0.0769230769230769 |
| 2023-02 | 5   | 7        | 0.14               | 0.153846153846154  |
| 2023-03 | 5   | 12       | 0.24               | 0.230769230769231  |
| 2023-04 | 1   | 13       | 0.26               | 0.307692307692308  |
| 2023-05 | 6   | 19       | 0.38               | 0.384615384615385  |
| 2023-06 | 5   | 24       | 0.48               | 0.461538461538462  |
| 2023-07 | 3   | 27       | 0.54               | 0.538461538461538  |
| 2023-08 | 2   | 29       | 0.58               | 0.615384615384615  |
| 2023-09 | 4   | 33       | 0.66               | 0.692307692307692  |
| 2023-10 | 6   | 39       | 0.78               | 0.769230769230769  |
| 2023-12 | 4   | 43       | 0.86               | 0.846153846153846  |
| 2024-01 | 5   | 48       | 0.96               | 0.923076923076923  |
| 2024-02 | 2   | 50       | 1.0                | 1.0                |

Explaining Another Query Plan

src/explain_window_function.sql

explain query plan
with ym_num as (
    select
        strftime('%Y-%m', started) as ym,
        count(*) as num
    from experiment
    group by ym
)
select
    ym,
    num,
    sum(num) over (order by ym) as num_done,
    cume_dist() over (order by ym) as progress
from ym_num
order by ym;

out/explain_window_function.out

QUERY PLAN
|--CO-ROUTINE (subquery-3)
|  |--CO-ROUTINE (subquery-4)
|  |  |--CO-ROUTINE ym_num
|  |  |  |--SCAN experiment
|  |  |  `--USE TEMP B-TREE FOR GROUP BY
|  |  |--SCAN ym_num
|  |  `--USE TEMP B-TREE FOR ORDER BY
|  `--SCAN (subquery-4)
`--SCAN (subquery-3)

68: Partitioned Windows

src/partition_window.sql

with y_m_num as (
    select
        strftime('%Y', started) as year,
        strftime('%m', started) as month,
        count(*) as num
    from experiment
    group by year, month
)

select
    year,
    month,
    num,
    sum(num) over (partition by year order by month) as num_done
from y_m_num
order by year, month;

out/partition_window.out

| year | month | num | num_done |
|------|-------|-----|----------|
| 2023 | 01    | 2   | 2        |
| 2023 | 02    | 5   | 7        |
| 2023 | 03    | 5   | 12       |
| 2023 | 04    | 1   | 13       |
| 2023 | 05    | 6   | 19       |
| 2023 | 06    | 5   | 24       |
| 2023 | 07    | 3   | 27       |
| 2023 | 08    | 2   | 29       |
| 2023 | 09    | 4   | 33       |
| 2023 | 10    | 6   | 39       |
| 2023 | 12    | 4   | 43       |
| 2024 | 01    | 5   | 5        |
| 2024 | 02    | 2   | 7        |

Practice

Exercise 45: Create a query that:

  1. finds the unique weights of the penguins in the penguins database;

  2. sorts them;

  3. finds the difference between each successive distinct weight; and

  4. counts how many times each unique difference appears.

69: Blobs

src/blob.sql

create table images (
    name text not null,
    content blob
);

insert into images (name, content) values
('biohazard', readfile('res/img/biohazard.png')),
('crush', readfile('res/img/crush.png')),
('fire', readfile('res/img/fire.png')),
('radioactive', readfile('res/img/radioactive.png')),
('tripping', readfile('res/img/tripping.png'));

select
    name,
    length(content)
from images;

out/blob.out

|    name     | length(content) |
|-------------|-----------------|
| biohazard   | 19629           |
| crush       | 15967           |
| fire        | 18699           |
| radioactive | 16661           |
| tripping    | 17208           |

Practice

Exercise 46: Modify the query shown above to select the value of content rather than its length. How intelligible is the output? Does using SQLite’s hex() function make it any more readable?

Yet Another Database

src/lab_log_db.sh

sqlite3 db/lab_log.db

src/lab_log_schema.sql

.schema

out/lab_log_schema.out

CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person(
       ident            integer primary key autoincrement,
       details          text not null
);
CREATE TABLE machine(
       ident            integer primary key autoincrement,
       name             text not null,
       details          text not null
);
CREATE TABLE usage(
       ident            integer primary key autoincrement,
       log              text not null
);

70: Storing JSON

src/json_in_table.sql

select * from machine;

out/json_in_table.out

| ident |      name      |                         details                         |
|-------|----------------|---------------------------------------------------------|
| 1     | WY401          | {"acquired": "2023-05-01"}                              |
| 2     | Inphormex      | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} |
| 3     | AutoPlate 9000 | {"note": "needs software update"}                       |

71: Select Fields from JSON

src/json_field.sql

select
    details->'$.acquired' as single_arrow,
    details->>'$.acquired' as double_arrow
from machine;

out/json_field.out

| single_arrow | double_arrow |
|--------------|--------------|
| "2023-05-01" | 2023-05-01   |
| "2021-07-15" | 2021-07-15   |
|              |              |

Practice

Exercise 47: Write a query that selects the year from the "refurbished" field of the JSON data associated with the Inphormex plate reader.

72: JSON Array Access

src/json_array.sql

select
    ident,
    json_array_length(log->'$') as length,
    log->'$[0]' as first
from usage;

out/json_array.out

| ident | length |                            first                             |
|-------|--------|--------------------------------------------------------------|
| 1     | 4      | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 5      | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 3     | 2      | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |
| 4     | 1      | {"machine":"sterilizer","person":["Maude","Goulet"]}         |
| 5     | 2      | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} |
| 6     | 1      | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 7     | 3      | {"machine":"WY401","person":["Maude","Goulet"]}              |
| 8     | 1      | {"machine":"AutoPlate 9000"}                                 |

73: Unpacking JSON Arrays

src/json_unpack.sql

select
    ident,
    json_each.key as key,
    json_each.value as value
from usage, json_each(usage.log)
limit 10;

out/json_unpack.out

| ident | key |                            value                             |
|-------|-----|--------------------------------------------------------------|
| 1     | 0   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 1   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 2   | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]}       |
| 1     | 3   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 0   | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 2     | 1   | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]}  |
| 2     | 2   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 2     | 3   | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} |
| 2     | 4   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 3     | 0   | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |

Practice

Exercise 48: Write a query that counts how many times each person appears in the first log entry associated with any piece of equipment.

74: Selecting the Last Element of an Array

src/json_array_last.sql

select
    ident,
    log->'$[#-1].machine' as final
from usage
limit 5;

out/json_array_last.out

| ident |    final     |
|-------|--------------|
| 1     | "Inphormex"  |
| 2     | "sterilizer" |
| 3     | "Inphormex"  |
| 4     | "sterilizer" |
| 5     | "sterilizer" |

75: Modifying JSON

src/json_modify.sql

select
    ident,
    name,
    json_set(details, '$.sold', json_quote('2024-01-25')) as updated
from machine;

out/json_modify.out

| ident |      name      |                           updated                            |
|-------|----------------|--------------------------------------------------------------|
| 1     | WY401          | {"acquired":"2023-05-01","sold":"2024-01-25"}                |
| 2     | Inphormex      | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":" |
|       |                | 2024-01-25"}                                                 |
| 3     | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"}         |

Practice

Exercise 49: As part of cleaning up the lab log database, replace the machine names in the JSON records in usage with the corresopnding machine IDs from the machine table.

Refreshing the Penguins Database

src/count_penguins.sql

select
    species,
    count(*) as num
from penguins
group by species;

out/count_penguins.out

|  species  | num |
|-----------|-----|
| Adelie    | 152 |
| Chinstrap | 68  |
| Gentoo    | 124 |

76: Tombstones

src/make_active.sql

alter table penguins
add active integer not null default 1;

update penguins
set active = iif(species = 'Adelie', 0, 1);

src/active_penguins.sql

select
    species,
    count(*) as num
from penguins
where active
group by species;

out/active_penguins.out

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |

Importing CSV Data

src/create_penguins.sql

drop table if exists penguins;
.mode csv penguins
.import misc/penguins.csv penguins
update penguins set species = null where species = '';
update penguins set island = null where island = '';
update penguins set bill_length_mm = null where bill_length_mm = '';
update penguins set bill_depth_mm = null where bill_depth_mm = '';
update penguins set flipper_length_mm = null where flipper_length_mm = '';
update penguins set body_mass_g = null where body_mass_g = '';
update penguins set sex = null where sex = '';

Practice

Exercise 50: What are the data types of the columns in the penguins table created by the CSV import shown above? How can you correct the ones that need correcting?

77: Views

src/views.sql

create view if not exists
active_penguins (
    species,
    island,
    bill_length_mm,
    bill_depth_mm,
    flipper_length_mm,
    body_mass_g,
    sex
) as
select
    species,
    island,
    bill_length_mm,
    bill_depth_mm,
    flipper_length_mm,
    body_mass_g,
    sex
from penguins
where active;

select
    species,
    count(*) as num
from active_penguins
group by species;

out/views.out

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |

Practice

Exercise 51: Create a view in the lab log database called busy with two columns: machine_id and total_log_length. The first column records the numeric ID of each machine; the second shows the total number of log entries for that machine.

Check Understanding

box and arrow diagram showing different kinds of temporary 'tables' in SQL
Figure 9: temporary tables

Hours Reminder

src/all_jobs.sql

create table job (
    name text not null,
    billable real not null
);
insert into job values
('calibrate', 1.5),
('clean', 0.5);
select * from job;

out/all_jobs.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
| clean     | 0.5      |

78: Adding Checks

src/all_jobs_check.sql

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);
insert into job values ('calibrate', 1.5);
insert into job values ('reset', -0.5);
select * from job;

out/all_jobs_check.out

Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

Practice

Exercise 52: Rewrite the definition of the penguins table to add the following constraints:

  1. body_mass_g must be null or non-negative.

  2. island must be one of "Biscoe", "Dream", or "Torgersen". (Hint: the in operator will be useful here.)

ACID

79: Transactions

src/transaction.sql

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);

insert into job values ('calibrate', 1.5);

begin transaction;
insert into job values ('clean', 0.5);
rollback;

select * from job;

out/transaction.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

80: Rollback in Constraints

src/rollback_constraint.sql

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0) on conflict rollback
);

insert into job values
    ('calibrate', 1.5);
insert into job values
    ('clean', 0.5),
    ('reset', -0.5);

select * from job;

out/rollback_constraint.out

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

81: Rollback in Statements

src/rollback_statement.sql

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);

insert or rollback into job values
('calibrate', 1.5);
insert or rollback into job values
('clean', 0.5),
('reset', -0.5);

select * from job;

out/rollback_statement.out

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

82: Upsert

src/upsert.sql

create table jobs_done (
    person text unique,
    num integer default 0
);

insert into jobs_done values
('zia', 1);
.print 'after first'
select * from jobs_done;
.print


insert into jobs_done values
('zia', 1);
.print 'after failed'
select * from jobs_done;

insert into jobs_done values
('zia', 1)
on conflict(person) do update set num = num + 1;
.print '\nafter upsert'
select * from jobs_done;

out/upsert.out

after first
| person | num |
|--------|-----|
| zia    | 1   |

Runtime error near line 15: UNIQUE constraint failed: jobs_done.person (19)
after failed
| person | num |
|--------|-----|
| zia    | 1   |
\nafter upsert
| person | num |
|--------|-----|
| zia    | 2   |

Practice

Exercise 53: Using the assay database, write a query that adds or modifies people in the staff table as shown:

personal family dept age
Pranay Khanna mb 41
Riaan Dua gen 23
Parth Johel gen 27

Normalization

83: Creating Triggers

src/trigger_setup.sql

-- Track hours of lab work.
create table job (
    person text not null,
    reported real not null check (reported >= 0.0)
);

-- Explicitly store per-person total rather than using sum().
create table total (
    person text unique not null,
    hours real
);

-- Initialize totals.
insert into total values
('gene', 0.0),
('august', 0.0);

-- Define a trigger.
create trigger total_trigger
before insert on job
begin
    -- Check that the person exists.
    select case
        when not exists (select 1 from total where person = new.person)
        then raise(rollback, 'Unknown person ')
    end;
    -- Update their total hours (or fail if non-negative constraint violated).
    update total
    set hours = hours + new.reported
    where total.person = new.person;
end;

84: Trigger Not Firing

src/trigger_successful.sql

insert into job values
('gene', 1.5),
('august', 0.5),
('gene', 1.0);

out/trigger_successful.out

| person | reported |
|--------|----------|
| gene   | 1.5      |
| august | 0.5      |
| gene   | 1.0      |

| person | hours |
|--------|-------|
| gene   | 2.5   |
| august | 0.5   |

85: Trigger Firing

src/trigger_firing.sql

insert into job values
('gene', 1.0),
('august', -1.0);

out/trigger_firing.out

Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19)

| person | hours |
|--------|-------|
| gene   | 0.0   |
| august | 0.0   |

Practice

Exercise 54: Using the penguins database:

  1. create a table called species with columns name and count; and

  2. define a trigger that increments the count associated with each species each time a new penguin is added to the penguins table.

Does your solution behave correctly when several penguins are added by a single insert statement?

Represent Graphs

src/lineage_setup.sql

create table lineage (
    parent text not null,
    child text not null
);
insert into lineage values
('Arturo', 'Clemente'),
('Darío', 'Clemente'),
('Clemente', 'Homero'),
('Clemente', 'Ivonne'),
('Ivonne', 'Lourdes'),
('Soledad', 'Lourdes'),
('Lourdes', 'Santiago');

src/represent_graph.sql

select * from lineage;

out/represent_graph.out

|  parent  |  child   |
|----------|----------|
| Arturo   | Clemente |
| Darío    | Clemente |
| Clemente | Homero   |
| Clemente | Ivonne   |
| Ivonne   | Lourdes  |
| Soledad  | Lourdes  |
| Lourdes  | Santiago |
box and arrow diagram showing who is descended from whom in the lineage database
Figure 10: lineage diagram

Practice

Exercise 55: Write a query that uses a self join to find every person’s grandchildren.

86: Recursive Queries

src/recursive_lineage.sql

with recursive descendent as (
    select
        'Clemente' as person,
        0 as generations
    union all
    select
        lineage.child as person,
        descendent.generations + 1 as generations
    from descendent inner join lineage
        on descendent.person = lineage.parent
)

select
    person,
    generations
from descendent;

out/recursive_lineage.out

|  person  | generations |
|----------|-------------|
| Clemente | 0           |
| Homero   | 1           |
| Ivonne   | 1           |
| Lourdes  | 2           |
| Santiago | 3           |

Practice

Exercise 56: Modify the recursive query shown above to use union instead of union all. Does this affect the result? Why or why not?

Contact Tracing Database

src/contact_person.sql

select * from person;

out/contact_person.out

| ident |         name          |
|-------|-----------------------|
| 1     | Juana Baeza           |
| 2     | Agustín Rodríquez     |
| 3     | Ariadna Caraballo     |
| 4     | Micaela Laboy         |
| 5     | Verónica Altamirano   |
| 6     | Reina Rivero          |
| 7     | Elias Merino          |
| 8     | Minerva Guerrero      |
| 9     | Mauro Balderas        |
| 10    | Pilar Alarcón         |
| 11    | Daniela Menéndez      |
| 12    | Marco Antonio Barrera |
| 13    | Cristal Soliz         |
| 14    | Bernardo Narváez      |
| 15    | Óscar Barrios         |

src/contact_contacts.sql

select * from contact;

out/contact_contacts.out

|       left        |         right         |
|-------------------|-----------------------|
| Agustín Rodríquez | Ariadna Caraballo     |
| Agustín Rodríquez | Verónica Altamirano   |
| Juana Baeza       | Verónica Altamirano   |
| Juana Baeza       | Micaela Laboy         |
| Pilar Alarcón     | Reina Rivero          |
| Cristal Soliz     | Marco Antonio Barrera |
| Cristal Soliz     | Daniela Menéndez      |
| Daniela Menéndez  | Marco Antonio Barrera |
box and line diagram showing who has had contact with whom
Figure 11: contact diagram

87: Bidirectional Contacts

src/bidirectional.sql

create temporary table bi_contact (
    left text,
    right text
);

insert into bi_contact
select
    left, right from contact
    union all
    select right, left from contact
;

out/bidirectional.out

| original_count |
|----------------|
| 8              |

| num_contact |
|-------------|
| 16          |

88: Updating Group Identifiers

src/update_group_ids.sql

select
    left.name as left_name,
    left.ident as left_ident,
    right.name as right_name,
    right.ident as right_ident,
    min(left.ident, right.ident) as new_ident
from
    (person as left join bi_contact on left.name = bi_contact.left)
    join person as right on bi_contact.right = right.name;

out/update_group_ids.out

|       left_name       | left_ident |      right_name       | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza           | 1          | Micaela Laboy         | 4           | 1         |
| Juana Baeza           | 1          | Verónica Altamirano   | 5           | 1         |
| Agustín Rodríquez     | 2          | Ariadna Caraballo     | 3           | 2         |
| Agustín Rodríquez     | 2          | Verónica Altamirano   | 5           | 2         |
| Ariadna Caraballo     | 3          | Agustín Rodríquez     | 2           | 2         |
| Micaela Laboy         | 4          | Juana Baeza           | 1           | 1         |
| Verónica Altamirano   | 5          | Agustín Rodríquez     | 2           | 2         |
| Verónica Altamirano   | 5          | Juana Baeza           | 1           | 1         |
| Reina Rivero          | 6          | Pilar Alarcón         | 10          | 6         |
| Pilar Alarcón         | 10         | Reina Rivero          | 6           | 6         |
| Daniela Menéndez      | 11         | Cristal Soliz         | 13          | 11        |
| Daniela Menéndez      | 11         | Marco Antonio Barrera | 12          | 11        |
| Marco Antonio Barrera | 12         | Cristal Soliz         | 13          | 12        |
| Marco Antonio Barrera | 12         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Marco Antonio Barrera | 12          | 12        |

89: Recursive Labeling

src/recursive_labeling.sql

with recursive labeled as (
    select
        person.name as name,
        person.ident as label
    from
        person
    union -- not 'union all'
    select
        person.name as name,
        labeled.label as label
    from
        (person join bi_contact on person.name = bi_contact.left)
        join labeled on bi_contact.right = labeled.name
    where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;

out/recursive_labeling.out

|         name          | group_id |
|-----------------------|----------|
| Agustín Rodríquez     | 1        |
| Ariadna Caraballo     | 1        |
| Juana Baeza           | 1        |
| Micaela Laboy         | 1        |
| Verónica Altamirano   | 1        |
| Pilar Alarcón         | 6        |
| Reina Rivero          | 6        |
| Elias Merino          | 7        |
| Minerva Guerrero      | 8        |
| Mauro Balderas        | 9        |
| Cristal Soliz         | 11       |
| Daniela Menéndez      | 11       |
| Marco Antonio Barrera | 11       |
| Bernardo Narváez      | 14       |
| Óscar Barrios         | 15       |

Practice

Exercise 57: Modify the query above to use union all instead of union to trigger infinite recursion. How can you modify the query so that it stops at a certain depth so that you can trace its output?

Check Understanding

box and arrow diagram showing concepts related to common table expressions in SQL
Figure 12: common table expressions

Appendices

Glossary

1-to-1 relation
A relationship between two tables in which each record from the first table matches exactly one record from the second and vice versa.
1-to-many relation
A relationship between two tables in which each record from the first table matches zero or more records from the second, but each record from the second table matches exactly one record from the first.
administration command
A command for managing a database that isn’t part of the SQL standard. Each RDBMS has its own idiosyncratic admin commands.
aggregation
Combining several values to produce one.
aggregation function
A function used to produce one value from many, such as maximum or addition.
alias
An alternate name used temporarily for a table or column.
atomic
An operation that cannot be broken into smaller operations.
autoincrement
Automatically add one to a value.
B-tree
A self-balancing tree data structure that allows search, insertion, and deletion in logarithmic time.
base case
A starting point for recursion that does not depend on previous recursive calculations.
Binary Large Object (blob)
Bytes that are handled as-is rather than being interpreted as numbers, text, or other data types.
comma-separated values (CSV)
A text format for tabular data that uses commas to separate columns.
common table expression (CTE)
A temporary table created at the start of a query, usually to simplify writing the query.
consistent
A state in which all constraints are satisfied, e.g., all columns contain allowed values and all foreign keys refer to primary keys.
correlated subquery
A subquery that depends on a value or values from the enclosing query, and which must therefore be executed once for each of those values.
cross join
A join that creates the cross-product of rows from two tables.
cursor
A reference to the current location in the results of an ongoing query.
data migration
To move data from one form to another, e.g., from one set of tables to a new set or from one DBMS to another.
database
A collection of data that can be searched and retrieved.
database management system (DBMS)
A program that manages a particular kind of database.
denormalization
To deliberately introduce duplication or other violate normal forms, typically to improve query performance.
durable
Guaranteed to survive shutdown and restart.
entity-relationship diagram
A graphical depiction of the relationships between tables in a database.
exclusive or
A Boolean operation that is true if either but not both of its conditions are true. SQL does not provide an exclusive or operator, but the same result can be achieved using operators it has.
expression
A part of a program that produces a value, such as 1+2.
filter
To select records based on whether they pass some Boolean test.
foreign key
A value in one table that identifies a primary key in another table.
full outer join
A join that produces the union of a left outer join and a right outer join.
group
A set of records that share a common property, such as having the same value in a particular column.
in-memory database
A database that is stored in memory rather than on disk.
inclusive or
A Boolean operator that is true if either or both of its conditions are true. SQL’s or is inclusive.
index
An auxiliary data structure that enables faster access to records.
infinite recursion
See “infinite recursion”.
isolated
The appearance of having executed in an otherwise-idle system.
JavaScript Object Notation (JSON)
A text format for representing numbers, strings, lists, and key-value maps.
join
To combine records from two tables.
join condition
The criteria used to decide which rows from each table in a join are combined.
join table
A table that exists solely to enable information from two tables to be connected.
left outer join
A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.
many-to-many relation
A relationship between two tables in which each record from the first table may match zero or more records from the second and vice versa.
materialized view
A view that is stored on disk and updated on demand.
normal form
One of several (loosely defined) rules for organizing data in tables.
NoSQL database
Any database that doesn’t use the relational model.
null
A special value representing “not known”.
object-relational mapper (ORM)
A library that translates objects in a program into database queries and the results of those queries back into objects.
path expression
An expression identifying an element or a set of elements in a JSON structure.
primary key
A value or values in a database table that uniquely identifies each record in that table.
query
A command to perform some operation in a database (typically data retrieval).
recursive case
The second or subsequent step in self-referential accumulation of data.
recursive CTE
A common table expression that refers to itself. Every recursive CTE must have a base case and a recursive case.
relational database management system (RDBMS)
A database management system that stores data in tables with columns and rows.
right outer join
A join that is guaranteed to keep all rows from the second (right) table. Columns from the left table are filled with actual values if available or with null otherwise. SQLite does not implement right outer join since its behavior can be reproduced by swapping the order of the tables and using a left outer join.
statement
A part of a program that doesn’t produce a value.
subquery
A query used within another query.
table
A collection of related data in a database stored in columns and rows.
table-valued function
A function that returns multiple values rather than a single value.
temporary table
A table that is explicitly constructed in memory outside any particular query.
ternary logic
A logic based on three values: true, false, and “don’t know” (represented as null).
tombstone
A marker value added to a record to show that it is no longer active. Tombstones are used as an alternative to deleting data.
trigger
An action that runs automatically when something happens in a database, typically insertion or deletion.
Uniform Resource Identifier (URI)
A string that identifies a resource (such as a web page or database) and the protocol used to access it.
upsert
To update a record if it exists or insert (create) a new record if it doesn’t.
vectorization
Performing the same operation on a stream of values rather than using a loop to operate on one value at a time.
view
A rearrangement of data in a database that is regenerated on demand.
window function
A function that combines data from adjacent rows in a database query’s result.

Acknowledgments

This tutorial would not have been possible without:

I would also like to thank the following for spotting issues, making suggestions, or submitting changes:

Links