# The Querynomicon

An Introduction to SQL for Weary Data Scientists

Upon first encountering SQL after two decades of Fortran, C, Java, and Python, I thought I had stumbled into hell. I quickly realized that was optimistic: after all, hell has rules.

I have since realized that SQL does too, and that they are no more confusing or contradictory than those of most other programming languages. They only appear so because SQL draws on a tradition unfamiliar to those of us raised with derivatives of C. To quote Terry Pratchett, it is not mad, just differently sane.

Welcome, then, to a world in which the strange will become familiar, and the familiar, strange. Welcome, thrice welcome, to SQL.

"The Querynomicon" is a Third Bit production.

## What This Is

• Notes and working examples that instructors can use to perform a lesson
• Do not expect novices with no prior SQL experience to be able to learn from them
• Musical analogy
• This is the chord changes and melody
• We expect instructors to create an arrangement and/or improvise while delivering
• See Teaching Tech Together for background
• Greg Wilson is a programmer, author, and educator based in Toronto

## Scope

• intended audience
• Rachel has a master’s degree in cell biology and now works in a research hospital doing cell assays.
• She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell.
• Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.
• Her work schedule is unpredictable and highly variable, so she needs to be able to learn a bit at a time.
• prerequisites
• basic Unix command line: cd, ls, * wildcard
• basic tabular data analysis: filtering rows, aggregating within groups
• learning outcomes
1. Explain the difference between a database and a database manager.
2. Write SQL to select, filter, sort, group, and aggregate data.
3. Define tables and insert, update, and delete records.
4. Describe different types of join and write queries that use them to combine data.
5. Use windowing functions to operate on adjacent rows.
6. Explain what transactions are and write queries that roll back when constraints are violated.
7. Explain what triggers are and write SQL to create them.
8. Manipulate JSON data using SQL.
9. Interact with a database using Python directly, from a Jupyter notebook, and via an ORM.

## Setup

• Unzip the file in a temporary directory to create:
• ./db/*.db: the SQLite databases used in the examples
• ./src/*.*: SQL queries, Python scripts, and other source code
• ./out/*.*: expected output for examples

## Background Concepts

• A database is a collection of data that can be searched and retrieved
• A database management system (DBMS) is a program that manages a particular kind of database
• Each DBMS stores data in its own way
• SQLite stores each database in a single file
• PostgreSQL spreads information across many files for higher performance
• DBMS can be a library embedded in other programs (SQLite) or a server (PostgreSQL)
• A relational database management system (RDBMS) stores data in tables and uses SQL for queries
• Unfortunately, every RDBMS has its own dialect of SQL
• There are also NoSQL databases like MongoDB that don’t use tables
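The "embedded library" point can be seen directly from Python, whose standard library bundles SQLite: opening a database is just a function call in the host program, with no server to start. A minimal sketch using an in-memory database so nothing touches disk:

```python
import sqlite3

# SQLite is a library embedded in the host program:
# "connecting" just opens a file (or here, an in-memory database).
connection = sqlite3.connect(":memory:")
cursor = connection.execute("select sqlite_version()")
version = cursor.fetchone()[0]
print(version)  # e.g. "3.40.1" -- whatever version Python was built with
connection.close()
```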

## Connecting to Database

src/connect_penguins.sh

sqlite3 db/penguins.db
• Not actually a query: starts an interactive session with the database in db/penguins.db
• Alternative: provide a single query on the command line: sqlite3 database "query"
• Or put query in file and run sqlite3 database < filename
• Note: the penguins database contains two tables
• penguins is all the Palmer Penguins data
• little_penguins is a subset used in our first few queries to keep output readable

To disconnect from an interactive database session, type Control-D or .quit on a line of its own. You may need to type a semi-colon ; to close any unfinished query before SQLite will recognize your attempt to escape.

## 1: Selecting Constant

src/select_1.sql

select 1;

out/select_1.out

1
• select is a keyword
• Normally used to select data from table…
• …but if all we want is a constant value, we don’t need to specify one
• Semi-colon terminator is required

## 2: Selecting All Values from Table

src/select_star.sql

select * from little_penguins;

out/select_star.out

Gentoo|Biscoe|51.3|14.2|218.0|5300.0|MALE
Chinstrap|Dream|55.8|19.8|207.0|4000.0|MALE
Gentoo|Biscoe|42.6|13.7|213.0|4950.0|FEMALE
Gentoo|Biscoe|52.1|17.0|230.0|5550.0|MALE
• An actual query
• Use * to mean “all columns”
• Use from tablename to specify table
• Output format is not particularly readable

.mode markdown
select * from little_penguins;

|  species  |  island   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |  sex   |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| Gentoo    | Biscoe    | 51.3           | 14.2          | 218.0             | 5300.0      | MALE   |
| Adelie    | Dream     | 35.7           | 18.0          | 202.0             | 3550.0      | FEMALE |
| Adelie    | Torgersen | 36.6           | 17.8          | 185.0             | 3700.0      | FEMALE |
| Chinstrap | Dream     | 55.8           | 19.8          | 207.0             | 4000.0      | MALE   |
| Adelie    | Dream     | 38.1           | 18.6          | 190.0             | 3700.0      | FEMALE |
| Adelie    | Dream     | 36.2           | 17.3          | 187.0             | 3300.0      | FEMALE |
| Adelie    | Dream     | 39.5           | 17.8          | 188.0             | 3300.0      | FEMALE |
| Gentoo    | Biscoe    | 42.6           | 13.7          | 213.0             | 4950.0      | FEMALE |
| Gentoo    | Biscoe    | 52.1           | 17.0          | 230.0             | 5550.0      | MALE   |
| Adelie    | Torgersen | 36.7           | 18.8          | 187.0             | 3800.0      | FEMALE |
• .mode and other dot-commands are instructions to SQLite itself rather than SQL, and each must appear on a line of its own
• Use .help for a complete list
• And as mentioned earlier, use .quit to quit

## 3: Specifying Columns

src/specify_columns.sql

select
species,
island,
sex
from little_penguins;

out/specify_columns.out

|  species  |  island   |  sex   |
|-----------|-----------|--------|
| Gentoo    | Biscoe    | MALE   |
| Adelie    | Dream     | FEMALE |
| Adelie    | Torgersen | FEMALE |
| Chinstrap | Dream     | MALE   |
| Adelie    | Dream     | FEMALE |
| Adelie    | Dream     | FEMALE |
| Adelie    | Dream     | FEMALE |
| Gentoo    | Biscoe    | FEMALE |
| Gentoo    | Biscoe    | MALE   |
| Adelie    | Torgersen | FEMALE |
• Specify column names separated by commas
• In any order
• Duplicates allowed
• Line breaks are allowed, and encouraged for readability

## 4: Sorting

src/sort.sql

select
species,
sex,
island
from little_penguins
order by island asc, sex desc;

out/sort.out

|  species  |  sex   |  island   |
|-----------|--------|-----------|
| Gentoo    | MALE   | Biscoe    |
| Gentoo    | MALE   | Biscoe    |
| Gentoo    | FEMALE | Biscoe    |
| Chinstrap | MALE   | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Dream     |
| Adelie    | FEMALE | Torgersen |
| Adelie    | FEMALE | Torgersen |
• asc is ascending, desc is descending
• Default is ascending, but please specify
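The same ordering rules apply when SQL is run from a host language. A sketch using Python's sqlite3 module; the three-row table and its values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table p (island text, sex text)")
conn.executemany(
    "insert into p values (?, ?)",
    [("Dream", "FEMALE"), ("Biscoe", "MALE"), ("Biscoe", "FEMALE")],
)
# ascending by island, then descending by sex within each island
rows = conn.execute(
    "select island, sex from p order by island asc, sex desc"
).fetchall()
print(rows)
# [('Biscoe', 'MALE'), ('Biscoe', 'FEMALE'), ('Dream', 'FEMALE')]
```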

## Practice

Exercise 1: Write a SQL query to select the sex and body mass columns from the little_penguins table in that order, sorted such that the largest body mass appears first.

## 5: Limiting Output

• Full dataset has 344 rows

src/limit.sql

select
species,
sex,
island
from penguins
order by species, sex, island
limit 10;

out/limit.out

| species |  sex   |  island   |
|---------|--------|-----------|
| Adelie  |        | Dream     |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
• limit N specifies maximum number of rows returned by query

## 6: Paging Output

src/page.sql

select
species,
sex,
island
from penguins
order by species, sex, island
limit 10 offset 3;

out/page.out

| species |  sex   |  island   |
|---------|--------|-----------|
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  |        | Torgersen |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
| Adelie  | FEMALE | Biscoe    |
• offset N must follow limit
• Specifies number of rows to skip from the start of the selection
• So this query skips the first 3 and shows the next 10
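Together, limit and offset are the usual building blocks for paging through a large result set. A sketch in Python with an invented seven-row table, fetching one page at a time until the result comes back empty:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table n (val integer)")
conn.executemany("insert into n values (?)", [(i,) for i in range(1, 8)])  # 1..7

page_size = 3
offset = 0
pages = []
while True:
    rows = conn.execute(
        "select val from n order by val limit ? offset ?", (page_size, offset)
    ).fetchall()
    if not rows:
        break  # no more pages
    pages.append([r[0] for r in rows])
    offset += page_size
print(pages)  # [[1, 2, 3], [4, 5, 6], [7]]
```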

## 7: Removing Duplicates

src/distinct.sql

select distinct
species,
sex,
island
from penguins;

out/distinct.out

|  species  |  sex   |  island   |
|-----------|--------|-----------|
| Adelie    | MALE   | Torgersen |
| Adelie    | FEMALE | Torgersen |
| Adelie    |        | Torgersen |
| Adelie    | FEMALE | Biscoe    |
| Adelie    | MALE   | Biscoe    |
| Adelie    | FEMALE | Dream     |
| Adelie    | MALE   | Dream     |
| Adelie    |        | Dream     |
| Chinstrap | FEMALE | Dream     |
| Chinstrap | MALE   | Dream     |
| Gentoo    | FEMALE | Biscoe    |
| Gentoo    | MALE   | Biscoe    |
| Gentoo    |        | Biscoe    |
• distinct keyword must appear right after select
• SQL was supposed to read like English
• Shows distinct combinations
• Blanks in sex column show missing data

## Practice

Exercise 2: Write a SQL query to select the islands and species from rows 50 to 60 inclusive of the penguins table. Your result should have 11 rows.

Exercise 3: Modify your query to select distinct combinations of island and species from the same rows and compare the result to what you got in part 1.

## 8: Filtering Results

src/filter.sql

select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';

out/filter.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Adelie  | MALE   | Biscoe |
| Gentoo  | FEMALE | Biscoe |
| Gentoo  | MALE   | Biscoe |
| Gentoo  |        | Biscoe |
• where condition filters the rows produced by selection
• Condition is evaluated independently for each row
• Only rows that pass the test appear in results
• Use single quotes for 'text data' and double quotes for "weird column names"
• SQLite will accept double-quoted text data but SQLFluff will complain
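When queries are run from a host language, the quoting question largely disappears: values are passed as parameters and the library handles quoting. A sketch in Python (the table and values are invented; the ? placeholder is standard sqlite3 syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table p (species text, island text)")
conn.executemany(
    "insert into p values (?, ?)",
    [("Gentoo", "Biscoe"), ("Adelie", "Dream"), ("Adelie", "Torgersen")],
)
# the ? placeholder lets the library quote the value, which also
# prevents SQL injection when the value comes from user input
rows = conn.execute(
    "select species from p where island = ?", ("Dream",)
).fetchall()
print(rows)  # [('Adelie',)]
```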

## Practice

Exercise 4: Write a query to select the body masses from penguins that are less than 3000.0 grams.

Exercise 5: Write another query to select the species and sex of penguins that weigh less than 3000.0 grams. This shows that the columns displayed and those used in filtering are independent of each other.

## 9: Filtering with More Complex Conditions

src/filter_and.sql

select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'MALE';

out/filter_and.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Gentoo  | FEMALE | Biscoe |
• and: both sub-conditions must be true
• or: either or both part must be true
• Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn’t pass the test

## Practice

Exercise 6: Use the not operator to select penguins that are not Gentoos.

Exercise 7: SQL’s or is an inclusive or: it succeeds if either or both conditions are true. SQL does not provide a specific operator for exclusive or, which is true if either but not both conditions are true, but the same effect can be achieved using and, or, and not. Write a query to select penguins that are female or on Torgersen Island but not both.

## 10: Doing Calculations

src/calculations.sql

select
flipper_length_mm / 10.0,
body_mass_g / 1000.0
from penguins
limit 3;

out/calculations.out

| flipper_length_mm / 10.0 | body_mass_g / 1000.0 |
|--------------------------|----------------------|
| 18.1                     | 3.75                 |
| 18.6                     | 3.8                  |
| 19.5                     | 3.25                 |
• Can do the usual kinds of arithmetic on individual values
• Calculation done for each row independently
• Column name shows the calculation done

## 11: Renaming Columns

src/rename_columns.sql

select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 3;

out/rename_columns.out

| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1       | 3.75      | Torgersen   |
| 18.6       | 3.8       | Torgersen   |
| 19.5       | 3.25      | Torgersen   |
• Use expression as name to rename
• Give result of calculation a meaningful name
• Can also rename columns without modifying

## Practice

Exercise 8: Write a single query that calculates and returns:

1. A column called what_where that has the species and island of each penguin separated by a single space.
2. A column called bill_ratio that has the ratio of bill length to bill depth.

You can use the || operator to concatenate text to solve part 1, or look at the documentation for SQLite’s format() function.

## 12: Calculating with Missing Values

src/show_missing_values.sql

select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 5;

out/show_missing_values.out

| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1       | 3.75      | Torgersen   |
| 18.6       | 3.8       | Torgersen   |
| 19.5       | 3.25      | Torgersen   |
|            |           | Torgersen   |
| 19.3       | 3.45      | Torgersen   |
• SQL uses a special value null to represent missing data
• Not 0 or empty string, but “I don’t know”
• Flipper length and body weight not known for one of the first five penguins
• “I don’t know” divided by 10 or 1000 is “I don’t know”

## Practice

Exercise 9: Use SQLite’s .nullvalue command to change the printed representation of null to the string null and then re-run the previous query. When will displaying null as null be easier to understand? When might it be misleading?

## 13: Null Equality

• Repeated from earlier (so it doesn’t count against our query limit)

src/filter.sql

select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';

out/filter.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Adelie  | MALE   | Biscoe |
| Gentoo  | FEMALE | Biscoe |
| Gentoo  | MALE   | Biscoe |
| Gentoo  |        | Biscoe |
• If we ask for female penguins the row with the missing sex drops out

src/null_equality.sql

select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';

out/null_equality.out

| species |  sex   | island |
|---------|--------|--------|
| Adelie  | FEMALE | Biscoe |
| Gentoo  | FEMALE | Biscoe |

## 14: Null Inequality

• But if we ask for penguins that aren’t female it drops out as well

src/null_inequality.sql

select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'FEMALE';

out/null_inequality.out

| species | sex  | island |
|---------|------|--------|
| Adelie  | MALE | Biscoe |
| Gentoo  | MALE | Biscoe |

## 15: Ternary Logic

src/ternary_logic.sql

select null = null;

out/ternary_logic.out

| null = null |
|-------------|
|             |
• If we don’t know the left and right values, we don’t know if they’re equal or not
• So the result is null
• Get the same answer for null != null
• Ternary logic
| equality | X     | Y     | null |
|----------|-------|-------|------|
| X        | true  | false | null |
| Y        | false | true  | null |
| null     | null  | null  | null |
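The truth table can be checked directly from Python, where the sqlite3 module reports SQL null as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# null = null is null ("don't know"), which sqlite3 reports as None
print(conn.execute("select null = null").fetchone()[0])   # None
print(conn.execute("select null != null").fetchone()[0])  # None
# arithmetic propagates null the same way
print(conn.execute("select null / 10.0").fetchone()[0])   # None
# 'is null' is the safe test and yields a real boolean (1 = true)
print(conn.execute("select null is null").fetchone()[0])  # 1
```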

## 16: Handling Null Safely

src/safe_null_equality.sql

select
species,
sex,
island
from penguins
where sex is null;

out/safe_null_equality.out

| species | sex |  island   |
|---------|-----|-----------|
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Torgersen |
| Adelie  |     | Dream     |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
| Gentoo  |     | Biscoe    |
• Use is null and is not null to handle null safely
• Other parts of SQL handle nulls specially

## Practice

Exercise 10: Write a query to find penguins whose body mass is known but whose sex is not.

Exercise 11: Write another query to find penguins whose sex is known but whose body mass is not.

## 17: Aggregating

src/simple_sum.sql

select sum(body_mass_g) as total_mass
from penguins;

out/simple_sum.out

| total_mass |
|------------|
| 1437000.0  |

## 18: Common Aggregation Functions

src/common_aggregations.sql

select
max(bill_length_mm) as longest_bill,
min(flipper_length_mm) as shortest_flipper,
avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;

out/common_aggregations.out

| longest_bill | shortest_flipper |   weird_ratio    |
|--------------|------------------|------------------|
| 59.6         | 172.0            | 2.56087082530644 |
• This actually shouldn’t work: can’t calculate maximum or average if any values are null
• SQL does the useful thing instead of the right one

## Practice

Exercise 12: What is the average body mass of penguins that weigh more than 3000.0 grams?

## 19: Counting

src/count_behavior.sql

select
count(*) as count_star,
count(sex) as count_specific,
count(distinct sex) as count_distinct
from penguins;

out/count_behavior.out

| count_star | count_specific | count_distinct |
|------------|----------------|----------------|
| 344        | 333            | 2              |
• count(*) counts rows
• count(column) counts non-null entries in column
• count(distinct column) counts distinct non-null entries
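The three counting behaviors can be sketched with a small table containing a null (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table p (sex text)")
conn.executemany("insert into p values (?)",
                 [("MALE",), ("FEMALE",), ("FEMALE",), (None,)])
row = conn.execute("""
    select count(*),            -- all rows
           count(sex),          -- non-null entries
           count(distinct sex)  -- distinct non-null entries
    from p
""").fetchone()
print(row)  # (4, 3, 2)
```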

## Practice

Exercise 13: How many different body masses are in the penguins dataset?

## 20: Grouping

src/simple_group.sql

select avg(body_mass_g) as average_mass_g
from penguins
group by sex;

out/simple_group.out

|  average_mass_g  |
|------------------|
| 4005.55555555556 |
| 3862.27272727273 |
| 4545.68452380952 |
• Put rows in groups based on distinct combinations of values in columns specified with group by
• Then perform aggregation separately for each group
• But which is which?

## 21: Behavior of Unaggregated Columns

src/unaggregated_columns.sql

select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex;

out/unaggregated_columns.out

|  sex   |  average_mass_g  |
|--------|------------------|
|        | 4005.55555555556 |
| FEMALE | 3862.27272727273 |
| MALE   | 4545.68452380952 |
• All rows in each group have the same value for sex, so no need to aggregate
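The same pattern, grouping column included, can be sketched in Python with an invented three-row table (an order by is added only to make the output order predictable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table p (sex text, mass real)")
conn.executemany("insert into p values (?, ?)",
                 [("FEMALE", 3000.0), ("FEMALE", 4000.0), ("MALE", 5000.0)])
rows = conn.execute("""
    select sex, avg(mass)
    from p
    group by sex
    order by sex
""").fetchall()
print(rows)  # [('FEMALE', 3500.0), ('MALE', 5000.0)]
```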

## 22: Arbitrary Choice in Aggregation

src/arbitrary_in_aggregation.sql

select
sex,
body_mass_g
from penguins
group by sex;

out/arbitrary_in_aggregation.out

|  sex   | body_mass_g |
|--------|-------------|
|        |             |
| FEMALE | 3800.0      |
| MALE   | 3750.0      |
• If we don’t specify how to aggregate a column, SQLite chooses any arbitrary value from the group
• All penguins in each group have the same sex because we grouped by that, so we get the right answer
• The body mass values are in the data but unpredictable
• A common mistake
• Other database managers don’t do this
• E.g., PostgreSQL complains that column must be used in an aggregation function

## Practice

Exercise 14: Explain why the output of the previous query has a blank line before the rows for female and male penguins.

Exercise 15: Write a query that shows each distinct body mass in the penguin dataset and the number of penguins that weigh that much.

## 23: Filtering Aggregated Values

src/filter_aggregation.sql

select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;

out/filter_aggregation.out

| sex  |  average_mass_g  |
|------|------------------|
|      | 4005.55555555556 |
| MALE | 4545.68452380952 |
• Using having condition instead of where condition for aggregates

## 24: Readable Output

select
sex,
round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;

| sex  | average_mass_g |
|------|----------------|
|      | 4005.6         |
| MALE | 4545.7         |
• Use round(value, decimals) to round off a number

## 25: Filtering Aggregate Inputs

src/filter_aggregate_inputs.sql

select
sex,
round(
avg(body_mass_g) filter (where body_mass_g < 4000.0),
1
) as average_mass_g
from penguins
group by sex;

out/filter_aggregate_inputs.out

|  sex   | average_mass_g |
|--------|----------------|
|        | 3362.5         |
| FEMALE | 3417.3         |
| MALE   | 3729.6         |
• filter (where condition) applies to inputs

## Practice

Exercise 16: Write a query that uses filter to calculate the average body masses of heavy penguins (those over 4500 grams) and light penguins (those under 3500 grams) simultaneously. Is it possible to do this using where instead of filter?

## Creating In-memory Database

src/in_memory_db.sh

sqlite3 :memory:
• “Connect” to an in-memory database
• Changes aren’t saved to disk
• Very useful for testing (discussed later)
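The same trick works from Python, which is one reason in-memory databases are handy for tests: each test can build a throwaway database and never touch disk. A sketch, with the helper names (fresh_db, test_billable_total) invented for illustration:

```python
import sqlite3

def fresh_db():
    """Create a throwaway database for one test; nothing touches disk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table job (name text not null, billable real not null)")
    return conn

def test_billable_total():
    conn = fresh_db()
    conn.executemany("insert into job values (?, ?)",
                     [("calibrate", 1.5), ("clean", 0.5)])
    total = conn.execute("select sum(billable) from job").fetchone()[0]
    assert total == 2.0

test_billable_total()
```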

## 26: Creating Tables

src/create_work_job.sql

create table job (
name text not null,
billable real not null
);
create table work (
person text not null,
job text not null
);
• create table name followed by parenthesized list of columns
• Each column is a name, a data type, and optional extra information
• E.g., not null prevents nulls from being added
• .schema is not standard SQL
• SQLite has added a few things
• create if not exists
• upper-case keywords (SQL is case insensitive)

## 27: Inserting Data

src/populate_work_job.sql

insert into job values
('calibrate', 1.5),
('clean', 0.5);
insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');

out/insert_values.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
| clean     | 0.5      |

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |
| tay    | complain  |

## Following Along

• To re-create this database, run the create table and insert statements shown above in a fresh SQLite session

## Practice

Exercise 17: Using an in-memory database, define a table called notes with two text columns author and note and then add three or four rows. Use a query to check that the notes have been stored and that you can (for example) select by author name.

Exercise 18: What happens if you try to insert too many or too few values into notes? What happens if you insert a number instead of a string into the note field?

## 28: Updating Rows

src/update_work_job.sql

update work
set person = 'tae'
where person = 'tay';

out/update_rows.out

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |
| tae    | complain  |
• (Almost) always specify row(s) to update using where
• Otherwise update all rows in table, which is usually not wanted

## 29: Deleting Rows

src/delete_rows.sql

delete from work
where person = 'tae';

select * from work;

out/delete_rows.out

| person |    job    |
|--------|-----------|
| mik    | calibrate |
| mik    | clean     |
| mik    | complain  |
| po     | clean     |
| po     | complain  |
• Again, (almost) always specify row(s) to delete using where

## Practice

Exercise 19: What happens if you try to delete rows that don’t exist (e.g., all entries in work that refer to juna)?

## 30: Backing Up

src/backing_up.sql

create table backup (
person text not null,
job text not null
);

insert into backup
select
person,
job
from work
where person = 'tae';

delete from work
where person = 'tae';

select * from backup;

out/backing_up.out

| person |   job    |
|--------|----------|
| tae    | complain |
• We will explore another strategy based on tombstones below

## Practice

Exercise 20: Saving and restoring data as text:

1. Re-create the notes table in an in-memory database and then use SQLite’s .output and .dump commands to save the database to a file called notes.sql. Inspect the contents of this file: how has your data been stored?

2. Start a fresh SQLite session and load notes.sql using the .read command. Inspect the database using .schema and select *: is everything as you expected?

Exercise 21: Saving and restoring data in binary format:

1. Re-create the notes table in an in-memory database once again and use SQLite’s .backup command to save it to a file called notes.db. Inspect this file using od -c notes.db or a text editor that can handle binary data: how has your data been stored?

2. Start a fresh SQLite session and load notes.db using the .restore command. Inspect the database using .schema and select *: is everything as you expected?

## 31: Combining Information

src/cross_join.sql

select *
from work cross join job;

out/cross_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | calibrate | clean     | 0.5      |
| mik    | clean     | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| mik    | complain  | calibrate | 1.5      |
| mik    | complain  | clean     | 0.5      |
| po     | clean     | calibrate | 1.5      |
| po     | clean     | clean     | 0.5      |
| po     | complain  | calibrate | 1.5      |
| po     | complain  | clean     | 0.5      |
| tay    | complain  | calibrate | 1.5      |
| tay    | complain  | clean     | 0.5      |
• A join combines information from two tables
• cross join constructs their cross product
• All combinations of rows from each
• Result isn’t particularly useful: job and name values don’t match
• I.e., the combined data has records whose parts have nothing to do with each other

## 32: Inner Join

src/inner_join.sql

select *
from work inner join job
on work.job = job.name;

out/inner_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| po     | clean     | clean     | 0.5      |
• Use table.column notation to specify columns
• A column can have the same name as a table
• Use on condition to specify join condition
• Since complain doesn’t appear in job.name, none of those rows are kept
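The same inner join can be run from Python; the tables below duplicate the work and job data created earlier, and an order by is added so the output order is predictable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table work (person text not null, job text not null);
    create table job (name text not null, billable real not null);
    insert into work values
        ('mik', 'calibrate'), ('mik', 'clean'), ('mik', 'complain'),
        ('po', 'clean'), ('po', 'complain'), ('tay', 'complain');
    insert into job values ('calibrate', 1.5), ('clean', 0.5);
""")
rows = conn.execute("""
    select work.person, work.job, job.billable
    from work inner join job on work.job = job.name
    order by work.person, work.job
""").fetchall()
print(rows)
# [('mik', 'calibrate', 1.5), ('mik', 'clean', 0.5), ('po', 'clean', 0.5)]
```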

## Practice

Exercise 22: Re-run the query shown above using where job = name instead of the full table.name notation. Is the shortened form easier or harder to read and more or less likely to cause errors?

## 33: Aggregating Joined Data

src/aggregate_join.sql

select
work.person,
sum(job.billable) as pay
from work inner join job
on work.job = job.name
group by work.person;

out/aggregate_join.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |
• Combines ideas we’ve seen before
• But Tay is missing from the table
• No records in the job table with tay as name
• So no records to be grouped and summed

## 34: Left Join

src/left_join.sql

select *
from work left join job
on work.job = job.name;

out/left_join.out

| person |    job    |   name    | billable |
|--------|-----------|-----------|----------|
| mik    | calibrate | calibrate | 1.5      |
| mik    | clean     | clean     | 0.5      |
| mik    | complain  |           |          |
| po     | clean     | clean     | 0.5      |
| po     | complain  |           |          |
| tay    | complain  |           |          |
• A left outer join keeps all rows from the left table
• Fills missing values from right table with null

## 35: Aggregating Left Joins

src/aggregate_left_join.sql

select
work.person,
sum(job.billable) as pay
from work left join job
on work.job = job.name
group by work.person;

out/aggregate_left_join.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |
| tay    |     |
• That’s better, but we’d like to see 0 rather than a blank

## 36: Coalescing Values

src/coalesce.sql

select
work.person,
coalesce(sum(job.billable), 0.0) as pay
from work left join job
on work.job = job.name
group by work.person;

out/coalesce.out

| person | pay |
|--------|-----|
| mik    | 2.0 |
| po     | 0.5 |
| tay    | 0.0 |
• coalesce(val1, val2, …) returns first non-null value
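Putting the left join, grouping, and coalesce together in Python (same work and job data as before, with an order by added for predictable output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table work (person text not null, job text not null);
    create table job (name text not null, billable real not null);
    insert into work values
        ('mik', 'calibrate'), ('mik', 'clean'), ('mik', 'complain'),
        ('po', 'clean'), ('po', 'complain'), ('tay', 'complain');
    insert into job values ('calibrate', 1.5), ('clean', 0.5);
""")
rows = conn.execute("""
    select work.person, coalesce(sum(job.billable), 0.0) as pay
    from work left join job on work.job = job.name
    group by work.person
    order by work.person
""").fetchall()
print(rows)  # [('mik', 2.0), ('po', 0.5), ('tay', 0.0)]
```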

## Full Outer Join

src/full_outer_join.sql

create table size (
s text not null
);
insert into size values ('light'), ('heavy');

create table weight (
w text not null
);

select * from size full outer join weight;

out/full_outer_join.out

|   s   | w |
|-------|---|
| light |   |
| heavy |   |
• A cross join would produce an empty result

## Practice

Exercise 23: Find the least time each person spent on any job. Your output should show that mik and po each spent 0.5 hours on some job. Can you find a way to show the name of the job as well using the SQL you have seen so far?

## 37: Negating Incorrectly

• Who doesn’t calibrate?

src/negate_incorrectly.sql

select distinct person
from work
where job != 'calibrate';

out/negate_incorrectly.out

| person |
|--------|
| mik    |
| po     |
| tay    |
• But Mik does calibrate
• Problem is that there’s an entry for Mik cleaning
• And since 'clean' != 'calibrate', that row is included in the results
• We need a different approach…

## 38: Set Membership

src/set_membership.sql

select *
from work
where person not in ('mik', 'tay');

out/set_membership.out

| person |   job    |
|--------|----------|
| po     | clean    |
| po     | complain |
• in values and not in values do exactly what you expect

## 39: Subqueries

src/subquery_set.sql

select distinct person
from work
where person not in (
select distinct person
from work
where job = 'calibrate'
);

out/subquery_set.out

| person |
|--------|
| po     |
| tay    |
• Use a subquery to select the people who do calibrate
• Then select all the people who aren’t in that set
• Initially feels odd, but subqueries are useful in other ways
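The subquery runs the same way from Python (the work data duplicates the table built earlier; an order by makes the output predictable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table work (person text not null, job text not null)")
conn.executemany("insert into work values (?, ?)", [
    ("mik", "calibrate"), ("mik", "clean"), ("mik", "complain"),
    ("po", "clean"), ("po", "complain"), ("tay", "complain"),
])
rows = conn.execute("""
    select distinct person from work
    where person not in (
        select distinct person from work where job = 'calibrate'
    )
    order by person
""").fetchall()
print([r[0] for r in rows])  # ['po', 'tay']
```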

## Defining a Primary Key

• Can use any field (or combination of fields) in a table as a primary key as long as the value(s) are unique for each record
• Uniquely identifies a particular record in a particular table

src/primary_key.sql

create table lab_equipment (
size real not null,
color text not null,
num integer not null,
primary key (size, color)
);

insert into lab_equipment values
(1.5, 'blue', 2),
(1.5, 'green', 1),
(2.5, 'blue', 1);

select * from lab_equipment;

insert into lab_equipment values
(1.5, 'green', 2);

out/primary_key.out

| size | color | num |
|------|-------|-----|
| 1.5  | blue  | 2   |
| 1.5  | green | 1   |
| 2.5  | blue  | 1   |
Runtime error near line 17: UNIQUE constraint failed: lab_equipment.size, lab_equipment.color (19)
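From Python, the same violation surfaces as a sqlite3.IntegrityError, which a program can catch and handle instead of aborting. A sketch using the same lab_equipment table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table lab_equipment (
        size real not null,
        color text not null,
        num integer not null,
        primary key (size, color)
    )
""")
conn.execute("insert into lab_equipment values (1.5, 'green', 1)")
try:
    # same (size, color) pair again: violates the primary key
    conn.execute("insert into lab_equipment values (1.5, 'green', 2)")
except sqlite3.IntegrityError as exc:
    print(exc)  # UNIQUE constraint failed: lab_equipment.size, lab_equipment.color
```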

## Practice

Exercise 24: Does the penguins table have a primary key? If so, what is it? What about the work and job tables?

## 40: Autoincrementing and Primary Keys

src/autoincrement.sql

create table person (
ident integer primary key autoincrement,
name text not null
);
insert into person values
(null, 'mik'),
(null, 'po'),
(null, 'tay');
select * from person;
insert into person values (1, 'prevented');

out/autoincrement.out

| ident | name |
|-------|------|
| 1     | mik  |
| 2     | po   |
| 3     | tay  |
Runtime error near line 12: UNIQUE constraint failed: person.ident (19)
• Database autoincrements ident each time a new record is added
• Common to use that field as the primary key
• Unique for each record
• If Tay changes their name again, we only have to change one fact in the database
• Downside: manual queries are harder to read (who is person 17?)
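When inserting from Python there is no need to pass null for ident: name the columns being supplied and read the generated key back from the cursor (cursor.lastrowid is standard sqlite3 API). A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table person (
        ident integer primary key autoincrement,
        name text not null
    )
""")
for name in ("mik", "po", "tay"):
    cursor = conn.execute("insert into person (name) values (?)", (name,))
    print(name, cursor.lastrowid)  # the ident the database assigned
rows = conn.execute("select ident, name from person order by ident").fetchall()
print(rows)  # [(1, 'mik'), (2, 'po'), (3, 'tay')]
```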

## Internal Tables

src/sequence_table.sql

select * from sqlite_sequence;

out/sequence_table.out

|  name  | seq |
|--------|-----|
| person | 3   |
• Sequence numbers are not reset when rows are deleted
• In part so that they can be used as primary keys

## Practice

Exercise 25: Are you able to modify the values stored in sqlite_sequence? In particular, are you able to reset the values so that the same sequence numbers are generated again?

## 41: Altering Tables

src/alter_tables.sql

alter table job
add ident integer not null default -1;

update job
set ident = 1
where name = 'calibrate';

update job
set ident = 2
where name = 'clean';

select * from job;

out/alter_tables.out

|   name    | billable | ident |
|-----------|----------|-------|
| calibrate | 1.5      | 1     |
| clean     | 0.5      | 2     |
• Add a column after the fact
• Since it can’t be null, we have to provide a default value
• Really want to make it the primary key, but SQLite doesn’t allow that after the fact
• Then use update to modify existing records
• Can modify any number of records at once
• So be careful about where clause
• An example of data migration

## M-to-N Relationships

• Relationships between entities are usually characterized as:
• 1-to-1: fields in the same record
• 1-to-many: the many have a foreign key referring to the one’s primary key
• many-to-many: can’t know in advance how many keys to add to records (any assumed “maximum” never is enough)
• Nearly-universal solution is a join table
• Each record is a pair of foreign keys
• I.e., each record is the fact that records A and B are related
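A join table can be sketched end to end in a few lines. The following uses Python's stdlib sqlite3 module with invented people and jobs; each row of `work` records one "this person does this job" fact:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person (ident integer primary key, name text not null);
create table job (ident integer primary key, name text not null);
-- the join table: each record is a pair of foreign keys
create table work (
    person_id integer not null references person (ident),
    job_id integer not null references job (ident)
);
insert into person values (1, 'mik'), (2, 'po');
insert into job values (1, 'calibrate'), (2, 'clean');
insert into work values (1, 1), (1, 2), (2, 2);  -- mik does both jobs, po only cleans
""")

# recover the many-to-many relationship by joining through the join table
pairs = conn.execute("""
    select person.name, job.name
    from work
    inner join person on person.ident = work.person_id
    inner join job on job.ident = work.job_id
    order by person.name, job.name
""").fetchall()
print(pairs)
# [('mik', 'calibrate'), ('mik', 'clean'), ('po', 'clean')]
```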

## 42: Creating New Tables from Old

src/insert_select.sql

create table new_work (
person_id integer not null,
job_id integer not null,
foreign key (person_id) references person (ident),
foreign key (job_id) references job (ident)
);

insert into new_work
select
person.ident as person_id,
job.ident as job_id
from
(person inner join work on person.name = work.person)
inner join job on job.name = work.job;
select * from new_work;

out/insert_select.out

| person_id | job_id |
|-----------|--------|
| 1         | 1      |
| 1         | 2      |
| 2         | 2      |
• new_work is our join table
• Each column refers to a record in some other table

## 43: Removing Tables

src/drop_table.sql

drop table work;
alter table new_work rename to work;

out/drop_table.out

CREATE TABLE job (
ident integer primary key autoincrement,
name text not null,
billable real not null
);
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person (
ident integer primary key autoincrement,
name text not null
);
CREATE TABLE IF NOT EXISTS "work" (
person_id integer not null,
job_id integer not null,
foreign key(person_id) references person(ident),
foreign key(job_id) references job(ident)
);
• Remove the old table and rename the new one to take its place
• Note the if not exists in the schema of the renamed work table

## Practice

Exercise 26: Reorganize the penguins database:

1. Make a copy of the penguins.db file so that your changes won’t affect the original.

2. Write a SQL script that reorganizes the data into three tables: one for each island.

3. Why is organizing data like this a bad idea?

Exercise 27: Tools like Sqitch can manage changes to database schemas and data so that they can be saved in version control and rolled back if they are unsuccessful. Translate the changes made by the scripts above into Sqitch. Note: this exercise may take an hour or more.

## 44: Comparing Individual Values to Aggregates

• Go back to the original penguins database

src/compare_individual_aggregate.sql

select body_mass_g
from penguins
where
body_mass_g > (
select avg(body_mass_g)
from penguins
)
limit 5;

out/compare_individual_aggregate.out

| body_mass_g |
|-------------|
| 4675.0      |
| 4250.0      |
| 4400.0      |
| 4500.0      |
| 4650.0      |
• Get average body mass in subquery
• Compare each row against that
• Requires two scans of the data, but no way to avoid that
• Except by maintaining a running total each time a penguin is added to the table
• Null values aren’t included in the average or in the final results

## Practice

Exercise 28: Use a subquery to find the number of penguins that weigh the same as the lightest penguin.

## 45: Comparing Individual Values to Aggregates Within Groups

src/compare_within_groups.sql

select
penguins.species,
penguins.body_mass_g,
round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins inner join (
select
species,
avg(body_mass_g) as avg_mass_g
from penguins
group by species
) as averaged
on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;

out/compare_within_groups.out

| species | body_mass_g | avg_mass_g |
|---------|-------------|------------|
| Adelie  | 3750.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
| Adelie  | 4675.0      | 3700.7     |
| Adelie  | 4250.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
• Subquery runs first to create temporary table averaged with average mass per species
• Join that with penguins
• Filter to find penguins heavier than average within their species

## Practice

Exercise 29: Use a subquery to find the number of penguins that weigh the same as the lightest penguin of the same sex and species.

## 46: Common Table Expressions

src/common_table_expressions.sql

with grouped as (
select
species,
avg(body_mass_g) as avg_mass_g
from penguins
group by species
)

select
penguins.species,
penguins.body_mass_g,
round(grouped.avg_mass_g, 1) as avg_mass_g
from penguins inner join grouped
on penguins.species = grouped.species
where penguins.body_mass_g > grouped.avg_mass_g
limit 5;

out/common_table_expressions.out

| species | body_mass_g | avg_mass_g |
|---------|-------------|------------|
| Adelie  | 3750.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
| Adelie  | 4675.0      | 3700.7     |
| Adelie  | 4250.0      | 3700.7     |
| Adelie  | 3800.0      | 3700.7     |
• Use common table expression (CTE) to make queries clearer
• Nested subqueries quickly become difficult to understand
• Database decides how to optimize

## Explaining Query Plans

src/explain_query_plan.sql

explain query plan
select
species,
avg(body_mass_g)
from penguins
group by species;

out/explain_query_plan.out

QUERY PLAN
|--SCAN penguins
`--USE TEMP B-TREE FOR GROUP BY
• SQLite plans to scan every row of the table
• It will build a temporary B-tree data structure to group rows

## Practice

Exercise 30: Use a CTE to find the number of penguins that weigh the same as the lightest penguin of the same sex and species.

## 47: Enumerating Rows

• Every table has a special column called rowid

src/rowid.sql

select
rowid,
species,
island
from penguins
limit 5;

out/rowid.out

| rowid | species |  island   |
|-------|---------|-----------|
| 1     | Adelie  | Torgersen |
| 2     | Adelie  | Torgersen |
| 3     | Adelie  | Torgersen |
| 4     | Adelie  | Torgersen |
| 5     | Adelie  | Torgersen |
• rowid is persistent within a session
• I.e., if we delete the first 5 rows we now have row IDs 6…N
• Do not rely on row ID
• In particular, do not use it as a key

## Practice

Exercise 31: To explore how row IDs behave:

1. Suppose that you create a new table, add three rows, delete those rows, and add the same values again. Do you expect the row IDs of the final rows to be 1–3 or 4–6?

2. Using an in-memory database, perform the steps in part 1. Was the result what you expected?

## 48: Conditionals

src/if_else.sql

with sized_penguins as (
select
species,
iif(
body_mass_g < 3500,
'small',
'large'
) as size
from penguins
where body_mass_g is not null
)

select
species,
size,
count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/if_else.out

|  species  | size  | num |
|-----------|-------|-----|
| Adelie    | small | 54  |
| Adelie    | large | 97  |
| Chinstrap | small | 17  |
| Chinstrap | large | 51  |
| Gentoo    | large | 123 |
• iif(condition, true_result, false_result)
• Note: iif with two i’s
• May feel odd to think of if/else as a function, but common in vectorized calculations

## Practice

Exercise 32: How does the result of the previous query change if the check for null body mass is removed? Why is the result without that check misleading?

Exercise 33: What does each of the expressions shown below produce? Which ones do you think actually attempt to divide by zero?

1. iif(0, 123, 1/0)
2. iif(1, 123, 1/0)
3. iif(0, 1/0, 123)
4. iif(1, 1/0, 123)

## 49: Selecting a Case

• What if we want small, medium, and large?
• Can nest iif, but quickly becomes unreadable

src/case_when.sql

with sized_penguins as (
select
species,
case
when body_mass_g < 3500 then 'small'
when body_mass_g < 5000 then 'medium'
else 'large'
end as size
from penguins
where body_mass_g is not null
)

select
species,
size,
count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/case_when.out

|  species  |  size  | num |
|-----------|--------|-----|
| Adelie    | small  | 54  |
| Adelie    | medium | 97  |
| Chinstrap | small  | 17  |
| Chinstrap | medium | 51  |
| Gentoo    | medium | 56  |
| Gentoo    | large  | 67  |
• Evaluate when options in order and take first
• Result of case is null if no condition is true
• Use else as fallback

## Practice

Exercise 34: Modify the query above so that the outputs are "penguin is small" and "penguin is large" by concatenating the string "penguin is " to the entire case rather than to the individual when branches. (This exercise shows that case/when is an expression rather than a statement.)

## 50: Checking a Range

src/check_range.sql

with sized_penguins as (
select
species,
case
when body_mass_g between 3500 and 5000 then 'normal'
else 'abnormal'
end as size
from penguins
where body_mass_g is not null
)

select
species,
size,
count(*) as num
from sized_penguins
group by species, size
order by species, num;

out/check_range.out

|  species  |   size   | num |
|-----------|----------|-----|
| Adelie    | abnormal | 54  |
| Adelie    | normal   | 97  |
| Chinstrap | abnormal | 17  |
| Chinstrap | normal   | 51  |
| Gentoo    | abnormal | 61  |
| Gentoo    | normal   | 62  |
• between can make queries easier to read
• But be careful of the and in the middle

## Practice

Exercise 35: The expression val between 'A' and 'Z' is true if val is 'M' (upper case) but false if val is 'm' (lower case). Rewrite the expression using SQLite’s built-in scalar functions so that it is true in both cases.

| name    | purpose                                                                  |
|---------|--------------------------------------------------------------------------|
| substr  | Get substring given starting point and length                            |
| trim    | Remove characters from beginning and end of string                       |
| ltrim   | Remove characters from beginning of string                               |
| rtrim   | Remove characters from end of string                                     |
| length  | Length of string                                                         |
| replace | Replace occurrences of substring with another string                     |
| upper   | Return upper-case version of string                                      |
| lower   | Return lower-case version of string                                      |
| instr   | Find location of first occurrence of substring (returns 0 if not found)  |
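A few of these functions in action (Python's stdlib sqlite3 module; the literal strings are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
    select
        upper('m'),                -- 'M'
        lower('M'),                -- 'm'
        substr('penguin', 1, 3),   -- 'pen'
        instr('penguin', 'gui'),   -- 4 (positions are 1-based; 0 means not found)
        trim('  gentoo  ')         -- 'gentoo'
""").fetchone()
print(row)
# ('M', 'm', 'pen', 4, 'gentoo')
```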

## Yet Another Database

• Entity-relationship diagram (ER diagram) shows relationships between tables
• Like everything to do with databases, there are lots of variations

src/assay_staff.sql

select * from staff;

out/assay_staff.out

| ident | personal |  family   | dept | age |
|-------|----------|-----------|------|-----|
| 1     | Kartik   | Gupta     |      | 46  |
| 2     | Divit    | Dhaliwal  | hist | 34  |
| 3     | Indrans  | Sridhar   | mb   | 47  |
| 4     | Pranay   | Khanna    | mb   | 51  |
| 5     | Riaan    | Dua       |      | 23  |
| 6     | Vedika   | Rout      | hist | 45  |
| 7     | Abram    | Chokshi   | gen  | 23  |
| 8     | Romil    | Kapoor    | hist | 38  |
| 9     | Ishaan   | Ramaswamy | mb   | 35  |
| 10    | Nitya    | Lal       | gen  | 52  |

## Practice

Exercise 36: Draw a table diagram and an ER diagram to represent the following database:

• person has id and full_name
• course has id and name
• section has course_id, start_date, and end_date
• instructor has person_id and section_id
• student has person_id, section_id, and status

## 51: Pattern Matching

src/like_glob.sql

select
personal,
family
from staff
where personal like '%ya%';

out/like_glob.out

| personal | family |
|----------|--------|
| Nitya    | Lal    |
• like is the original SQL pattern matcher
• % matches zero or more characters wherever it appears in the pattern (and _ matches exactly one)
• Case insensitive by default
• glob supports Unix-style wildcards
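The difference in case handling can be checked directly; a minimal sketch using Python's stdlib sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# like is case-insensitive for ASCII text by default...
like_result = conn.execute("select 'Nitya' like '%YA%'").fetchone()[0]
# ...while glob is case-sensitive and uses * instead of %
glob_result = conn.execute("select 'Nitya' glob '*YA*'").fetchone()[0]
print(like_result, glob_result)  # 1 0
```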

## Practice

Exercise 37: Rewrite the pattern-matching query shown above using glob.

## 52: Selecting First and Last Rows

src/union_all.sql

select * from (
select * from (select * from experiment order by started asc limit 5)
union all
select * from (select * from experiment order by started desc limit 5)
)
order by started asc;

out/union_all.out

| ident |    kind     |  started   |   ended    |
|-------|-------------|------------|------------|
| 17    | trial       | 2023-01-29 | 2023-01-30 |
| 35    | calibration | 2023-01-30 | 2023-01-30 |
| 36    | trial       | 2023-02-02 | 2023-02-03 |
| 25    | trial       | 2023-02-12 | 2023-02-14 |
| 2     | calibration | 2023-02-14 | 2023-02-14 |
| 40    | calibration | 2024-01-21 | 2024-01-21 |
| 12    | trial       | 2024-01-26 | 2024-01-28 |
| 44    | trial       | 2024-01-27 | 2024-01-29 |
| 34    | trial       | 2024-02-01 | 2024-02-02 |
| 14    | calibration | 2024-02-03 | 2024-02-03 |
• union all combines records
• Keeps duplicates: union on its own only keeps unique records
• Which is more work but sometimes more useful
• Yes, it feels like the extra select * from should be unnecessary

## Practice

Exercise 38: Write a query whose result includes two rows for each Adelie penguin in the penguins database. How can you check that your query is working correctly?

## 53: Intersection

src/intersect.sql

select
personal,
family,
dept,
age
from staff
where dept = 'mb'
intersect
select
personal,
family,
dept,
age from staff
where age < 50;

out/intersect.out

| personal |  family   | dept | age |
|----------|-----------|------|-----|
| Indrans  | Sridhar   | mb   | 47  |
| Ishaan   | Ramaswamy | mb   | 35  |
• Rows involved must have the same structure
• Intersection usually used when pulling values from different sources
• In the query above, would be clearer to use where

## Practice

Exercise 39: Use intersect to find all Adelie penguins that weigh more than 4000 grams. How can you check that your query is working correctly?

Exercise 40: Use explain query plan to compare the intersect-based query you just wrote with one that uses where. Which query looks like it will be more efficient? Why do you believe this?

## 54: Exclusion

src/except.sql

select
personal,
family,
dept,
age
from staff
where dept = 'mb'
except
select
personal,
family,
dept,
age from staff
where age < 50;

out/except.out

| personal | family | dept | age |
|----------|--------|------|-----|
| Pranay   | Khanna | mb   | 51  |
• Again, tables must have same structure
• And this would be clearer with where
• SQL operates on sets, not tables, except where it doesn’t

## Practice

Exercise 41: Use except to find all Gentoo penguins that aren’t male. How can you check that your query is working correctly?

## 55: Random Numbers and Why Not

src/random_numbers.sql

with decorated as (
select random() as rand,
personal || ' ' || family as name
from staff
)

select
rand,
abs(rand) % 10 as selector,
name
from decorated
where selector < 5;

out/random_numbers.out

|         rand         | selector |      name      |
|----------------------|----------|----------------|
| 1175447513360584797  | 6        | Divit Dhaliwal |
| -1544776661445066577 | 5        | Pranay Khanna  |
| -9091755230464281578 | 1        | Riaan Dua      |
| -3436533101074958267 | 6        | Romil Kapoor   |
| -1361395956632904964 | 7        | Nitya Lal      |
• There is no way to seed SQLite’s random number generator
• Which means there is no way to reproduce its pseudo-random sequences
• Which means you should never use it
• How are you going to debug something you can’t re-run?

## Practice

Exercise 42: Write a query that:

• uses a CTE to create 1000 random numbers between 0 and 10 inclusive;

• uses a second CTE to calculate their mean; and

• uses a third CTE and SQLite’s built-in math functions to calculate their standard deviation.

## 56: Creating an Index

src/create_use_index.sql

explain query plan
select filename
from plate
where filename like '%07%';

create index plate_file on plate(filename);

explain query plan
select filename
from plate
where filename like '%07%';

out/create_use_index.out

QUERY PLAN
`--SCAN plate USING COVERING INDEX sqlite_autoindex_plate_1
QUERY PLAN
`--SCAN plate USING COVERING INDEX plate_file
• An index is an auxiliary data structure that enables faster access to records
• Spend storage space to buy speed
• Don’t have to mention it explicitly in queries
• Database manager will use it automatically
• Unlike primary keys, SQLite supports defining indexes after the fact

## 57: Generating Sequences

src/generate_sequence.sql

select value from generate_series(1, 5);

out/generate_sequence.out

| value |
|-------|
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |

## 58: Generating Sequences Based on Data

src/data_range_sequence.sql

create table temp (
num integer not null
);
insert into temp values (1), (5);
select value from generate_series (
(select min(num) from temp),
(select max(num) from temp)
);

out/data_range_sequence.out

| value |
|-------|
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |
• Must have the parentheses around the min and max selections to keep SQLite happy

## 59: Generating Sequences of Dates

src/date_sequence.sql

select date((select julianday(min(started)) from experiment) + value) as some_day
from (
select value from generate_series(
(select 0),
(select julianday(max(started)) - julianday(min(started)) from experiment)
)
)
limit 5;

out/date_sequence.out

|  some_day  |
|------------|
| 2023-01-29 |
| 2023-01-30 |
| 2023-01-31 |
| 2023-02-01 |
| 2023-02-02 |
• SQLite represents dates as YYYY-MM-DD strings or as Julian days or as Unix milliseconds or…
• A Julian day is the fractional number of days since noon on November 24, 4714 BCE
• julianday and date convert back and forth
• julianday is specific to SQLite
• Other databases have their own date handling functions
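The conversion can be reproduced outside the database. A sketch using Python's stdlib datetime, where 1721424.5 is the offset between Python's day ordinals (0001-01-01 is day 1 in the proleptic Gregorian calendar) and Julian days:

```python
from datetime import date

def julianday(d: date) -> float:
    # Julian days count from noon, hence the .5 at midnight
    return d.toordinal() + 1721424.5

print(julianday(date(2000, 1, 1)))  # 2451544.5, same as SQLite's julianday('2000-01-01')
print(julianday(date(2000, 1, 2)) - julianday(date(2000, 1, 1)))  # 1.0
```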

## 60: Counting Experiments Started per Day Without Gaps

src/experiments_per_day.sql

with
-- complete sequence of days with 0 as placeholder for number of experiments
all_days as (
select
date((select julianday(min(started)) from experiment) + value) as some_day,
0 as zeroes
from (
select value from generate_series(
(select 0),
(select count(*) - 1 from experiment)
)
)
),

-- sequence of actual days with actual number of experiments started
actual_days as (
select
started,
count(started) as num_exp
from experiment
group by started
)

-- combined by joining on day and taking actual number (if available) or zero
select
all_days.some_day as day,
coalesce(actual_days.num_exp, all_days.zeroes) as num_exp
from
all_days left join actual_days on all_days.some_day = actual_days.started
limit 5;

out/experiments_per_day.out

|    day     | num_exp |
|------------|---------|
| 2023-01-29 | 1       |
| 2023-01-30 | 1       |
| 2023-01-31 | 0       |
| 2023-02-01 | 0       |
| 2023-02-02 | 1       |

## Practice

Exercise 43: What does the expression date('now', 'start of month', '+1 month', '-1 day') produce? (You may find the documentation on SQLite’s date and time functions helpful.)

## 61: Self Join

src/self_join.sql

with person as (
select
ident,
personal || ' ' || family as name
from staff
)

select
left_person.name,
right_person.name
from person as left_person inner join person as right_person
limit 10;

out/self_join.out

|     name     |       name       |
|--------------|------------------|
| Kartik Gupta | Kartik Gupta     |
| Kartik Gupta | Divit Dhaliwal   |
| Kartik Gupta | Indrans Sridhar  |
| Kartik Gupta | Pranay Khanna    |
| Kartik Gupta | Riaan Dua        |
| Kartik Gupta | Vedika Rout      |
| Kartik Gupta | Abram Chokshi    |
| Kartik Gupta | Romil Kapoor     |
| Kartik Gupta | Ishaan Ramaswamy |
| Kartik Gupta | Nitya Lal        |
• Join a table to itself
• Use as to create aliases for copies of tables to distinguish them
• Nothing special about the names left and right
• Get all $n^2$ pairs, including each person with themself

## 62: Generating Unique Pairs

src/unique_pairs.sql

with person as (
select
ident,
personal || ' ' || family as name
from staff
)

select
left_person.name,
right_person.name
from person as left_person inner join person as right_person
on left_person.ident < right_person.ident
where left_person.ident <= 4 and right_person.ident <= 4;

out/unique_pairs.out

|      name       |      name       |
|-----------------|-----------------|
| Kartik Gupta    | Divit Dhaliwal  |
| Kartik Gupta    | Indrans Sridhar |
| Kartik Gupta    | Pranay Khanna   |
| Divit Dhaliwal  | Indrans Sridhar |
| Divit Dhaliwal  | Pranay Khanna   |
| Indrans Sridhar | Pranay Khanna   |
• left_person.ident < right_person.ident ensures distinct pairs without duplicates
• The query uses left_person.ident <= 4 and right_person.ident <= 4 to shorten the output
• Quick check: $n\left(n-1\right)/2$ pairs
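The quick check agrees with what Python's itertools.combinations produces for the same four people:

```python
from itertools import combinations

# the four staff kept by the ident <= 4 filter above
names = ["Kartik Gupta", "Divit Dhaliwal", "Indrans Sridhar", "Pranay Khanna"]
pairs = list(combinations(names, 2))
print(len(pairs))  # 4 * 3 / 2 = 6, matching the six rows of output
```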

## 63: Filtering Pairs

src/filter_pairs.sql

with
person as (
select
ident,
personal || ' ' || family as name
from staff
),

together as (
select
left_perf.staff as left_staff,
right_perf.staff as right_staff
from performed as left_perf inner join performed as right_perf
on left_perf.experiment = right_perf.experiment
where left_staff < right_staff
)

select
left_person.name as person_1,
right_person.name as person_2
from person as left_person inner join person as right_person join together
on left_person.ident = left_staff and right_person.ident = right_staff;

out/filter_pairs.out

|    person_1     |     person_2     |
|-----------------|------------------|
| Kartik Gupta    | Vedika Rout      |
| Pranay Khanna   | Vedika Rout      |
| Indrans Sridhar | Romil Kapoor     |
| Abram Chokshi   | Ishaan Ramaswamy |
| Pranay Khanna   | Vedika Rout      |
| Kartik Gupta    | Abram Chokshi    |
| Abram Chokshi   | Romil Kapoor     |
| Kartik Gupta    | Divit Dhaliwal   |
| Divit Dhaliwal  | Abram Chokshi    |
| Pranay Khanna   | Ishaan Ramaswamy |
| Indrans Sridhar | Romil Kapoor     |
| Kartik Gupta    | Ishaan Ramaswamy |
| Kartik Gupta    | Nitya Lal        |
| Kartik Gupta    | Abram Chokshi    |
| Pranay Khanna   | Romil Kapoor     |

## 64: Existence and Correlated Subqueries

src/correlated_subquery.sql

select
name,
building
from department
where
exists (
select 1
from staff
where dept = department.ident
)
order by name;

out/correlated_subquery.out

|       name        |     building     |
|-------------------|------------------|
| Genetics          | Chesson          |
| Histology         | Fashet Extension |
| Molecular Biology | Chesson          |
• Endocrinology is missing from the list
• select 1 could equally be select true or any other value
• A correlated subquery depends on a value from the outer query
• Equivalent to nested loop
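The nested-loop reading of exists can be sketched in plain Python (the data is invented to mirror the department and staff tables):

```python
departments = [("gen", "Genetics"), ("endo", "Endocrinology")]
staff = [("Abram Chokshi", "gen")]

# 'where exists (...)' acts like an inner loop that stops at the first match
kept = []
for ident, name in departments:
    if any(dept == ident for _, dept in staff):
        kept.append(name)
print(kept)  # ['Genetics']: Endocrinology has no staff, so it is dropped
```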

## 65: Nonexistence

src/nonexistence.sql

select
name,
building
from department
where
not exists (
select 1
from staff
where dept = department.ident
)
order by name;

out/nonexistence.out

|     name      | building |
|---------------|----------|
| Endocrinology | TGVH     |

## Practice

Exercise 44: Can you rewrite the previous query using except? If so, is your new query easier to understand? If the query cannot be rewritten, why not?

## Avoiding Correlated Subqueries

src/avoid_correlated_subqueries.sql

select distinct
department.name as name,
department.building as building
from department inner join staff
on department.ident = staff.dept
order by name;

out/avoid_correlated_subqueries.out

|       name        |     building     |
|-------------------|------------------|
| Genetics          | Chesson          |
| Histology         | Fashet Extension |
| Molecular Biology | Chesson          |
• The join might or might not be faster than the correlated subquery
• Hard to find unstaffed departments without either not exists or count and a check for 0

## 66: Lead and Lag

src/lead_lag.sql

with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)

select
ym,
lag(num) over (order by ym) as prev_num,
num,
lead(num) over (order by ym) as next_num
from ym_num
order by ym;

out/lead_lag.out

|   ym    | prev_num | num | next_num |
|---------|----------|-----|----------|
| 2023-01 |          | 2   | 5        |
| 2023-02 | 2        | 5   | 5        |
| 2023-03 | 5        | 5   | 1        |
| 2023-04 | 5        | 1   | 6        |
| 2023-05 | 1        | 6   | 5        |
| 2023-06 | 6        | 5   | 3        |
| 2023-07 | 5        | 3   | 2        |
| 2023-08 | 3        | 2   | 4        |
| 2023-09 | 2        | 4   | 6        |
| 2023-10 | 4        | 6   | 4        |
| 2023-12 | 6        | 4   | 5        |
| 2024-01 | 4        | 5   | 2        |
| 2024-02 | 5        | 2   |          |
• Use strftime to extract year and month
• Clumsy, but date/time handling is not SQLite’s strong point
• Use window functions lead and lag to shift values
• Unavailable values at the top or bottom are null

## 67: Windowing Functions

src/window_functions.sql

with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)

select
ym,
num,
sum(num) over (order by ym) as num_done,
(sum(num) over (order by ym) * 1.00) / (select sum(num) from ym_num) as completed_progress,
cume_dist() over (order by ym) as linear_progress
from ym_num
order by ym;

out/window_functions.out

|   ym    | num | num_done | completed_progress |  linear_progress   |
|---------|-----|----------|--------------------|--------------------|
| 2023-01 | 2   | 2        | 0.04               | 0.0769230769230769 |
| 2023-02 | 5   | 7        | 0.14               | 0.153846153846154  |
| 2023-03 | 5   | 12       | 0.24               | 0.230769230769231  |
| 2023-04 | 1   | 13       | 0.26               | 0.307692307692308  |
| 2023-05 | 6   | 19       | 0.38               | 0.384615384615385  |
| 2023-06 | 5   | 24       | 0.48               | 0.461538461538462  |
| 2023-07 | 3   | 27       | 0.54               | 0.538461538461538  |
| 2023-08 | 2   | 29       | 0.58               | 0.615384615384615  |
| 2023-09 | 4   | 33       | 0.66               | 0.692307692307692  |
| 2023-10 | 6   | 39       | 0.78               | 0.769230769230769  |
| 2023-12 | 4   | 43       | 0.86               | 0.846153846153846  |
| 2024-01 | 5   | 48       | 0.96               | 0.923076923076923  |
| 2024-02 | 2   | 50       | 1.0                | 1.0                |
• sum() over does a running total
• cume_dist() is fraction of rows seen so far
• So num_done column is number of experiments done…
• completed_progress is the fraction of experiments done…
• …and linear_progress is the fraction of time passed

## Explaining Another Query Plan

src/explain_window_function.sql

explain query plan
with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)
select
ym,
num,
sum(num) over (order by ym) as num_done,
cume_dist() over (order by ym) as progress
from ym_num
order by ym;

out/explain_window_function.out

QUERY PLAN
|--CO-ROUTINE (subquery-3)
|  |--CO-ROUTINE (subquery-4)
|  |  |--CO-ROUTINE ym_num
|  |  |  |--SCAN experiment
|  |  |  `--USE TEMP B-TREE FOR GROUP BY
|  |  |--SCAN ym_num
|  |  `--USE TEMP B-TREE FOR ORDER BY
|  `--SCAN (subquery-4)
`--SCAN (subquery-3)
• Becomes useful…eventually

## 68: Partitioned Windows

src/partition_window.sql

with y_m_num as (
select
strftime('%Y', started) as year,
strftime('%m', started) as month,
count(*) as num
from experiment
group by year, month
)

select
year,
month,
num,
sum(num) over (partition by year order by month) as num_done
from y_m_num
order by year, month;

out/partition_window.out

| year | month | num | num_done |
|------|-------|-----|----------|
| 2023 | 01    | 2   | 2        |
| 2023 | 02    | 5   | 7        |
| 2023 | 03    | 5   | 12       |
| 2023 | 04    | 1   | 13       |
| 2023 | 05    | 6   | 19       |
| 2023 | 06    | 5   | 24       |
| 2023 | 07    | 3   | 27       |
| 2023 | 08    | 2   | 29       |
| 2023 | 09    | 4   | 33       |
| 2023 | 10    | 6   | 39       |
| 2023 | 12    | 4   | 43       |
| 2024 | 01    | 5   | 5        |
| 2024 | 02    | 2   | 7        |
• partition by creates groups
• So this counts experiments started since the beginning of each year

## Practice

Exercise 45: Create a query that:

1. finds the unique weights of the penguins in the penguins database;

2. sorts them;

3. finds the difference between each successive distinct weight; and

4. counts how many times each unique difference appears.

## 69: Blobs

src/blob.sql

create table images (
name text not null,
content blob
);

insert into images (name, content) values
-- readfile() is a SQLite shell function; these file paths are assumptions
('biohazard', readfile('img/biohazard.png')),
('crush', readfile('img/crush.png')),
('fire', readfile('img/fire.png')),
('tripping', readfile('img/tripping.png'));

select
name,
length(content)
from images;

out/blob.out

|    name     | length(content) |
|-------------|-----------------|
| biohazard   | 19629           |
| crush       | 15967           |
| fire        | 18699           |
| tripping    | 17208           |
• A blob is a binary large object
• Bytes in, bytes out…
• If you think that’s odd, check out Fossil

## Practice

Exercise 46: Modify the query shown above to select the value of content rather than its length. How intelligible is the output? Does using SQLite’s hex() function make it any more readable?

## Yet Another Database

src/lab_log_db.sh

sqlite3 db/lab_log.db

src/lab_log_schema.sql

.schema

out/lab_log_schema.out

CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person(
ident            integer primary key autoincrement,
details          text not null
);
CREATE TABLE machine(
ident            integer primary key autoincrement,
name             text not null,
details          text not null
);
CREATE TABLE usage(
ident            integer primary key autoincrement,
log              text not null
);

## 70: Storing JSON

src/json_in_table.sql

select * from machine;

out/json_in_table.out

| ident |      name      |                         details                         |
|-------|----------------|---------------------------------------------------------|
| 1     | WY401          | {"acquired": "2023-05-01"}                              |
| 2     | Inphormex      | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} |
| 3     | AutoPlate 9000 | {"note": "needs software update"}                       |
• Store heterogeneous data as JSON-formatted text (with double-quoted strings)
• Database parses the text each time it is queried, so performance can be an issue
• Can alternatively store as blob (jsonb)
• Can’t view it directly
• But more efficient

## 71: Select Fields from JSON

src/json_field.sql

select
details->'$.acquired' as single_arrow,
details->>'$.acquired' as double_arrow
from machine;

out/json_field.out

| single_arrow | double_arrow |
|--------------|--------------|
| "2023-05-01" | 2023-05-01   |
| "2021-07-15" | 2021-07-15   |
|              |              |
• Single arrow -> returns JSON representation of result
• Double arrow ->> returns SQL text, integer, real, or null
• Left side is column
• Right side is path expression
• Fields separated by .

## Practice

Exercise 47: Write a query that selects the year from the "refurbished" field of the JSON data associated with the Inphormex plate reader.

## 72: JSON Array Access

src/json_array.sql

select
ident,
json_array_length(log->'$') as length,
log->'$[0]' as first
from usage;

out/json_array.out

| ident | length |                            first                             |
|-------|--------|--------------------------------------------------------------|
| 1     | 4      | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 5      | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 3     | 2      | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |
| 4     | 1      | {"machine":"sterilizer","person":["Maude","Goulet"]}         |
| 5     | 2      | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} |
| 6     | 1      | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 7     | 3      | {"machine":"WY401","person":["Maude","Goulet"]}              |
| 8     | 1      | {"machine":"AutoPlate 9000"}                                 |
• SQLite and other database managers have many JSON manipulation functions
• json_array_length gives number of elements in selected array
• Characters outside 7-bit ASCII represented as Unicode escapes
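
A sketch of the same functions from Python, with a made-up two-entry log (json_array_length and json_extract are part of SQLite's JSON support, which ships with Python's sqlite3 module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usage (ident integer, log text)")
conn.execute("""
    insert into usage values
    (1, '[{"machine": "WY401"}, {"machine": "Inphormex"}]')
""")

# Length of the whole array, plus one field of its first element
row = conn.execute("""
    select json_array_length(log), json_extract(log, '$[0].machine')
    from usage
""").fetchone()
print(row)  # (2, 'WY401')
```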

## 73: Unpacking JSON Arrays

src/json_unpack.sql

select
ident,
json_each.key as key,
json_each.value as value
from usage, json_each(usage.log)
limit 10;

out/json_unpack.out

| ident | key |                            value                             |
|-------|-----|--------------------------------------------------------------|
| 1     | 0   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 1   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 2   | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]}       |
| 1     | 3   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 0   | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 2     | 1   | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]}  |
| 2     | 2   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 2     | 3   | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} |
| 2     | 4   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 3     | 0   | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |
• json_each is another table-valued function
• Use json_each.key, json_each.value, etc. to access the properties of the unpacked array
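
A minimal sketch of json_each from Python, using an invented three-element array: the cross join with the table-valued function produces one output row per array element.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usage (ident integer, log text)")
conn.execute("""insert into usage values (1, '["a", "b", "c"]')""")

# One row per element: key is the array index, value is the element itself
rows = conn.execute("""
    select usage.ident, json_each.key, json_each.value
    from usage, json_each(usage.log)
""").fetchall()
print(rows)  # [(1, 0, 'a'), (1, 1, 'b'), (1, 2, 'c')]
```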

## Practice

Exercise 48: Write a query that counts how many times each person appears in the first log entry associated with any piece of equipment.

## 74: Selecting the Last Element of an Array

src/json_array_last.sql

select
ident,
log->'$[#-1].machine' as final
from usage
limit 5;

out/json_array_last.out

| ident |    final     |
|-------|--------------|
| 1     | "Inphormex"  |
| 2     | "sterilizer" |
| 3     | "Inphormex"  |
| 4     | "sterilizer" |
| 5     | "sterilizer" |
• # in a path expression is the number of elements in the array, so $[#-1] selects the last element

## 75: Modifying JSON

src/json_modify.sql

select
ident,
name,
json_set(details, '$.sold', json_quote('2024-01-25')) as updated
from machine;

out/json_modify.out

| ident |      name      |                                 updated                                  |
|-------|----------------|--------------------------------------------------------------------------|
| 1     | WY401          | {"acquired":"2023-05-01","sold":"2024-01-25"}                            |
| 2     | Inphormex      | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":"2024-01-25"} |
| 3     | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"}                     |
• Updates the in-memory copy of the JSON, not the database record
• Please use json_quote rather than trying to format JSON with string operations
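
Since json_set returns a new string rather than modifying the stored record, it can be tried without a table at all. A sketch (the JSON literal is the WY401 example from above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# json_set returns a fresh minified JSON string; nothing in the database changes
row = conn.execute("""
    select json_set('{"acquired": "2023-05-01"}',
                    '$.sold', json_quote('2024-01-25'))
""").fetchone()
print(row[0])  # {"acquired":"2023-05-01","sold":"2024-01-25"}
```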

## Practice

Exercise 49: As part of cleaning up the lab log database, replace the machine names in the JSON records in usage with the corresponding machine IDs from the machine table.

## Refreshing the Penguins Database

src/count_penguins.sql

select
species,
count(*) as num
from penguins
group by species;

out/count_penguins.out

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |
• We will restore the full database after each example

## 76: Tombstones

src/make_active.sql

alter table penguins
add active integer not null default 1;

update penguins
set active = iif(species = 'Adelie', 0, 1);

src/active_penguins.sql

select
species,
count(*) as num
from penguins
where active
group by species;

out/active_penguins.out

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |
• Use a tombstone to mark (in)active records
• Every query must now include it
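
The same pattern from Python, with a two-row stand-in for the penguins table (iif needs SQLite 3.32 or later; a case expression works in older versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table penguins (species text);
    insert into penguins values ('Adelie'), ('Gentoo');
    alter table penguins add active integer not null default 1;
    update penguins set active = iif(species = 'Adelie', 0, 1);
""")

# Every query must now remember to filter on the tombstone column
rows = conn.execute("select species from penguins where active").fetchall()
print(rows)  # [('Gentoo',)]
```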

## Importing CSV Data

• SQLite and most other database managers have tools for importing and exporting CSV
• In SQLite:
• Define table
• Import data
• Convert empty strings to nulls (if desired)
• Convert types from text to whatever (not shown below)

src/create_penguins.sql

drop table if exists penguins;
.mode csv penguins
.import misc/penguins.csv penguins
update penguins set species = null where species = '';
update penguins set island = null where island = '';
update penguins set bill_length_mm = null where bill_length_mm = '';
update penguins set bill_depth_mm = null where bill_depth_mm = '';
update penguins set flipper_length_mm = null where flipper_length_mm = '';
update penguins set body_mass_g = null where body_mass_g = '';
update penguins set sex = null where sex = '';
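
.mode and .import are commands of the sqlite3 shell, not SQL. From a program you perform the same steps yourself; this sketch uses Python's csv module and an invented three-column sample in place of misc/penguins.csv, converting empty strings to nulls on the way in:

```python
import csv
import io
import sqlite3

# Stand-in for misc/penguins.csv; the second row has a missing body mass
data = io.StringIO(
    "species,island,body_mass_g\n"
    "Adelie,Torgersen,3750\n"
    "Adelie,Torgersen,\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("create table penguins (species text, island text, body_mass_g real)")

reader = csv.reader(data)
next(reader)  # skip the header row
rows = [[cell if cell != "" else None for cell in row] for row in reader]
conn.executemany("insert into penguins values (?, ?, ?)", rows)

# count(*) counts rows; count(column) counts non-null values
counts = conn.execute("select count(*), count(body_mass_g) from penguins").fetchone()
print(counts)  # (2, 1)
```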

## Practice

Exercise 50: What are the data types of the columns in the penguins table created by the CSV import shown above? How can you correct the ones that need correcting?

## 77: Views

src/views.sql

create view if not exists
active_penguins (
species,
island,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g,
sex
) as
select
species,
island,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g,
sex
from penguins
where active;

select
species,
count(*) as num
from active_penguins
group by species;

out/views.out

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |
• A view is a saved query that other queries can invoke
• View is re-run each time it’s used
• Like a CTE, but:
• Can be shared between queries
• Views came first
• Some databases offer materialized views
• Update-on-demand temporary tables
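
A sketch of the re-run-each-time behavior from Python, using a cut-down penguins table invented for the example: a row added after the view is created still shows up when the view is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table penguins (species text, active integer);
    insert into penguins values ('Adelie', 0), ('Gentoo', 1), ('Gentoo', 1);
    create view active_penguins as
        select species from penguins where active;
""")

# The view is a saved query, so this new row appears in it automatically
conn.execute("insert into penguins values ('Chinstrap', 1)")
rows = conn.execute("""
    select species, count(*) from active_penguins
    group by species order by species
""").fetchall()
print(rows)  # [('Chinstrap', 1), ('Gentoo', 2)]
```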

## Practice

Exercise 51: Create a view in the lab log database called busy with two columns: machine_id and total_log_length. The first column records the numeric ID of each machine; the second shows the total number of log entries for that machine.

## Hours Reminder

src/all_jobs.sql

create table job (
name text not null,
billable real not null
);
insert into job values
('calibrate', 1.5),
('clean', 0.5);
select * from job;

out/all_jobs.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
| clean     | 0.5      |

## 78: Adding Checks

src/all_jobs_check.sql

create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);
insert into job values ('calibrate', 1.5);
insert into job values ('reset', -0.5);
select * from job;

out/all_jobs_check.out

Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
• check adds constraint to table
• Must produce a Boolean result
• Run each time values added or modified
• But changes made before the error have taken effect
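
From Python, a failed check surfaces as sqlite3.IntegrityError; a sketch showing that rows inserted before the failure are still there:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table job (
        name text not null,
        billable real not null,
        check (billable > 0.0)
    )
""")
conn.execute("insert into job values ('calibrate', 1.5)")

# The check runs on every insert or update; a violation raises IntegrityError
try:
    conn.execute("insert into job values ('reset', -0.5)")
except sqlite3.IntegrityError as exc:
    print(exc)  # CHECK constraint failed: ...

count = conn.execute("select count(*) from job").fetchone()[0]
print(count)  # 1: the earlier insert still took effect
```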

## Practice

Exercise 52: Rewrite the definition of the penguins table to add the following constraints:

1. body_mass_g must be null or non-negative.

2. island must be one of "Biscoe", "Dream", or "Torgersen". (Hint: the in operator will be useful here.)

## ACID

• Atomic: change cannot be broken down into smaller ones (i.e., all or nothing)
• Consistent: database goes from one consistent state to another
• Isolated: looks like changes happened one after another
• Durable: if change takes place, it’s still there after a restart

## 79: Transactions

src/transaction.sql

create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);

insert into job values ('calibrate', 1.5);

begin transaction;
insert into job values ('clean', 0.5);
rollback;

select * from job;

out/transaction.out

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
• Statements outside transaction execute and are committed immediately
• Statement(s) inside transaction don’t take effect until:
• end transaction (success)
• rollback (undo)
• Can have any number of statements inside a transaction
• But cannot nest transactions in SQLite
• Other databases support this
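
Python's sqlite3 module normally opens transactions behind the scenes, so this sketch passes isolation_level=None to make the begin/rollback explicit and visible:

```python
import sqlite3

# isolation_level=None disables implicit transactions, so SQLite sees
# exactly the begin/rollback written below
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table job (name text not null, billable real not null)")

conn.execute("insert into job values ('calibrate', 1.5)")  # committed immediately

conn.execute("begin transaction")
conn.execute("insert into job values ('clean', 0.5)")
conn.execute("rollback")  # undoes only the insert inside the transaction

rows = conn.execute("select name from job").fetchall()
print(rows)  # [('calibrate',)]
```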

## 80: Rollback in Constraints

src/rollback_constraint.sql

create table job (
name text not null,
billable real not null,
check (billable > 0.0) on conflict rollback
);

insert into job values
('calibrate', 1.5);
insert into job values
('clean', 0.5),
('reset', -0.5);

select * from job;

out/rollback_constraint.out

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
• All of second insert rolled back as soon as error occurred
• But first insert took effect

## 81: Rollback in Statements

src/rollback_statement.sql

create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);

insert or rollback into job values
('calibrate', 1.5);
insert or rollback into job values
('clean', 0.5),
('reset', -0.5);

select * from job;

out/rollback_statement.out

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
• Constraint is in table definition
• Action is in statement

## 82: Upsert

src/upsert.sql

create table jobs_done (
person text unique,
num integer default 0
);

insert into jobs_done values
('zia', 1);
.print 'after first'
select * from jobs_done;
.print

insert into jobs_done values
('zia', 1);
.print 'after failed'
select * from jobs_done;

insert into jobs_done values
('zia', 1)
on conflict(person) do update set num = num + 1;
.print '\nafter upsert'
select * from jobs_done;

out/upsert.out

after first
| person | num |
|--------|-----|
| zia    | 1   |

Runtime error near line 15: UNIQUE constraint failed: jobs_done.person (19)
after failed
| person | num |
|--------|-----|
| zia    | 1   |
\nafter upsert
| person | num |
|--------|-----|
| zia    | 2   |
• upsert stands for “update or insert”
• Create if record doesn’t exist
• Update if it does
• Not standard SQL but widely implemented
• Example also shows use of SQLite .print command
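
SQLite's upsert clause (3.24 and later) from Python; a parameterized sketch of the same jobs_done example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs_done (person text unique, num integer default 0)")

upsert = """
    insert into jobs_done values (?, 1)
    on conflict(person) do update set num = num + 1
"""
conn.execute(upsert, ("zia",))  # no conflict: inserts ('zia', 1)
conn.execute(upsert, ("zia",))  # conflict: increments instead of failing

num = conn.execute("select num from jobs_done where person = 'zia'").fetchone()[0]
print(num)  # 2
```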

## Practice

Exercise 53: Using the assay database, write a query that adds or modifies people in the staff table as shown:

| personal | family | dept | age |
|----------|--------|------|-----|
| Pranay   | Khanna | mb   | 41  |
| Riaan    | Dua    | gen  | 23  |
| Parth    | Johel  | gen  | 27  |

## Normalization

• First normal form (1NF): every field of every record contains one indivisible value.

• Second normal form (2NF) and third normal form (3NF): every value in a record that isn’t a key depends solely on the key, not on other values.

• Denormalization: explicitly store values that could be calculated on the fly

• To simplify queries and/or make processing faster

## 83: Creating Triggers

src/trigger_setup.sql

-- Track hours of lab work.
create table job (
person text not null,
reported real not null check (reported >= 0.0)
);

-- Explicitly store per-person total rather than using sum().
create table total (
person text unique not null,
hours real
);

-- Initialize totals.
insert into total values
('gene', 0.0),
('august', 0.0);

-- Define a trigger.
create trigger total_trigger
before insert on job
begin
-- Check that the person exists.
select case
when not exists (select 1 from total where person = new.person)
then raise(rollback, 'Unknown person ')
end;
-- Update their total hours (or fail if non-negative constraint violated).
update total
set hours = hours + new.reported
where total.person = new.person;
end;
• A trigger automatically runs before or after a specified operation
• Can have side effects (e.g., update some other table)
• And/or implement checks (e.g., make sure other records exist)
• …but data is either cheap or correct, never both
• Inside trigger, refer to old and new versions of record as old.column and new.column
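
A stripped-down sketch of the running-total trigger from Python (the person-existence check is omitted here to keep the example short):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table job (person text not null, reported real not null);
    create table total (person text unique not null, hours real);
    insert into total values ('gene', 0.0);

    -- Keep the denormalized total in sync on every insert
    create trigger total_trigger
    before insert on job
    begin
        update total
        set hours = hours + new.reported
        where total.person = new.person;
    end;
""")

conn.execute("insert into job values ('gene', 1.5)")
conn.execute("insert into job values ('gene', 1.0)")
hours = conn.execute("select hours from total where person = 'gene'").fetchone()[0]
print(hours)  # 2.5
```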

## 84: Trigger Not Firing

src/trigger_successful.sql

insert into job values
('gene', 1.5),
('august', 0.5),
('gene', 1.0);

out/trigger_successful.out

| person | reported |
|--------|----------|
| gene   | 1.5      |
| august | 0.5      |
| gene   | 1.0      |

| person | hours |
|--------|-------|
| gene   | 2.5   |
| august | 0.5   |

## 85: Trigger Firing

src/trigger_firing.sql

insert into job values
('gene', 1.0),
('august', -1.0);

out/trigger_firing.out

Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19)

| person | hours |
|--------|-------|
| gene   | 0.0   |
| august | 0.0   |

## Practice

Exercise 54: Using the penguins database:

1. create a table called species with columns name and count; and

2. define a trigger that increments the count associated with each species each time a new penguin is added to the penguins table.

Does your solution behave correctly when several penguins are added by a single insert statement?

## Representing Graphs

src/lineage_setup.sql

create table lineage (
parent text not null,
child text not null
);
insert into lineage values
('Arturo', 'Clemente'),
('Darío', 'Clemente'),
('Clemente', 'Homero'),
('Clemente', 'Ivonne'),
('Ivonne', 'Lourdes'),
('Lourdes', 'Santiago');

src/represent_graph.sql

select * from lineage;

out/represent_graph.out

|  parent  |  child   |
|----------|----------|
| Arturo   | Clemente |
| Darío    | Clemente |
| Clemente | Homero   |
| Clemente | Ivonne   |
| Ivonne   | Lourdes  |
| Lourdes  | Santiago |

## Practice

Exercise 55: Write a query that uses a self join to find every person’s grandchildren.

## 86: Recursive Queries

src/recursive_lineage.sql

with recursive descendent as (
select
'Clemente' as person,
0 as generations
union all
select
lineage.child as person,
descendent.generations + 1 as generations
from descendent inner join lineage
on descendent.person = lineage.parent
)

select
person,
generations
from descendent;

out/recursive_lineage.out

|  person  | generations |
|----------|-------------|
| Clemente | 0           |
| Homero   | 1           |
| Ivonne   | 1           |
| Lourdes  | 2           |
| Santiago | 3           |
• Use a recursive CTE to create a temporary table (descendent)
• Base case seeds this table
• Recursive case relies on value(s) already in that table and external table(s)
• union all to combine rows
• Can use union but that has lower performance (must check uniqueness each time)
• Stops when the recursive case yields an empty row set (nothing new to add)
• Then select the desired values from the CTE
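
The same recursive CTE run from Python against a shortened, invented lineage (recursive CTEs work in SQLite 3.8.3 and later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lineage (parent text not null, child text not null);
    insert into lineage values
        ('Clemente', 'Ivonne'),
        ('Ivonne', 'Lourdes'),
        ('Lourdes', 'Santiago');
""")

rows = conn.execute("""
    with recursive descendent as (
        select 'Clemente' as person, 0 as generations     -- base case
        union all
        select lineage.child, descendent.generations + 1  -- recursive case
        from descendent inner join lineage
        on descendent.person = lineage.parent
    )
    select person, generations from descendent
    order by generations
""").fetchall()
print(rows)  # [('Clemente', 0), ('Ivonne', 1), ('Lourdes', 2), ('Santiago', 3)]
```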

## Practice

Exercise 56: Modify the recursive query shown above to use union instead of union all. Does this affect the result? Why or why not?

## Contact Tracing Database

src/contact_person.sql

select * from person;

out/contact_person.out

| ident |         name          |
|-------|-----------------------|
| 1     | Juana Baeza           |
| 2     | Agustín Rodríquez     |
| 3     | Ariadna Caraballo     |
| 4     | Micaela Laboy         |
| 5     | Verónica Altamirano   |
| 6     | Reina Rivero          |
| 7     | Elias Merino          |
| 8     | Minerva Guerrero      |
| 9     | Mauro Balderas        |
| 10    | Pilar Alarcón         |
| 11    | Daniela Menéndez      |
| 12    | Marco Antonio Barrera |
| 13    | Cristal Soliz         |
| 14    | Bernardo Narváez      |
| 15    | Óscar Barrios         |

src/contact_contacts.sql

select * from contact;

out/contact_contacts.out

|       left        |         right         |
|-------------------|-----------------------|
| Agustín Rodríquez | Ariadna Caraballo     |
| Agustín Rodríquez | Verónica Altamirano   |
| Juana Baeza       | Verónica Altamirano   |
| Juana Baeza       | Micaela Laboy         |
| Pilar Alarcón     | Reina Rivero          |
| Cristal Soliz     | Marco Antonio Barrera |
| Cristal Soliz     | Daniela Menéndez      |
| Daniela Menéndez  | Marco Antonio Barrera |

## 87: Bidirectional Contacts

src/bidirectional.sql

create temporary table bi_contact (
left text,
right text
);

insert into bi_contact
select
left, right from contact
union all
select right, left from contact
;

out/bidirectional.out

| original_count |
|----------------|
| 8              |

| num_contact |
|-------------|
| 16          |
• Create a temporary table rather than using a long chain of CTEs
• Only lasts as long as the session (not saved to disk)
• Duplicate information rather than writing more complicated query
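
A sketch of the doubling step from Python with a single invented contact; the column names are quoted here because left and right are SQL keywords, though SQLite also accepts them bare as in the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table contact ("left" text, "right" text);
    insert into contact values ('Juana Baeza', 'Micaela Laboy');

    -- Store each contact in both directions to simplify later queries
    create temporary table bi_contact ("left" text, "right" text);
    insert into bi_contact
        select "left", "right" from contact
        union all
        select "right", "left" from contact;
""")

counts = conn.execute(
    "select (select count(*) from contact), (select count(*) from bi_contact)"
).fetchone()
print(counts)  # (1, 2)
```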

## 88: Updating Group Identifiers

src/update_group_ids.sql

select
left.name as left_name,
left.ident as left_ident,
right.name as right_name,
right.ident as right_ident,
min(left.ident, right.ident) as new_ident
from
(person as left join bi_contact on left.name = bi_contact.left)
join person as right on bi_contact.right = right.name;

out/update_group_ids.out

|       left_name       | left_ident |      right_name       | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza           | 1          | Micaela Laboy         | 4           | 1         |
| Juana Baeza           | 1          | Verónica Altamirano   | 5           | 1         |
| Agustín Rodríquez     | 2          | Ariadna Caraballo     | 3           | 2         |
| Agustín Rodríquez     | 2          | Verónica Altamirano   | 5           | 2         |
| Ariadna Caraballo     | 3          | Agustín Rodríquez     | 2           | 2         |
| Micaela Laboy         | 4          | Juana Baeza           | 1           | 1         |
| Verónica Altamirano   | 5          | Agustín Rodríquez     | 2           | 2         |
| Verónica Altamirano   | 5          | Juana Baeza           | 1           | 1         |
| Reina Rivero          | 6          | Pilar Alarcón         | 10          | 6         |
| Pilar Alarcón         | 10         | Reina Rivero          | 6           | 6         |
| Daniela Menéndez      | 11         | Cristal Soliz         | 13          | 11        |
| Daniela Menéndez      | 11         | Marco Antonio Barrera | 12          | 11        |
| Marco Antonio Barrera | 12         | Cristal Soliz         | 13          | 12        |
| Marco Antonio Barrera | 12         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Marco Antonio Barrera | 12          | 12        |
• new_ident is the minimum of a person's own identifier and the identifiers one step away (with two arguments, min is a scalar function, not an aggregation)
• Doesn’t keep people with no contacts

## 89: Recursive Labeling

src/recursive_labeling.sql

with recursive labeled as (
select
person.name as name,
person.ident as label
from
person
union -- not 'union all'
select
person.name as name,
labeled.label as label
from
(person join bi_contact on person.name = bi_contact.left)
join labeled on bi_contact.right = labeled.name
where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;

out/recursive_labeling.out

|         name          | group_id |
|-----------------------|----------|
| Agustín Rodríquez     | 1        |
| Ariadna Caraballo     | 1        |
| Juana Baeza           | 1        |
| Micaela Laboy         | 1        |
| Verónica Altamirano   | 1        |
| Pilar Alarcón         | 6        |
| Reina Rivero          | 6        |
| Elias Merino          | 7        |
| Minerva Guerrero      | 8        |
| Mauro Balderas        | 9        |
| Cristal Soliz         | 11       |
| Daniela Menéndez      | 11       |
| Marco Antonio Barrera | 11       |
| Bernardo Narváez      | 14       |
| Óscar Barrios         | 15       |

## Practice

Exercise 57: Modify the query above to use union all instead of union to trigger infinite recursion. How can you modify the query so that it stops at a certain depth so that you can trace its output?

## Appendices

### Glossary

1-to-1 relation
A relationship between two tables in which each record from the first table matches exactly one record from the second and vice versa.
1-to-many relation
A relationship between two tables in which each record from the first table matches zero or more records from the second, but each record from the second table matches exactly one record from the first.
admin command
A command for managing a database that isn’t part of the SQL standard. Each RDBMS has its own idiosyncratic admin commands.
aggregation
Combining several values to produce one.
aggregation function
A function used to produce one value from many, such as maximum or addition.
alias
An alternate name used temporarily for a table or column.
atomic
An operation that cannot be broken into smaller operations.
autoincrement
Automatically add one to a value.
B-tree
A self-balancing tree data structure that allows search, insertion, and deletion in logarithmic time.
base case
A starting point for recursion that does not depend on previous recursive calculations.
Binary Large Object (blob)
Bytes that are handled as-is rather than being interpreted as numbers, text, or other data types.
comma-separated values (CSV)
A text format for tabular data that uses commas to separate columns.
common table expression (CTE)
A temporary table created at the start of a query, usually to simplify writing the query.
consistent
A state in which all constraints are satisfied, e.g., all columns contain allowed values and all foreign keys refer to primary keys.
correlated subquery
A subquery that depends on a value or values from the enclosing query, and which must therefore be executed once for each of those values.
cross join
A join that creates the cross-product of rows from two tables.
cursor
A reference to the current location in the results of an ongoing query.
data migration
To move data from one form to another, e.g., from one set of tables to a new set or from one DBMS to another.
database
A collection of data that can be searched and retrieved.
database management system (DBMS)
A program that manages a particular kind of database.
denormalization
To deliberately introduce duplication or otherwise violate normal forms, typically to improve query performance.
durable
Guaranteed to survive shutdown and restart.
entity-relationship diagram
A graphical depiction of the relationships between tables in a database.
exclusive or
A Boolean operation that is true if either but not both of its conditions are true. SQL does not provide an exclusive or operator, but the same result can be achieved using operators it has.
expression
A part of a program that produces a value, such as 1+2.
filter
To select records based on whether they pass some Boolean test.
foreign key
A value in one table that identifies a primary key in another table.
full outer join
A join that produces the union of a left outer join and a right outer join.
group
A set of records that share a common property, such as having the same value in a particular column.
in-memory database
A database that is stored in memory rather than on disk.
inclusive or
A Boolean operator that is true if either or both of its conditions are true. SQL’s or is inclusive.
index
A lookup structure used to speed up searches for particular records in a database.
infinite recursion
See “infinite recursion”.
isolated
The appearance of having executed in an otherwise-idle system.
JavaScript Object Notation (JSON)
A text format for representing numbers, strings, lists, and key-value maps.
join
To combine records from two tables.
join condition
The criteria used to decide which rows from each table in a join are combined.
join table
A table that exists solely to enable information from two tables to be connected.
left outer join
A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.
many-to-many relation
A relationship between two tables in which each record from the first table may match zero or more records from the second and vice versa.
materialized view
A view that is stored on disk and updated on demand.
normal form
One of several (loosely defined) rules for organizing data in tables.
NoSQL database
Any database that doesn’t use the relational model.
null
A special value representing “not known”.
object-relational mapper (ORM)
A library that translates objects in a program into database queries and the results of those queries back into objects.
path expression
An expression identifying an element or a set of elements in a JSON structure.
primary key
A value or values in a database table that uniquely identifies each record in that table.
query
A command to perform some operation in a database (typically data retrieval).
recursive case
The second or subsequent step in self-referential accumulation of data.
recursive CTE
A common table expression that refers to itself. Every recursive CTE must have a base case and a recursive case.
relational database management system (RDBMS)
A database management system that stores data in tables with columns and rows.
right outer join
A join that is guaranteed to keep all rows from the second (right) table. Columns from the left table are filled with actual values if available or with null otherwise. SQLite does not implement right outer join since its behavior can be reproduced by swapping the order of the tables and using a left outer join.
statement
A part of a program that doesn’t produce a value.
subquery
A query used within another query.
table
A collection of related data in a database stored in columns and rows.
table-valued function
A function that returns multiple values rather than a single value.
temporary table
A table that is explicitly constructed in memory outside any particular query.
ternary logic
A logic based on three values: true, false, and “don’t know” (represented as null).
tombstone
A marker value added to a record to show that it is no longer active. Tombstones are used as an alternative to deleting data.
trigger
An action that runs automatically when something happens in a database, typically insertion or deletion.
Uniform Resource Identifier (URI)
A string that identifies a resource (such as a web page or database) and the protocol used to access it.
upsert
To update a record if it exists or insert (create) a new record if it doesn’t.
vectorization
Performing the same operation on a stream of values rather than using a loop to operate on one value at a time.
view
A rearrangement of data in a database that is regenerated on demand.
window function
A function that combines data from adjacent rows in a database query’s result.

### Acknowledgments

This tutorial would not have been possible without:

I would also like to thank the following for spotting issues, making suggestions, or submitting changes:

• Yanina Bellini Saibene
• Phillip Cloud
• Zoe Daniels
• Conor Flynn
• Andy Goldberg
• Jay Graves
• Sam Hames