Tools
Terms defined: B-tree, alias, correlated subquery, data migration, entity-relationship diagram, index, table-valued function, window function
Altering Tables
alter table job
add ident integer not null default -1;
update job
set ident = 1
where name = 'calibrate';
update job
set ident = 2
where name = 'clean';
select * from job;
| name | billable | ident |
|-----------|----------|-------|
| calibrate | 1.5 | 1 |
| clean | 0.5 | 2 |
- Add a column after the fact
- Since it can't be null, we have to provide a default value
- Really want to make it the primary key, but SQLite doesn't allow that after the fact
- Then use
updateto modify existing records- Can modify any number of records at once
- So be careful about
whereclause
- An example of data migration
Creating New Tables from Old
create table new_work (
person_id integer not null,
job_id integer not null,
foreign key (person_id) references person (ident),
foreign key (job_id) references job (ident)
);
insert into new_work
select
person.ident as person_id,
job.ident as job_id
from
(person inner join work on person.name = work.person)
inner join job on job.name = work.job;
select * from new_work;
| person_id | job_id |
|-----------|--------|
| 1 | 1 |
| 1 | 2 |
| 2 | 2 |
new_workis our join table- Each column refers to a record in some other table
Removing Tables
drop table work;
alter table new_work rename to work;
CREATE TABLE job (
ident integer primary key autoincrement,
name text not null,
billable real not null
);
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person (
ident integer primary key autoincrement,
name text not null
);
CREATE TABLE IF NOT EXISTS "work" (
person_id integer not null,
job_id integer not null,
foreign key(person_id) references person(ident),
foreign key(job_id) references job(ident)
);
- Remove the old table and rename the new one to take its place
- Note
if exists
- Note
- Please back up your data first
Exercise
-
Reorganize the penguins database:
- Make a copy of the
penguins.dbfile so that your changes won't affect the original. - Write a SQL script that reorganizes the data into three tables: one for each island.
- Why is organizing data like this a bad idea?
- Make a copy of the
-
Tools like Sqitch can manage changes to database schemas and data so that they can be saved in version control and rolled back if they are unsuccessful. Translate the changes made by the scripts above into Sqitch. Note: this exercise may take an hour or more.
Explaining Query Plans
explain query plan
select
species,
avg(body_mass_g)
from penguins
group by species;
QUERY PLAN
|--SCAN penguins
`--USE TEMP B-TREE FOR GROUP BY
- SQLite plans to scan every row of the table
- It will build a temporary B-tree data structure to group rows
Exercise
Use a CTE to find the number of penguins that weigh the same as the lightest penguin of the same sex and species.
Enumerating Rows
- Every table has a special column called
rowid
select
rowid,
species,
island
from penguins
limit 5;
| rowid | species | island |
|-------|---------|-----------|
| 1 | Adelie | Torgersen |
| 2 | Adelie | Torgersen |
| 3 | Adelie | Torgersen |
| 4 | Adelie | Torgersen |
| 5 | Adelie | Torgersen |
rowidis persistent within a session- I.e., if we delete the first 5 rows we now have row IDs 6…N
- Do not rely on row ID
- In particular, do not use it as a key
Exercise
To explore how row IDs behave:
-
Suppose that you create a new table, add three rows, delete those rows, and add the same values again. Do you expect the row IDs of the final rows to be 1–3 or 4–6?
-
Using an in-memory database, perform the steps in part 1. Was the result what you expected?
Yet Another Database
- Entity-relationship diagram (ER diagram) shows relationships between tables
- Like everything to do with databases, there are lots of variations
select * from staff;
| ident | personal | family | dept | age |
|-------|----------|-----------|------|-----|
| 1 | Kartik | Gupta | | 46 |
| 2 | Divit | Dhaliwal | hist | 34 |
| 3 | Indrans | Sridhar | mb | 47 |
| 4 | Pranay | Khanna | mb | 51 |
| 5 | Riaan | Dua | | 23 |
| 6 | Vedika | Rout | hist | 45 |
| 7 | Abram | Chokshi | gen | 23 |
| 8 | Romil | Kapoor | hist | 38 |
| 9 | Ishaan | Ramaswamy | mb | 35 |
| 10 | Nitya | Lal | gen | 52 |
Exercise
Draw a table diagram and an ER diagram to represent the following database:
personhasidandfull_namecoursehasidandnamesectionhascourse_id,start_date, andend_dateinstructorhasperson_idandsection_idstudenthasperson_id,section_id, andstatus
Pattern Matching
select
personal,
family
from staff
where personal like '%ya%';
| personal | family |
|----------|--------|
| Nitya | Lal |
likeis the original SQL pattern matcher%matches zero or more characters at the start or end of a string- Case insensitive by default
globsupports Unix-style wildcards
Exercise
Rewrite the pattern-matching query shown above using glob.
Selecting First and Last Rows
select * from (
select * from (select * from experiment order by started asc limit 5)
union all
select * from (select * from experiment order by started desc limit 5)
)
order by started asc;
| ident | kind | started | ended |
|-------|-------------|------------|------------|
| 17 | trial | 2023-01-29 | 2023-01-30 |
| 35 | calibration | 2023-01-30 | 2023-01-30 |
| 36 | trial | 2023-02-02 | 2023-02-03 |
| 25 | trial | 2023-02-12 | 2023-02-14 |
| 2 | calibration | 2023-02-14 | 2023-02-14 |
| 40 | calibration | 2024-01-21 | 2024-01-21 |
| 12 | trial | 2024-01-26 | 2024-01-28 |
| 44 | trial | 2024-01-27 | 2024-01-29 |
| 34 | trial | 2024-02-01 | 2024-02-02 |
| 14 | calibration | 2024-02-03 | 2024-02-03 |
union allcombines records- Keeps duplicates:
unionon its own only keeps unique records - Which is more work but sometimes more useful
- Keeps duplicates:
- Yes, it feels like the extra
select * fromshould be unnecessary
Exercise
Write a query whose result includes two rows for each Adelie penguin
in the penguins database.
How can you check that your query is working correctly?
Intersection
select
personal,
family,
dept,
age
from staff
where dept = 'mb'
intersect
select
personal,
family,
dept,
age from staff
where age < 50;
| personal | family | dept | age |
|----------|-----------|------|-----|
| Indrans | Sridhar | mb | 47 |
| Ishaan | Ramaswamy | mb | 35 |
- Rows involved must have the same structure
- Intersection usually used when pulling values from different sources
- In the query above, would be clearer to use
where
- In the query above, would be clearer to use
Exercise
Use intersect to find all Adelie penguins that weigh more than 4000 grams.
How can you check that your query is working correctly?
Use explain query plan to compare the intersect-based query you just wrote
with one that uses where.
Which query looks like it will be more efficient?
Why do you believe this?
Exclusion
select
personal,
family,
dept,
age
from staff
where dept = 'mb'
except
select
personal,
family,
dept,
age from staff
where age < 50;
| personal | family | dept | age |
|----------|--------|------|-----|
| Pranay | Khanna | mb | 51 |
- Again, tables must have same structure
- And this would be clearer with
where
- And this would be clearer with
- SQL operates on sets, not tables, except where it doesn't
Exercise
Use exclude to find all Gentoo penguins that aren't male.
How can you check that your query is working correctly?
Random Numbers and Why Not
with decorated as (
select random() as rand,
personal || ' ' || family as name
from staff
)
select
rand,
abs(rand) % 10 as selector,
name
from decorated
where selector < 5;
| rand | selector | name |
|----------------------|----------|-----------------|
| -5088363674211922423 | 0 | Divit Dhaliwal |
| 6557666280550701355 | 1 | Indrans Sridhar |
| -2149788664940846734 | 3 | Pranay Khanna |
| -3941247926715736890 | 8 | Riaan Dua |
| -3101076015498625604 | 5 | Vedika Rout |
| -7884339441528700576 | 4 | Abram Chokshi |
| -2718521057113461678 | 4 | Romil Kapoor |
- There is no way to seed SQLite's random number generator
- Which means there is no way to reproduce its pseudo-random sequences
- Which means you should never use it
- How are you going to debug something you can't re-run?
Exercise
Write a query that:
-
uses a CTE to create 1000 random numbers between 0 and 10 inclusive;
-
uses a second CTE to calculate their mean; and
-
uses a third CTE and SQLite's built-in math functions to calculate their standard deviation.
Creating an Index
explain query plan
select filename
from plate
where filename like '%07%';
create index plate_file on plate(filename);
explain query plan
select filename
from plate
where filename like '%07%';
QUERY PLAN
`--SCAN plate USING COVERING INDEX sqlite_autoindex_plate_1
QUERY PLAN
`--SCAN plate USING COVERING INDEX plate_file
- An index is an auxiliary data structure that enables faster access to records
- Spend storage space to buy speed
- Don't have to mention it explicitly in queries
- Database manager will use it automatically
- Unlike primary keys, SQLite supports defining indexes after the fact
Generating Sequences
select value from generate_series(1, 5);
| value |
|-------|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
- A (non-standard) table-valued function
Generating Sequences Based on Data
create table temp (
num integer not null
);
insert into temp values (1), (5);
select value from generate_series (
(select min(num) from temp),
(select max(num) from temp)
);
| value |
|-------|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
- Must have the parentheses around the
minandmaxselections to keep SQLite happy
Generating Sequences of Dates
select date((select julianday(min(started)) from experiment) + value) as some_day
from (
select value from generate_series(
(select 0),
(select julianday(max(started)) - julianday(min(started)) from experiment)
)
)
limit 5;
| some_day |
|------------|
| 2023-01-29 |
| 2023-01-30 |
| 2023-01-31 |
| 2023-02-01 |
| 2023-02-02 |
- SQLite represents dates as YYYY-MM-DD strings
or as Julian days or as Unix milliseconds or…
- Julian days is fractional number of days since November 24, 4714 BCE
juliandayanddateconvert back and forthjuliandayis specific to SQLite- Other databases have their own date handling functions
Counting Experiments Started per Day Without Gaps
with
-- complete sequence of days with 0 as placeholder for number of experiments
all_days as (
select
date((select julianday(min(started)) from experiment) + value) as some_day,
0 as zeroes
from (
select value from generate_series(
(select 0),
(select count(*) - 1 from experiment)
)
)
),
-- sequence of actual days with actual number of experiments started
actual_days as (
select
started,
count(started) as num_exp
from experiment
group by started
)
-- combined by joining on day and taking actual number (if available) or zero
select
all_days.some_day as day,
coalesce(actual_days.num_exp, all_days.zeroes) as num_exp
from
all_days left join actual_days on all_days.some_day = actual_days.started
limit 5;
| day | num_exp |
|------------|---------|
| 2023-01-29 | 1 |
| 2023-01-30 | 1 |
| 2023-01-31 | 0 |
| 2023-02-01 | 0 |
| 2023-02-02 | 1 |
Exercise
What does the expression date('now', 'start of month', '+1 month', '-1 day') produce?
(You may find the documentation on SQLite's date and time functions helpful.)
Self Join
with person as (
select
ident,
personal || ' ' || family as name
from staff
)
select
left_person.name,
right_person.name
from person as left_person inner join person as right_person
limit 10;
| name | name |
|--------------|------------------|
| Kartik Gupta | Kartik Gupta |
| Kartik Gupta | Divit Dhaliwal |
| Kartik Gupta | Indrans Sridhar |
| Kartik Gupta | Pranay Khanna |
| Kartik Gupta | Riaan Dua |
| Kartik Gupta | Vedika Rout |
| Kartik Gupta | Abram Chokshi |
| Kartik Gupta | Romil Kapoor |
| Kartik Gupta | Ishaan Ramaswamy |
| Kartik Gupta | Nitya Lal |
- Join a table to itself
- Use
asto create aliases for copies of tables to distinguish them - Nothing special about the names
leftandright
- Use
- Get all n^2 pairs, including person with themself
Generating Unique Pairs
with person as (
select
ident,
personal || ' ' || family as name
from staff
)
select
left_person.name,
right_person.name
from person as left_person inner join person as right_person
on left_person.ident < right_person.ident
where left_person.ident <= 4 and right_person.ident <= 4;
| name | name |
|-----------------|-----------------|
| Kartik Gupta | Divit Dhaliwal |
| Kartik Gupta | Indrans Sridhar |
| Kartik Gupta | Pranay Khanna |
| Divit Dhaliwal | Indrans Sridhar |
| Divit Dhaliwal | Pranay Khanna |
| Indrans Sridhar | Pranay Khanna |
left.ident < right.identensures distinct pairs without duplicates- Query uses
left.ident <= 4 and right.ident <= 4to shorten output
- Query uses
- Quick check: n(n-1)/2 pairs
Filtering Pairs
with
person as (
select
ident,
personal || ' ' || family as name
from staff
),
together as (
select
left_perf.staff as left_staff,
right_perf.staff as right_staff
from performed as left_perf inner join performed as right_perf
on left_perf.experiment = right_perf.experiment
where left_staff < right_staff
)
select
left_person.name as person_1,
right_person.name as person_2
from person as left_person inner join person as right_person join together
on left_person.ident = left_staff and right_person.ident = right_staff;
| person_1 | person_2 |
|-----------------|------------------|
| Kartik Gupta | Vedika Rout |
| Pranay Khanna | Vedika Rout |
| Indrans Sridhar | Romil Kapoor |
| Abram Chokshi | Ishaan Ramaswamy |
| Pranay Khanna | Vedika Rout |
| Kartik Gupta | Abram Chokshi |
| Abram Chokshi | Romil Kapoor |
| Kartik Gupta | Divit Dhaliwal |
| Divit Dhaliwal | Abram Chokshi |
| Pranay Khanna | Ishaan Ramaswamy |
| Indrans Sridhar | Romil Kapoor |
| Kartik Gupta | Ishaan Ramaswamy |
| Kartik Gupta | Nitya Lal |
| Kartik Gupta | Abram Chokshi |
| Pranay Khanna | Romil Kapoor |
Existence and Correlated Subqueries
select
name,
building
from department
where
exists (
select 1
from staff
where dept = department.ident
)
order by name;
| name | building |
|-------------------|------------------|
| Genetics | Chesson |
| Histology | Fashet Extension |
| Molecular Biology | Chesson |
- Endocrinology is missing from the list
select 1could equally beselect trueor any other value- A correlated subquery depends on a value from the outer query
- Equivalent to nested loop
Nonexistence
select
name,
building
from department
where
not exists (
select 1
from staff
where dept = department.ident
)
order by name;
| name | building |
|---------------|----------|
| Endocrinology | TGVH |
Exercise
Can you rewrite the previous query using exclude?
If so, is your new query easier to understand?
If the query cannot be rewritten, why not?
Avoiding Correlated Subqueries
select distinct
department.name as name,
department.building as building
from department inner join staff
on department.ident = staff.dept
order by name;
| name | building |
|-------------------|------------------|
| Genetics | Chesson |
| Histology | Fashet Extension |
| Molecular Biology | Chesson |
- The join might or might not be faster than the correlated subquery
- Hard to find unstaffed departments without either
not existsorcountand a check for 0
Lead and Lag
with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)
select
ym,
lag(num) over (order by ym) as prev_num,
num,
lead(num) over (order by ym) as next_num
from ym_num
order by ym;
| ym | prev_num | num | next_num |
|---------|----------|-----|----------|
| 2023-01 | | 2 | 5 |
| 2023-02 | 2 | 5 | 5 |
| 2023-03 | 5 | 5 | 1 |
| 2023-04 | 5 | 1 | 6 |
| 2023-05 | 1 | 6 | 5 |
| 2023-06 | 6 | 5 | 3 |
| 2023-07 | 5 | 3 | 2 |
| 2023-08 | 3 | 2 | 4 |
| 2023-09 | 2 | 4 | 6 |
| 2023-10 | 4 | 6 | 4 |
| 2023-12 | 6 | 4 | 5 |
| 2024-01 | 4 | 5 | 2 |
| 2024-02 | 5 | 2 | |
- Use
strftimeto extract year and month- Clumsy, but date/time handling is not SQLite's strong point
- Use window functions
leadandlagto shift values- Unavailable values at the top or bottom are null
Boundaries
- Documentation on SQLite's window functions describes three frame types and five kinds of frame boundary
- It feels very ad hoc, but so does the real world
Windowing Functions
with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)
select
ym,
num,
sum(num) over (order by ym) as num_done,
(sum(num) over (order by ym) * 1.00) / (select sum(num) from ym_num) as completed_progress,
cume_dist() over (order by ym) as linear_progress
from ym_num
order by ym;
| ym | num | num_done | completed_progress | linear_progress |
|---------|-----|----------|--------------------|--------------------|
| 2023-01 | 2 | 2 | 0.04 | 0.0769230769230769 |
| 2023-02 | 5 | 7 | 0.14 | 0.153846153846154 |
| 2023-03 | 5 | 12 | 0.24 | 0.230769230769231 |
| 2023-04 | 1 | 13 | 0.26 | 0.307692307692308 |
| 2023-05 | 6 | 19 | 0.38 | 0.384615384615385 |
| 2023-06 | 5 | 24 | 0.48 | 0.461538461538462 |
| 2023-07 | 3 | 27 | 0.54 | 0.538461538461538 |
| 2023-08 | 2 | 29 | 0.58 | 0.615384615384615 |
| 2023-09 | 4 | 33 | 0.66 | 0.692307692307692 |
| 2023-10 | 6 | 39 | 0.78 | 0.769230769230769 |
| 2023-12 | 4 | 43 | 0.86 | 0.846153846153846 |
| 2024-01 | 5 | 48 | 0.96 | 0.923076923076923 |
| 2024-02 | 2 | 50 | 1.0 | 1.0 |
sum() overdoes a running totalcume_dist()is fraction of rows seen so far- So
num_donecolumn is number of experiments done… - …
completed_progressis the fraction of experiments done… - …and
linear_progressis the fraction of time passed
Explaining Another Query Plan
explain query plan
with ym_num as (
select
strftime('%Y-%m', started) as ym,
count(*) as num
from experiment
group by ym
)
select
ym,
num,
sum(num) over (order by ym) as num_done,
cume_dist() over (order by ym) as progress
from ym_num
order by ym;
QUERY PLAN
|--CO-ROUTINE (subquery-3)
| |--CO-ROUTINE (subquery-4)
| | |--CO-ROUTINE ym_num
| | | |--SCAN experiment
| | | `--USE TEMP B-TREE FOR GROUP BY
| | |--SCAN ym_num
| | `--USE TEMP B-TREE FOR ORDER BY
| `--SCAN (subquery-4)
`--SCAN (subquery-3)
- Becomes useful…eventually
Partitioned Windows
with y_m_num as (
select
strftime('%Y', started) as year,
strftime('%m', started) as month,
count(*) as num
from experiment
group by year, month
)
select
year,
month,
num,
sum(num) over (partition by year order by month) as num_done
from y_m_num
order by year, month;
| year | month | num | num_done |
|------|-------|-----|----------|
| 2023 | 01 | 2 | 2 |
| 2023 | 02 | 5 | 7 |
| 2023 | 03 | 5 | 12 |
| 2023 | 04 | 1 | 13 |
| 2023 | 05 | 6 | 19 |
| 2023 | 06 | 5 | 24 |
| 2023 | 07 | 3 | 27 |
| 2023 | 08 | 2 | 29 |
| 2023 | 09 | 4 | 33 |
| 2023 | 10 | 6 | 39 |
| 2023 | 12 | 4 | 43 |
| 2024 | 01 | 5 | 5 |
| 2024 | 02 | 2 | 7 |
partition bycreates groups- So this counts experiments started since the beginning of each year
Exercise
Create a query that:
- finds the unique weights of the penguins in the
penguinsdatabase; - sorts them;
- finds the difference between each successive distinct weight; and
- counts how many times each unique difference appears.