Advanced Features

Terms defined: atomic, base case, Binary Large Object, consistent, comma-separated values, denormalization, durable, infinite recursion, isolated, JavaScript Object Notation, materialized view, normal form, path expression, recursive case, recursive CTE, temporary table, trigger, upsert, view

Blobs

create table images (
    name text not null,
    content blob
);

insert into images (name, content) values
('biohazard', readfile('img/biohazard.png')),
('crush', readfile('img/crush.png')),
('fire', readfile('img/fire.png')),
('radioactive', readfile('img/radioactive.png')),
('tripping', readfile('img/tripping.png'));

select
    name,
    length(content)
from images;

|    name     | length(content) |
|-------------|-----------------|
| biohazard   | 19629           |
| crush       | 15967           |
| fire        | 18699           |
| radioactive | 16661           |
| tripping    | 17208           |

A blob is a binary large object
- Bytes in, bytes out…
If you think that’s odd, check out Fossil

Exercise

Modify the query shown above to select the value of content rather than its length. How intelligible is the output? Does using SQLite’s hex() function make it any more readable?

Yet Another Database

sqlite3 db/lab_log.db

.schema

CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person(
       ident            integer primary key autoincrement,
       details          text not null
);
CREATE TABLE machine(
       ident            integer primary key autoincrement,
       name             text not null,
       details          text not null
);
CREATE TABLE usage(
       ident            integer primary key autoincrement,
       log              text not null
);

Storing JSON

select * from machine;

| ident |      name      |                         details                         |
|-------|----------------|---------------------------------------------------------|
| 1     | WY401          | {"acquired": "2023-05-01"}                              |
| 2     | Inphormex      | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} |
| 3     | AutoPlate 9000 | {"note": "needs software update"}                       |

Store heterogeneous data as JSON-formatted text (with double-quoted strings)
- Database parses the text each time it is queried, so performance can be an issue
Can alternatively store as blob (jsonb)
- Can’t view it directly
- But more efficient

Select Fields from JSON

select
    details->'$.acquired' as single_arrow,
    details->>'$.acquired' as double_arrow
from machine;

| single_arrow | double_arrow |
|--------------|--------------|
| "2023-05-01" | 2023-05-01   |
| "2021-07-15" | 2021-07-15   |
|              |              |

Single arrow -> returns JSON representation of result
Double arrow ->> returns SQL text, integer, real, or null
Left side is column
Right side is path expression
- Start with $ (meaning “root”)
- Fields separated by .

Exercise

Write a query that selects the year from the "refurbished" field of the JSON data associated with the Inphormex plate reader.

JSON Array Access

select
    ident,
    json_array_length(log->'$') as length,
    log->'$[0]' as first
from usage;

| ident | length |                            first                             |
|-------|--------|--------------------------------------------------------------|
| 1     | 4      | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 5      | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 3     | 2      | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |
| 4     | 1      | {"machine":"sterilizer","person":["Maude","Goulet"]}         |
| 5     | 2      | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} |
| 6     | 1      | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 7     | 3      | {"machine":"WY401","person":["Maude","Goulet"]}              |
| 8     | 1      | {"machine":"AutoPlate 9000"}                                 |

SQLite and other database managers have many JSON manipulation functions
json_array_length gives number of elements in selected array
Subscripts start with 0
Characters outside 7-bit ASCII represented as Unicode escapes

Unpacking JSON Arrays

select
    ident,
    json_each.key as key,
    json_each.value as value
from usage, json_each(usage.log)
limit 10;

| ident | key |                            value                             |
|-------|-----|--------------------------------------------------------------|
| 1     | 0   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 1   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 1     | 2   | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]}       |
| 1     | 3   | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]}   |
| 2     | 0   | {"machine":"Inphormex","person":["Marianne","Richer"]}       |
| 2     | 1   | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]}  |
| 2     | 2   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 2     | 3   | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} |
| 2     | 4   | {"machine":"sterilizer","person":["Marianne","Richer"]}      |
| 3     | 0   | {"machine":"sterilizer","person":["Josette","Villeneuve"]}   |

json_each is another table-valued function
Use json_each.name to get properties of unpacked array

Exercise

Write a query that counts how many times each person appears in the first log entry associated with any piece of equipment.

Selecting the Last Element of an Array

select
    ident,
    log->'$[#-1].machine' as final
from usage
limit 5;

| ident |    final     |
|-------|--------------|
| 1     | "Inphormex"  |
| 2     | "sterilizer" |
| 3     | "Inphormex"  |
| 4     | "sterilizer" |
| 5     | "sterilizer" |

Modifying JSON

select
    ident,
    name,
    json_set(details, '$.sold', json_quote('2024-01-25')) as updated
from machine;

| ident |      name      |                           updated                            |
|-------|----------------|--------------------------------------------------------------|
| 1     | WY401          | {"acquired":"2023-05-01","sold":"2024-01-25"}                |
| 2     | Inphormex      | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":" |
|       |                | 2024-01-25"}                                                 |
| 3     | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"}         |

Updates the in-memory copy of the JSON, not the database record
Please use json_quote rather than trying to format JSON with string operations

Exercise

As part of cleaning up the lab log database, replace the machine names in the JSON records in usage with the corresopnding machine IDs from the machine table.

Refreshing the Penguins Database

select
    species,
    count(*) as num
from penguins
group by species;

|  species  | num |
|-----------|-----|
| Adelie    | 152 |
| Chinstrap | 68  |
| Gentoo    | 124 |

We will restore full database after each example

Tombstones

alter table penguins
add active integer not null default 1;

update penguins
set active = iif(species = 'Adelie', 0, 1);

select
    species,
    count(*) as num
from penguins
where active
group by species;

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |

Use a tombstone to mark (in)active records
Every query must now include it

Importing CSV Data

SQLite and most other database managers have tools for importing and exporting CSV
In SQLite:
- Define table
- Import data
- Convert empty strings to nulls (if desired)
- Convert types from text to whatever (not shown below)

drop table if exists penguins;
.mode csv penguins
.import misc/penguins.csv penguins
update penguins set species = null where species = '';
update penguins set island = null where island = '';
update penguins set bill_length_mm = null where bill_length_mm = '';
update penguins set bill_depth_mm = null where bill_depth_mm = '';
update penguins set flipper_length_mm = null where flipper_length_mm = '';
update penguins set body_mass_g = null where body_mass_g = '';
update penguins set sex = null where sex = '';

Exercise

What are the data types of the columns in the penguins table created by the CSV import shown above? How can you correct the ones that need correcting?

Views

create view if not exists
active_penguins (
    species,
    island,
    bill_length_mm,
    bill_depth_mm,
    flipper_length_mm,
    body_mass_g,
    sex
) as
select
    species,
    island,
    bill_length_mm,
    bill_depth_mm,
    flipper_length_mm,
    body_mass_g,
    sex
from penguins
where active;

select
    species,
    count(*) as num
from active_penguins
group by species;

|  species  | num |
|-----------|-----|
| Chinstrap | 68  |
| Gentoo    | 124 |

A view is a saved query that other queries can invoke
View is re-run each time it’s used
Like a CTE, but:
- Can be shared between queries
- Views came first
Some databases offer materialized views
- Update-on-demand temporary tables

Exercise

Create a view in the lab log database called busy with two columns: machine_id and total_log_length. The first column records the numeric ID of each machine; the second shows the total number of log entries for that machine.

Check Understanding

Figure 4.1: temporary tables

Hours Reminder

create table job (
    name text not null,
    billable real not null
);
insert into job values
('calibrate', 1.5),
('clean', 0.5);
select * from job;

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |
| clean     | 0.5      |

Adding Checks

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);
insert into job values ('calibrate', 1.5);
insert into job values ('reset', -0.5);
select * from job;

Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

check adds constraint to table
- Must produce a Boolean result
- Run each time values added or modified
But changes made before the error have taken effect

Exercise

Rewrite the definition of the penguins table to add the following constraints:

body_mass_g must be null or non-negative.
island must be one of "Biscoe", "Dream", or "Torgersen". (Hint: the in operator will be useful here.)

ACID

Atomic: change cannot be broken down into smaller ones (i.e., all or nothing)
Consistent: database goes from one consistent state to another
Isolated: looks like changes happened one after another
Durable: if change takes place, it’s still there after a restart

Transactions

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);

insert into job values ('calibrate', 1.5);

begin transaction;
insert into job values ('clean', 0.5);
rollback;

select * from job;

|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

Statements outside transaction execute and are committed immediately
Statement(s) inside transaction don’t take effect until:
- end transaction (success)
- rollback (undo)
Can have any number of statements inside a transaction
But cannot nest transactions in SQLite
- Other databases support this

Rollback in Constraints

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0) on conflict rollback
);

insert into job values
    ('calibrate', 1.5);
insert into job values
    ('clean', 0.5),
    ('reset', -0.5);

select * from job;

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

All of second insert rolled back as soon as error occurred
But first insert took effect

Rollback in Statements

create table job (
    name text not null,
    billable real not null,
    check (billable > 0.0)
);

insert or rollback into job values
('calibrate', 1.5);
insert or rollback into job values
('clean', 0.5),
('reset', -0.5);

select * from job;

Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
|   name    | billable |
|-----------|----------|
| calibrate | 1.5      |

Constraint is in table definition
Action is in statement

Upsert

create table jobs_done (
    person text unique,
    num integer default 0
);

insert into jobs_done values
('zia', 1);
.print 'after first'
select * from jobs_done;
.print


insert into jobs_done values
('zia', 1);
.print 'after failed'
select * from jobs_done;

insert into jobs_done values
('zia', 1)
on conflict(person) do update set num = num + 1;
.print '\nafter upsert'
select * from jobs_done;

after first
| person | num |
|--------|-----|
| zia    | 1   |

Runtime error near line 15: UNIQUE constraint failed: jobs_done.person (19)
after failed
| person | num |
|--------|-----|
| zia    | 1   |
\nafter upsert
| person | num |
|--------|-----|
| zia    | 2   |

upsert stands for “update or insert”
- Create if record doesn’t exist
- Update if it does
Not standard SQL but widely implemented
Example also shows use of SQLite .print command

Exercise

Using the assay database, write a query that adds or modifies people in the staff table as shown:

personal	family	dept	age
Pranay	Khanna	mb	41
Riaan	Dua	gen	23
Parth	Johel	gen	27

Normalization

First normal form (1NF): every field of every record contains one indivisible value.
Second normal form (2NF) and third normal form (3NF): every value in a record that isn’t a key depends solely on the key, not on other values.
Denormalization: explicitly store values that could be calculated on the fly
- To simplify queries and/or make processing faster

Creating Triggers

-- Track hours of lab work.
create table job (
    person text not null,
    reported real not null check (reported >= 0.0)
);

-- Explicitly store per-person total rather than using sum().
create table total (
    person text unique not null,
    hours real
);

-- Initialize totals.
insert into total values
('gene', 0.0),
('august', 0.0);

-- Define a trigger.
create trigger total_trigger
before insert on job
begin
    -- Check that the person exists.
    select case
        when not exists (select 1 from total where person = new.person)
        then raise(rollback, 'Unknown person ')
    end;
    -- Update their total hours (or fail if non-negative constraint violated).
    update total
    set hours = hours + new.reported
    where total.person = new.person;
end;

A trigger automatically runs before or after a specified operation
Can have side effects (e.g., update some other table)
And/or implement checks (e.g., make sure other records exist)
Add processing overhead…
…but data is either cheap or correct, never both
Inside trigger, refer to old and new versions of record as old.column and new.column

Trigger Not Firing

insert into job values
('gene', 1.5),
('august', 0.5),
('gene', 1.0);

| person | reported |
|--------|----------|
| gene   | 1.5      |
| august | 0.5      |
| gene   | 1.0      |

| person | hours |
|--------|-------|
| gene   | 2.5   |
| august | 0.5   |

Trigger Firing

insert into job values
('gene', 1.0),
('august', -1.0);

Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19)

| person | hours |
|--------|-------|
| gene   | 0.0   |
| august | 0.0   |

Exercise

Using the penguins database:

create a table called species with columns name and count; and
define a trigger that increments the count associated with each species each time a new penguin is added to the penguins table.

Does your solution behave correctly when several penguins are added by a single insert statement?

Representing Graphs

create table lineage (
    parent text not null,
    child text not null
);
insert into lineage values
('Arturo', 'Clemente'),
('Darío', 'Clemente'),
('Clemente', 'Homero'),
('Clemente', 'Ivonne'),
('Ivonne', 'Lourdes'),
('Soledad', 'Lourdes'),
('Lourdes', 'Santiago');

select * from lineage;

|  parent  |  child   |
|----------|----------|
| Arturo   | Clemente |
| Darío    | Clemente |
| Clemente | Homero   |
| Clemente | Ivonne   |
| Ivonne   | Lourdes  |
| Soledad  | Lourdes  |
| Lourdes  | Santiago |

Figure 4.2: lineage diagram

Exercise

Write a query that uses a self join to find every person’s grandchildren.

Recursive Queries

with recursive descendent as (
    select
        'Clemente' as person,
        0 as generations
    union all
    select
        lineage.child as person,
        descendent.generations + 1 as generations
    from descendent inner join lineage
        on descendent.person = lineage.parent
)

select
    person,
    generations
from descendent;

|  person  | generations |
|----------|-------------|
| Clemente | 0           |
| Homero   | 1           |
| Ivonne   | 1           |
| Lourdes  | 2           |
| Santiago | 3           |

Use a recursive CTE to create a temporary table (descendent)
Base case seeds this table
Recursive case relies on value(s) already in that table and external table(s)
union all to combine rows
- Can use union but that has lower performance (must check uniqueness each time)
Stops when the recursive case yields an empty row set (nothing new to add)
Then select the desired values from the CTE

Exercise

Modify the recursive query shown above to use union instead of union all. Does this affect the result? Why or why not?

Contact Tracing Database

select * from person;

| ident |         name          |
|-------|-----------------------|
| 1     | Juana Baeza           |
| 2     | Agustín Rodríquez     |
| 3     | Ariadna Caraballo     |
| 4     | Micaela Laboy         |
| 5     | Verónica Altamirano   |
| 6     | Reina Rivero          |
| 7     | Elias Merino          |
| 8     | Minerva Guerrero      |
| 9     | Mauro Balderas        |
| 10    | Pilar Alarcón         |
| 11    | Daniela Menéndez      |
| 12    | Marco Antonio Barrera |
| 13    | Cristal Soliz         |
| 14    | Bernardo Narváez      |
| 15    | Óscar Barrios         |

select * from contact;

|       left        |         right         |
|-------------------|-----------------------|
| Agustín Rodríquez | Ariadna Caraballo     |
| Agustín Rodríquez | Verónica Altamirano   |
| Juana Baeza       | Verónica Altamirano   |
| Juana Baeza       | Micaela Laboy         |
| Pilar Alarcón     | Reina Rivero          |
| Cristal Soliz     | Marco Antonio Barrera |
| Cristal Soliz     | Daniela Menéndez      |
| Daniela Menéndez  | Marco Antonio Barrera |

box and line diagram showing who has had contact with whom — Figure 4.3: contact diagram

Bidirectional Contacts

create temporary table bi_contact (
    left text,
    right text
);

insert into bi_contact
select
    left, right from contact
    union all
    select right, left from contact
;

| original_count |
|----------------|
| 8              |

| num_contact |
|-------------|
| 16          |

Create a temporary table rather than using a long chain of CTEs
- Only lasts as long as the session (not saved to disk)
Duplicate information rather than writing more complicated query

Updating Group Identifiers

select
    left.name as left_name,
    left.ident as left_ident,
    right.name as right_name,
    right.ident as right_ident,
    min(left.ident, right.ident) as new_ident
from
    (person as left join bi_contact on left.name = bi_contact.left)
    join person as right on bi_contact.right = right.name;

|       left_name       | left_ident |      right_name       | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza           | 1          | Micaela Laboy         | 4           | 1         |
| Juana Baeza           | 1          | Verónica Altamirano   | 5           | 1         |
| Agustín Rodríquez     | 2          | Ariadna Caraballo     | 3           | 2         |
| Agustín Rodríquez     | 2          | Verónica Altamirano   | 5           | 2         |
| Ariadna Caraballo     | 3          | Agustín Rodríquez     | 2           | 2         |
| Micaela Laboy         | 4          | Juana Baeza           | 1           | 1         |
| Verónica Altamirano   | 5          | Agustín Rodríquez     | 2           | 2         |
| Verónica Altamirano   | 5          | Juana Baeza           | 1           | 1         |
| Reina Rivero          | 6          | Pilar Alarcón         | 10          | 6         |
| Pilar Alarcón         | 10         | Reina Rivero          | 6           | 6         |
| Daniela Menéndez      | 11         | Cristal Soliz         | 13          | 11        |
| Daniela Menéndez      | 11         | Marco Antonio Barrera | 12          | 11        |
| Marco Antonio Barrera | 12         | Cristal Soliz         | 13          | 12        |
| Marco Antonio Barrera | 12         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Daniela Menéndez      | 11          | 11        |
| Cristal Soliz         | 13         | Marco Antonio Barrera | 12          | 12        |

new_ident is minimum of own identifier and identifiers one step away
Doesn’t keep people with no contacts

Recursive Labeling

with recursive labeled as (
    select
        person.name as name,
        person.ident as label
    from
        person
    union -- not 'union all'
    select
        person.name as name,
        labeled.label as label
    from
        (person join bi_contact on person.name = bi_contact.left)
        join labeled on bi_contact.right = labeled.name
    where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;

|         name          | group_id |
|-----------------------|----------|
| Agustín Rodríquez     | 1        |
| Ariadna Caraballo     | 1        |
| Juana Baeza           | 1        |
| Micaela Laboy         | 1        |
| Verónica Altamirano   | 1        |
| Pilar Alarcón         | 6        |
| Reina Rivero          | 6        |
| Elias Merino          | 7        |
| Minerva Guerrero      | 8        |
| Mauro Balderas        | 9        |
| Cristal Soliz         | 11       |
| Daniela Menéndez      | 11       |
| Marco Antonio Barrera | 11       |
| Bernardo Narváez      | 14       |
| Óscar Barrios         | 15       |

Use union instead of union all to prevent infinite recursion

Exercise

Modify the query above to use union all instead of union to trigger infinite recursion. How can you modify the query so that it stops at a certain depth so that you can trace its output?

Check Understanding

Figure 4.4: common table expressions