Core Features
Terms defined: administration command, aggregation, aggregation function, cross join, exclusive or, filter, full outer join, group, in-memory database, inclusive or, join, join condition, left outer join, null, query, right outer join, ternary logic, tombstone
Selecting Constant
select 1;
1
selectis a keyword- Normally used to select data from table…
- …but if all we want is a constant value, we don't need to specify one
- Semi-colon terminator is required
Selecting All Values from Table
select * from little_penguins;
Gentoo|Biscoe|51.3|14.2|218.0|5300.0|MALE
Adelie|Dream|35.7|18.0|202.0|3550.0|FEMALE
Adelie|Torgersen|36.6|17.8|185.0|3700.0|FEMALE
Chinstrap|Dream|55.8|19.8|207.0|4000.0|MALE
Adelie|Dream|38.1|18.6|190.0|3700.0|FEMALE
Adelie|Dream|36.2|17.3|187.0|3300.0|FEMALE
Adelie|Dream|39.5|17.8|188.0|3300.0|FEMALE
Gentoo|Biscoe|42.6|13.7|213.0|4950.0|FEMALE
Gentoo|Biscoe|52.1|17.0|230.0|5550.0|MALE
Adelie|Torgersen|36.7|18.8|187.0|3800.0|FEMALE
- An actual query
- Use
*to mean "all columns" - Use
from tablenameto specify table - Output format is not particularly readable
Administrative Commands
.headers on
.mode markdown
select * from little_penguins;
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| Gentoo | Biscoe | 51.3 | 14.2 | 218.0 | 5300.0 | MALE |
| Adelie | Dream | 35.7 | 18.0 | 202.0 | 3550.0 | FEMALE |
| Adelie | Torgersen | 36.6 | 17.8 | 185.0 | 3700.0 | FEMALE |
| Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | MALE |
| Adelie | Dream | 38.1 | 18.6 | 190.0 | 3700.0 | FEMALE |
| Adelie | Dream | 36.2 | 17.3 | 187.0 | 3300.0 | FEMALE |
| Adelie | Dream | 39.5 | 17.8 | 188.0 | 3300.0 | FEMALE |
| Gentoo | Biscoe | 42.6 | 13.7 | 213.0 | 4950.0 | FEMALE |
| Gentoo | Biscoe | 52.1 | 17.0 | 230.0 | 5550.0 | MALE |
| Adelie | Torgersen | 36.7 | 18.8 | 187.0 | 3800.0 | FEMALE |
.mode markdownand.headers onmake the output more readable- These SQLite administrative commands
start with
.and aren't part of the SQL standard- PostgreSQL's special commands start with
\
- PostgreSQL's special commands start with
- Each command must appear on a line of its own
- Use
.helpfor a complete list - And as mentioned earlier, use
.quitto quit
Specifying Columns
select
species,
island,
sex
from little_penguins;
| species | island | sex |
|-----------|-----------|--------|
| Gentoo | Biscoe | MALE |
| Adelie | Dream | FEMALE |
| Adelie | Torgersen | FEMALE |
| Chinstrap | Dream | MALE |
| Adelie | Dream | FEMALE |
| Adelie | Dream | FEMALE |
| Adelie | Dream | FEMALE |
| Gentoo | Biscoe | FEMALE |
| Gentoo | Biscoe | MALE |
| Adelie | Torgersen | FEMALE |
- Specify column names separated by commas
- In any order
- Duplicates allowed
- Line breaks encouraged for readability
Sorting
select
species,
sex,
island
from little_penguins
order by island asc, sex desc;
| species | sex | island |
|-----------|--------|-----------|
| Gentoo | MALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Chinstrap | MALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Torgersen |
| Adelie | FEMALE | Torgersen |
order bymust followfrom(which must followselect)ascis ascending,descis descending- Default is ascending, but please specify
Exercise
Write a SQL query to select the sex and body mass columns from the little_penguins in that order,
sorted such that the largest body mass appears first.
Limiting Output
- Full dataset has 344 rows
select
species,
sex,
island
from penguins
order by species, sex, island
limit 10;
| species | sex | island |
|---------|--------|-----------|
| Adelie | | Dream |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
- Comments start with
--and run to the end of the line limit Nspecifies maximum number of rows returned by query
Paging Output
select
species,
sex,
island
from penguins
order by species, sex, island
limit 10 offset 3;
| species | sex | island |
|---------|--------|-----------|
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
offset Nmust followlimit- Specifies number of rows to skip from the start of the selection
- So this query skips the first 3 and shows the next 10
Removing Duplicates
select distinct
species,
sex,
island
from penguins;
| species | sex | island |
|-----------|--------|-----------|
| Adelie | MALE | Torgersen |
| Adelie | FEMALE | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Adelie | FEMALE | Dream |
| Adelie | MALE | Dream |
| Adelie | | Dream |
| Chinstrap | FEMALE | Dream |
| Chinstrap | MALE | Dream |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
distinctkeyword must appear right afterselect- SQL was supposed to read like English
- Shows distinct combinations
- Blanks in
sexcolumn show missing data- We'll talk about this in a bit
Exercise
-
Write a SQL query to select the islands and species from rows 50 to 60 inclusive of the
penguinstable. Your result should have 11 rows. -
Modify your query to select distinct combinations of island and species from the same rows and compare the result to what you got in part 1.
Filtering Results
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
where conditionfilters the rows produced by selection- Condition is evaluated independently for each row
- Only rows that pass the test appear in results
- Use single quotes for
'text data'and double quotes for"weird column names"- SQLite will accept double-quoted text data but [SQLFluff][sqlfluff] will complain
Exercise
-
Write a query to select the body masses from
penguinsthat are less than 3000.0 grams. -
Write another query to select the species and sex of penguins that weight less than 3000.0 grams. This shows that the columns displayed and those used in filtering are independent of each other.
Filtering with More Complex Conditions
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'MALE';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
and: both sub-conditions must be trueor: either or both part must be true- Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test
- We'll talk about this in a bit
Exercise
-
Use the
notoperator to select penguins that are not Gentoos. -
SQL's
oris an inclusive or: it succeeds if either or both conditions are true. SQL does not provide a specific operator for exclusive or, which is true if either but not both conditions are true, but the same effect can be achieved usingand,or, andnot. Write a query to select penguins that are female or on Torgersen Island but not both.
Doing Calculations
select
flipper_length_mm / 10.0,
body_mass_g / 1000.0
from penguins
limit 3;
| flipper_length_mm / 10.0 | body_mass_g / 1000.0 |
|--------------------------|----------------------|
| 18.1 | 3.75 |
| 18.6 | 3.8 |
| 19.5 | 3.25 |
- Can do the usual kinds of arithmetic on individual values
- Calculation done for each row independently
- Column name shows the calculation done
Renaming Columns
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 3;
| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1 | 3.75 | Torgersen |
| 18.6 | 3.8 | Torgersen |
| 19.5 | 3.25 | Torgersen |
- Use
expression as nameto rename - Give result of calculation a meaningful name
- Can also rename columns without modifying
Exercise
Write a single query that calculates and returns:
- A column called
what_wherethat has the species and island of each penguin separated by a single space. - A column called
bill_ratiothat has the ratio of bill length to bill depth.
You can use the || operator to concatenate text to solve part 1,
or look at [the documentation for SQLite's format() function][sqlite_format].
Check Understanding
Calculating with Missing Values
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 5;
| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1 | 3.75 | Torgersen |
| 18.6 | 3.8 | Torgersen |
| 19.5 | 3.25 | Torgersen |
| | | Torgersen |
| 19.3 | 3.45 | Torgersen |
- SQL uses a special value
nullto representing missing data- Not 0 or empty string, but "I don't know"
- Flipper length and body weight not known for one of the first five penguins
- "I don't know" divided by 10 or 1000 is "I don't know"
Exercise
Use SQLite's .nullvalue command
to change the printed representation of null to the string null
and then re-run the previous query.
When will displaying null as null be easier to understand?
When might it be misleading?
Null Equality
- Repeated from earlier
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
- If we ask for female penguins the row with the missing sex drops out
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
Null Inequality
- But if we ask for penguins that aren't female it drops out as well
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'FEMALE';
| species | sex | island |
|---------|------|--------|
| Adelie | MALE | Biscoe |
| Gentoo | MALE | Biscoe |
Ternary Logic
select null = null;
| null = null |
|-------------|
| |
- If we don't know the left and right values, we don't know if they're equal or not
- So the result is
null - Get the same answer for
null != null - Ternary logic
| equality | |||
|---|---|---|---|
| X | Y | null | |
| X | true | false | null |
| Y | false | true | null |
| null | null | null | null |
Handling Null Safely
select
species,
sex,
island
from penguins
where sex is null;
| species | sex | island |
|---------|-----|-----------|
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Dream |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
- Use
is nullandis not nullto handle null safely - Other parts of SQL handle nulls specially
Exercise
-
Write a query to find penguins whose body mass is known but whose sex is not.
-
Write another query to find penguins whose sex is known but whose body mass is not.
Check Understanding
Aggregating
select sum(body_mass_g) as total_mass
from penguins;
| total_mass |
|------------|
| 1437000.0 |
- Aggregation combines many values to produce one
sumis an aggregation function- Combines corresponding values from multiple rows
Common Aggregation Functions
select
max(bill_length_mm) as longest_bill,
min(flipper_length_mm) as shortest_flipper,
avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;
| longest_bill | shortest_flipper | weird_ratio |
|--------------|------------------|------------------|
| 59.6 | 172.0 | 2.56087082530644 |
- This actually shouldn't work: can't calculate maximum or average if any values are null
- SQL does the useful thing instead of the right one
Exercise
What is the average body mass of penguins that weight more than 3000.0 grams?
Counting
select
count(*) as count_star,
count(sex) as count_specific,
count(distinct sex) as count_distinct
from penguins;
| count_star | count_specific | count_distinct |
|------------|----------------|----------------|
| 344 | 333 | 2 |
count(*)counts rowscount(column)counts non-null entries in columncount(distinct column)counts distinct non-null entries
Exercise
How many different body masses are in the penguins dataset?
Grouping
select avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| average_mass_g |
|------------------|
| 4005.55555555556 |
| 3862.27272727273 |
| 4545.68452380952 |
- Put rows in groups based on distinct combinations of values in columns specified with
group by - Then perform aggregation separately for each group
- But which is which?
Behavior of Unaggregated Columns
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|------------------|
| | 4005.55555555556 |
| FEMALE | 3862.27272727273 |
| MALE | 4545.68452380952 |
- All rows in each group have the same value for
sex, so no need to aggregate
Arbitrary Choice in Aggregation
select
sex,
body_mass_g
from penguins
group by sex;
| sex | body_mass_g |
|--------|-------------|
| | |
| FEMALE | 3800.0 |
| MALE | 3750.0 |
- If we don't specify how to aggregate a column,
SQLite chooses any arbitrary value from the group
- All penguins in each group have the same sex because we grouped by that, so we get the right answer
- The body mass values are in the data but unpredictable
- A common mistake
- Other database managers don't do this
- E.g., PostgreSQL complains that column must be used in an aggregation function
Exercise
Explain why the output of the previous query has a blank line before the rows for female and male penguins.
Write a query that shows each distinct body mass in the penguin dataset and the number of penguins that weigh that much.
Filtering Aggregated Values
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|------------------|
| | 4005.55555555556 |
| MALE | 4545.68452380952 |
- Using
having conditioninstead ofwhere conditionfor aggregates
Readable Output
select
sex,
round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|----------------|
| | 4005.6 |
| MALE | 4545.7 |
- Use
round(value, decimals)to round off a number
Filtering Aggregate Inputs
select
sex,
round(
avg(body_mass_g) filter (where body_mass_g < 4000.0),
1
) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|----------------|
| | 3362.5 |
| FEMALE | 3417.3 |
| MALE | 3729.6 |
filter (where condition)applies to inputs
Exercise
Write a query that uses filter to calculate the average body masses
of heavy penguins (those over 4500 grams)
and light penguins (those under 3500 grams)
simultaneously.
Is it possible to do this using where instead of filter?
Check Understanding
Creating In-memory Database
sqlite3 :memory:
- "Connect" to an in-memory database
- Changes aren't saved to disk
- Very useful for testing (discussed later)
Creating Tables
create table job (
name text not null,
billable real not null
);
create table work (
person text not null,
job text not null
);
create table namefollowed by parenthesized list of columns- Each column is a name, a data type, and optional extra information
- E.g.,
not nullprevents nulls from being added
- E.g.,
.schemais not standard SQL- SQLite has added a few things
create if not exists- upper-case keywords (SQL is case insensitive)
Following Along
- Use
work_job.dbfrom the zip file
Inserting Data
insert into job values
('calibrate', 1.5),
('clean', 0.5);
insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
| clean | 0.5 |
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
| tay | complain |
Exercise
Using an in-memory database,
define a table called notes with two text columns author and note
and then add three or four rows.
Use a query to check that the notes have been stored
and that you can (for example) select by author name.
What happens if you try to insert too many or too few values into notes?
What happens if you insert a number instead of a string into the note field?
Updating Rows
update work
set person = 'tae'
where person = 'tay';
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
| tae | complain |
- (Almost) always specify row(s) to update using
where- Otherwise update all rows in table, which is usually not wanted
Deleting Rows
delete from work
where person = 'tae';
select * from work;
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
- Again, (almost) always specify row(s) to delete using
where
Exercise
What happens if you try to delete rows that don't exist
(e.g., all entries in work that refer to juna)?
Backing Up
create table backup (
person text not null,
job text not null
);
insert into backup
select
person,
job
from work
where person = 'tae';
delete from work
where person = 'tae';
select * from backup;
| person | job |
|--------|----------|
| tae | complain |
- We will explore another strategy based on tombstones below
Exercise
Saving and restoring data as text:
-
Re-create the
notestable in an in-memory database and then use SQLite's.outputand.dumpcommands to save the database to a file callednotes.sql. Inspect the contents of this file: how has your data been stored? -
Start a fresh SQLite session and load
notes.sqlusing the.readcommand. Inspect the database using.schemaandselect *: is everything as you expected?
Saving and restoring data in binary format:
-
Re-create the
notestable in an in-memory database once again and use SQLite's.backupcommand to save it to a file callednotes.db. Inspect this file usingod -c notes.dbor a text editor that can handle binary data: how has your data been stored? -
Start a fresh SQLite session and load
notes.dbusing the.restorecommand. Inspect the database using.schemaandselect *: is everything as you expected?
Check Understanding
Combining Information
select *
from work cross join job;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | calibrate | clean | 0.5 |
| mik | clean | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| mik | complain | calibrate | 1.5 |
| mik | complain | clean | 0.5 |
| po | clean | calibrate | 1.5 |
| po | clean | clean | 0.5 |
| po | complain | calibrate | 1.5 |
| po | complain | clean | 0.5 |
| tay | complain | calibrate | 1.5 |
| tay | complain | clean | 0.5 |
- A join combines information from two tables
- cross join constructs their cross product
- All combinations of rows from each
- Result isn't particularly useful:
jobandnamevalues don't match- I.e., the combined data has records whose parts have nothing to do with each other
Inner Join
select *
from work inner join job
on work.job = job.name;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| po | clean | clean | 0.5 |
- Use
table.columnnotation to specify columns- A column can have the same name as a table
- Use
on conditionto specify join condition - Since
complaindoesn't appear injob.name, none of those rows are kept
Exercise
Re-run the query shown above using where job = name instead of the full table.name notation.
Is the shortened form easier or harder to read
and more or less likely to cause errors?
Aggregating Joined Data
select
work.person,
sum(job.billable) as pay
from work inner join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
- Combines ideas we've seen before
- But Tay is missing from the table
- No records in the
jobtable withtayas name - So no records to be grouped and summed
- No records in the
Left Join
select *
from work left join job
on work.job = job.name;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| mik | complain | | |
| po | clean | clean | 0.5 |
| po | complain | | |
| tay | complain | | |
- A left outer join keeps all rows from the left table
- Fills missing values from right table with null
Aggregating Left Joins
select
work.person,
sum(job.billable) as pay
from work left join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
| tay | |
- That's better, but we'd like to see 0 rather than a blank
Coalescing Values
select
work.person,
coalesce(sum(job.billable), 0.0) as pay
from work left join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
| tay | 0.0 |
coalesce(val1, val2, …)returns first non-null value
Full Outer Join
- Full outer join is the union of left outer join and right outer join
- Almost the same as cross join, but consider:
create table size (
s text not null
);
insert into size values ('light'), ('heavy');
create table weight (
w text not null
);
select * from size full outer join weight;
| s | w |
|-------|---|
| light | |
| heavy | |
- A cross join would produce empty result
Exercise
Find the least time each person spent on any job.
Your output should show that mik and po each spent 0.5 hours on some job.
Can you find a way to show the name of the job as well
using the SQL you have seen so far?