Loading Survey Data
Importing data into the database
Concepts
- A CSV file stores tabular data as plain text: one row per line with fields separated by commas
- `csv.DictReader` yields each row as a dictionary keyed by the column names in the header row
  - Show me how to read a CSV file using `csv.DictReader` and print each row as a dictionary
- All values in a CSV file are strings
  - Type conversion (e.g., `float(row["diameter"])`) is required before storing values in a typed column
  - Missing or malformed values must be handled gracefully (see the sketch after this list)
  - What are some different ways to handle missing or malformed values when reading data?
- Wrapping a bulk import in a transaction prevents a partially loaded dataset from being left in the database
  - What happens if an import wrapped in a transaction fails halfway through?
- `executemany` passes a list of value tuples to a single prepared SQL statement
  - Faster than calling `execute` in a loop because it parses and plans the query only once
  - What differences are there between calling `execute` and `executemany`?
- Idempotency: a script is idempotent if running it multiple times produces the same result as running it once
  - A data loader that is not idempotent inserts duplicate rows every time it runs
  - How do I make a data loading script idempotent so that running it twice does not insert duplicate rows?
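A minimal sketch of the reading and conversion concepts, assuming a hypothetical `measurements.csv` whose `diameter` column should become a float; the filename and column name are placeholders:

```python
import csv

def read_measurements(path):
    """Read a CSV file, converting the diameter field to float.

    Rows with a missing or malformed diameter are reported and skipped
    rather than aborting the whole read.
    """
    rows = []
    with open(path, newline="") as stream:
        # Data starts on line 2 because line 1 is the header row.
        for line_num, row in enumerate(csv.DictReader(stream), start=2):
            try:
                row["diameter"] = float(row["diameter"])
            except (KeyError, TypeError, ValueError):
                print(f"warning: bad diameter on line {line_num}: {row}")
                continue
            rows.append(row)
    return rows
```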
What snailz provides
- The snailz package and the CSV files it generates: sites, surveys, and measurements
  - Print a summary of the CSV files the snailz package generates and the columns each contain as JSON
- Reading the documentation rather than guessing at column names
  - Write a function that checks the JSON generated by the previous function against the actual files, reporting discrepancies
Parsing a CSV with the standard library
- `csv.DictReader`, handling missing columns, and converting types (sketched after this list)
  - Write a function that uses `csv.DictReader` to parse a snail measurements CSV file and convert each row to a typed dataclass instance
- Alternative: use Polars (also sketched below)
  - Write a function to read a CSV file using Polars, returning a dataframe instead of a list of dataclass instances
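A sketch of the dataclass approach; the `Measurement` fields are guesses based on the columns mentioned in this outline, not snailz's actual schema, so check the documentation for the real column names:

```python
import csv
from dataclasses import dataclass

@dataclass
class Measurement:
    survey_id: str   # illustrative field names, not snailz's actual schema
    diameter: float

def load_measurements(path):
    """Parse a measurements CSV into a list of typed Measurement instances."""
    with open(path, newline="") as stream:
        return [
            Measurement(survey_id=row["survey_id"], diameter=float(row["diameter"]))
            for row in csv.DictReader(stream)
        ]
```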
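And the Polars alternative: `polars.read_csv` infers column types during the read, so no per-row conversion is needed (the function name below is a placeholder):

```python
import polars as pl

def load_measurements_df(path):
    """Read a measurements CSV into a Polars dataframe with inferred types."""
    return pl.read_csv(path)
```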
Bulk insertion
- Inserting many rows efficiently with `executemany` vs. a loop (see the sketch after this list)
  - Write a function that inserts every dataclass instance in a list into a database table
- Wrapping the import in a transaction so it succeeds or fails atomically
  - Refactor the function from the previous prompt to run in a transaction
- Counting rows and spot-checking a few values to confirm the import worked
  - Add checks to the function to ensure that the right number of rows were inserted
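A sketch combining all three steps for SQLite, reusing the hypothetical `Measurement` dataclass and `measurements` table from the parsing sketches:

```python
import sqlite3

def insert_measurements(connection, measurements):
    """Bulk-insert Measurement instances inside a single transaction."""
    # Using the connection as a context manager commits on success
    # and rolls back on any exception, so the import is all-or-nothing.
    with connection:
        connection.executemany(
            "INSERT INTO measurements (survey_id, diameter) VALUES (?, ?)",
            [(m.survey_id, m.diameter) for m in measurements],
        )
    # Spot-check: count the rows to confirm the import actually happened.
    (count,) = connection.execute("SELECT COUNT(*) FROM measurements").fetchone()
    assert count >= len(measurements), f"expected {len(measurements)} rows, found {count}"
```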
A taskipy task for data loading
- A taskipy task that runs the importer on a fresh database (see the pyproject.toml sketch below)
  - Write a taskipy task that drops and recreates the database and then runs the data loader
- Idempotency: what happens if you run it twice, and how to handle that (see the guard sketch below)
  - Modify the function so that it fails without modifying the database if the rows have already been inserted
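A sketch of the taskipy configuration in `pyproject.toml`; the script names are placeholders for whatever the project actually calls its database-creation and loading scripts:

```toml
[tool.taskipy.tasks]
# Drop and recreate the database, then run the loader against it.
load-data = "python create_db.py --fresh && python load_data.py"
```

Run it with `task load-data`.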
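And a sketch of the guard described in the last bullet, reusing `insert_measurements` from the bulk-insertion sketch: it fails before touching the database if data is already present:

```python
def insert_measurements_once(connection, measurements):
    """Refuse to load if the table already contains data."""
    (count,) = connection.execute("SELECT COUNT(*) FROM measurements").fetchone()
    if count > 0:
        raise RuntimeError(f"measurements already has {count} rows; refusing to reload")
    insert_measurements(connection, measurements)
```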
Check for Understanding
What does it mean for a data-loading script to be idempotent, and why does it matter?
An idempotent script produces the same result no matter how many times you run it. For a data loader, this typically means that if the data is already present, do not add it again. It matters because loaders are often re-run after failures, schema changes, or data corrections. A non-idempotent loader silently doubles the data on each run, making results inconsistent.
What goes wrong if you convert a CSV string field to `int` without catching exceptions?
If the field is missing (an empty string) or contains a non-numeric value like "N/A", `int("")` raises a `ValueError` and the entire import fails, potentially mid-transaction. You should wrap conversions in `try`/`except`, decide on a fallback value (e.g., `None` for nullable fields), and log a warning so you know which rows had problems.
Why is `executemany` faster than calling `execute` in a loop?
Each call to `execute` sends the SQL to the database engine, which parses and plans the query before running it. `executemany` parses and plans the statement once, then runs it repeatedly for each set of values. For thousands of rows the difference is substantial.
If the import transaction fails halfway through, how many rows end up in the database?
None. When a transaction rolls back, all changes made during the transaction are undone, leaving the database in exactly the state it was in before the transaction started. This is the whole point of wrapping the import in a transaction: partial imports are worse than no import because they are hard to detect and correct.
Exercises
Validate before inserting
Before inserting each row, check that all required fields are present and that numeric fields contain valid numbers. Collect warnings for rows that fail validation and print them after the import completes rather than stopping on the first error. Decide whether rows with validation failures should be skipped or should halt the import, and explain your reasoning.
Handle duplicate runs
Modify the loader so that running it twice does not insert duplicate rows. Choose between two approaches: `INSERT OR IGNORE` (which skips rows that would violate a unique constraint) or deleting all existing data before reloading. Implement one approach and explain why you chose it.
Add progress reporting
Add a counter that prints the number of rows inserted every 100 rows, and prints the total on completion. Prompt the LLM for help, and check that the code it produces does not use any external libraries that are not already in the project.
Load a second file
The snailz package generates both a measurements file and a sites file. Add a loader for the sites CSV and run both loaders in sequence. Before prompting the LLM for help, write down which loader must run first and why. Verify your answer is correct after the implementation is complete.