Loading Survey Data
Importing data into the database
Concepts
- A CSV file stores tabular data as plain text: one row per line with fields separated by commas
- `csv.DictReader` yields each row as a dictionary keyed by the column names in the header row
  - Show me how to read a CSV file using `csv.DictReader` and print each row as a dictionary
- All values in a CSV file are strings
  - Type conversion (e.g., `float(row["diameter"])`) is required before storing values in a typed column
  - Missing or malformed values must be handled gracefully (see the sketch after this list)
  - What are some different ways to handle missing or malformed values when reading data?
- Wrapping a bulk import in a transaction prevents a partially loaded dataset from being left in the database
  - What happens if an import wrapped in a transaction fails halfway through?
- `executemany` passes a list of value tuples to a single prepared SQL statement
  - Faster than calling `execute` in a loop because it parses and plans the query only once
  - What differences are there between calling `execute` and `executemany`?
- Idempotency: a script is idempotent if running it multiple times produces the same result as running it once
  - A data loader that is not idempotent inserts duplicate rows every time it runs
  - How do I make a data loading script idempotent so that running it twice does not insert duplicate rows?
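A minimal sketch of the reading and conversion concepts, assuming a hypothetical `measurements.csv` whose `diameter` column should become a float; the filename and column name are placeholders:

```python
import csv

def read_measurements(path):
    """Read a CSV file, converting the diameter field to float.

    Rows with a missing or malformed diameter are reported and skipped
    rather than aborting the whole read.
    """
    rows = []
    with open(path, newline="") as stream:
        # Data starts on line 2 because line 1 is the header row.
        for line_num, row in enumerate(csv.DictReader(stream), start=2):
            try:
                row["diameter"] = float(row["diameter"])
            except (KeyError, TypeError, ValueError):
                print(f"warning: bad diameter on line {line_num}: {row}")
                continue
            rows.append(row)
    return rows
```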
What snailz provides
- The snailz package and the CSV files it generates: sites, surveys, and measurements
  - Print a summary of the CSV files the snailz package generates and the columns each contain as JSON
- Reading the documentation rather than guessing at column names
  - Write a function that checks the JSON generated by the previous function against the actual files, reporting discrepancies
Parsing a CSV with the standard library
- `csv.DictReader`, handling missing columns, and converting types (sketched after this list)
  - Write a function that uses `csv.DictReader` to parse a snail measurements CSV file and convert each row to a typed dataclass instance
- Alternative: use Polars (also sketched below)
  - Write a function to read a CSV file using Polars, returning a dataframe instead of a list of dataclass instances
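A sketch of the dataclass approach; the `Measurement` fields are guesses based on the columns mentioned in this outline, not snailz's actual schema, so check the documentation for the real column names:

```python
import csv
from dataclasses import dataclass

@dataclass
class Measurement:
    survey_id: str   # illustrative field names, not snailz's actual schema
    diameter: float

def load_measurements(path):
    """Parse a measurements CSV into a list of typed Measurement instances."""
    with open(path, newline="") as stream:
        return [
            Measurement(survey_id=row["survey_id"], diameter=float(row["diameter"]))
            for row in csv.DictReader(stream)
        ]
```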
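And the Polars alternative: `polars.read_csv` infers column types during the read, so no per-row conversion is needed (the function name below is a placeholder):

```python
import polars as pl

def load_measurements_df(path):
    """Read a measurements CSV into a Polars dataframe with inferred types."""
    return pl.read_csv(path)
```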
Bulk insertion
- Inserting many rows efficiently with `executemany` vs. a loop (see the sketch after this list)
  - Write a function that inserts every dataclass instance in a list into a database table
- Wrapping the import in a transaction so it succeeds or fails atomically
  - Refactor the function from the previous prompt to run in a transaction
- Counting rows and spot-checking a few values to confirm the import worked
  - Add checks to the function to ensure that the right number of rows were inserted
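A sketch combining all three steps for SQLite, reusing the hypothetical `Measurement` dataclass and `measurements` table from the parsing sketches:

```python
import sqlite3

def insert_measurements(connection, measurements):
    """Bulk-insert Measurement instances inside a single transaction."""
    # Using the connection as a context manager commits on success
    # and rolls back on any exception, so the import is all-or-nothing.
    with connection:
        connection.executemany(
            "INSERT INTO measurements (survey_id, diameter) VALUES (?, ?)",
            [(m.survey_id, m.diameter) for m in measurements],
        )
    # Spot-check: count the rows to confirm the import actually happened.
    (count,) = connection.execute("SELECT COUNT(*) FROM measurements").fetchone()
    assert count >= len(measurements), f"expected {len(measurements)} rows, found {count}"
```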
A taskipy task for data loading
- A taskipy task that runs the importer on a fresh database (see the pyproject.toml sketch below)
  - Write a taskipy task that drops and recreates the database and then runs the data loader
- Idempotency: what happens if you run it twice, and how to handle that (see the guard sketch below)
  - Modify the function so that it fails without modifying the database if the rows have already been inserted
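A sketch of the taskipy configuration in `pyproject.toml`; the script names are placeholders for whatever the project actually calls its database-creation and loading scripts:

```toml
[tool.taskipy.tasks]
# Drop and recreate the database, then run the loader against it.
load-data = "python create_db.py --fresh && python load_data.py"
```

Run it with `task load-data`.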
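And a sketch of the guard described in the last bullet, reusing `insert_measurements` from the bulk-insertion sketch: it fails before touching the database if data is already present:

```python
def insert_measurements_once(connection, measurements):
    """Refuse to load if the table already contains data."""
    (count,) = connection.execute("SELECT COUNT(*) FROM measurements").fetchone()
    if count > 0:
        raise RuntimeError(f"measurements already has {count} rows; refusing to reload")
    insert_measurements(connection, measurements)
```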
Check for Understanding
What does it mean for a data-loading script to be idempotent, and why does it matter?
An idempotent script produces the same result no matter how many times you run it. For a data loader, this typically means that if the data is already present, do not add it again. It matters because loaders are often re-run after failures, schema changes, or data corrections. A non-idempotent loader silently doubles the data on each run, making results inconsistent.
What goes wrong if you convert a CSV string field to `int` without catching exceptions?
If the field is missing (an empty string) or contains a non-numeric value like "N/A", `int("")` raises a `ValueError` and the entire import fails, potentially mid-transaction. You should wrap conversions in `try`/`except`, decide on a fallback value (e.g., `None` for nullable fields), and log a warning so you know which rows had problems.
Why is `executemany` faster than calling `execute` in a loop?
Each call to `execute` sends the SQL to the database engine, which parses and plans the query before running it. `executemany` parses and plans the statement once, then runs it repeatedly for each set of values. For thousands of rows the difference is substantial.
If the import transaction fails halfway through, how many rows end up in the database?
None. When a transaction rolls back, all changes made during the transaction are undone, leaving the database in exactly the state it was in before the transaction started. This is the whole point of wrapping the import in a transaction: partial imports are worse than no import because they are hard to detect and correct.
Exercises
Validate before inserting
Before inserting each row, check that all required fields are present and that numeric fields contain valid numbers. Collect warnings for rows that fail validation and print them after the import completes rather than stopping on the first error. Decide whether rows with validation failures should be skipped or should halt the import, and explain your reasoning.
Handle duplicate runs
Modify the loader so that running it twice does not insert duplicate rows. Choose between two approaches: `INSERT OR IGNORE` (which skips rows that would violate a unique constraint) or deleting all existing data before reloading. Implement one approach and explain why you chose it.
Add progress reporting
Add a counter that prints the number of rows inserted every 100 rows, and prints the total on completion. Prompt the LLM for help, and check that the code it produces does not use any external libraries that are not already in the project.
Load a second file
The snailz package generates both a measurements file and a sites file. Add a loader for the sites CSV and run both loaders in sequence. Before prompting the LLM for help, write down which loader must run first and why. Verify your answer is correct after the implementation is complete.