Loading Survey Data

Importing data into the database

Concepts

What snailz provides

Parsing a CSV with the standard library

Bulk insertion

A taskipy task for data loading

Check for Understanding

What does it mean for a data-loading script to be idempotent, and why does it matter?

An idempotent script produces the same result no matter how many times you run it. For a data loader, this typically means checking whether the data is already present and, if so, not adding it again. It matters because loaders are often re-run after failures, schema changes, or data corrections. A non-idempotent loader silently doubles the data on each run, making results inconsistent.
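One way to sketch that check with sqlite3 is to count existing rows before inserting; the measurements table and its columns below are illustrative, not the actual snailz schema:

```python
import sqlite3

def load_if_empty(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Insert rows only if the table is empty, so re-running is a no-op."""
    # 'measurements' is a placeholder table name for this sketch.
    (count,) = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()
    if count > 0:
        return 0  # data already loaded; do nothing
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, reading REAL)")
rows = [(1, 2.5), (2, 3.1)]
print(load_if_empty(conn, rows))  # first run inserts 2 rows
print(load_if_empty(conn, rows))  # second run inserts 0
```

A count-based guard is the bluntest form of idempotence; the "Handle duplicate runs" exercise below explores finer-grained alternatives.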

What goes wrong if you convert a CSV string field to int without catching exceptions?

If the field is missing (an empty string) or contains a non-numeric value like "N/A", int("") raises a ValueError and the entire import fails, potentially mid-transaction. You should wrap conversions in try/except, decide on a fallback value (e.g., None for nullable fields), and log a warning so you know which rows had problems.

Why is executemany faster than calling execute in a loop?

Each call to execute sends the SQL to the database engine, which parses and plans the query before running it. executemany parses and plans the statement once, then runs it repeatedly for each set of values. For thousands of rows the difference is substantial.
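The pattern looks like this with sqlite3's executemany; the readings table here is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, value REAL)")
rows = [("A", 1.0), ("B", 2.5), ("C", 3.7)]

# The statement is parsed and planned once, then run once per tuple in rows.
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 3
```

The placeholders also keep the values out of the SQL text itself, which avoids quoting bugs and injection problems regardless of speed.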

If the import transaction fails halfway through, how many rows end up in the database?

None. When a transaction rolls back, all changes made during the transaction are undone, leaving the database in exactly the state it was in before the transaction started. This is the whole point of wrapping the import in a transaction: partial imports are worse than no import because they are hard to detect and correct.
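sqlite3 makes this easy to demonstrate: using the connection as a context manager wraps the statements in a transaction that commits on success and rolls back if an exception escapes. The readings table and its NOT NULL constraint below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL NOT NULL)")

rows = [(1.0,), (2.0,), (None,)]  # the third row violates NOT NULL
try:
    with conn:  # one transaction around all three inserts
        for row in rows:
            conn.execute("INSERT INTO readings VALUES (?)", row)
except sqlite3.IntegrityError:
    pass  # the failure rolled back every insert, not just the bad one

(count,) = conn.execute("SELECT COUNT(*) FROM readings").fetchone()
print(count)  # 0: no partial import
```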

Exercises

Validate before inserting

Before inserting each row, check that all required fields are present and that numeric fields contain valid numbers. Collect warnings for rows that fail validation and print them after the import completes rather than stopping on the first error. Decide whether rows with validation failures should be skipped or should halt the import, and explain your reasoning.

Handle duplicate runs

Modify the loader so that running it twice does not insert duplicate rows. Choose between two approaches: using INSERT OR IGNORE, which skips rows that would violate a unique constraint, or deleting all existing data before reloading. Implement one approach and explain why you chose it.

Add progress reporting

Add a counter that prints the number of rows inserted every 100 rows and the total on completion. Ask the LLM for help, and check that its suggestion does not use any external libraries that are not already in the project.

Load a second file

The snailz package generates both a measurements file and a sites file. Add a loader for the sites CSV and run both loaders in sequence. Before asking the LLM for help, write down which loader must run first and why, then check that your answer was correct once the implementation is complete.