Polars
select Drops Unlisted Columns
Run the script and look at the columns in the result. How many columns does it have? How many columns did the original DataFrame have?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "score": [88, 72, 95],
})
result = df.select([
    (pl.col("score") * 1.1).alias("adjusted"),
])
print(result)
Explanation:
The bug is using .select() to add a new column. .select() returns only the
columns listed in the call and drops all others, so id and name are lost.
Shows the difference between .select() (choose columns) and .with_columns()
(add or replace columns while keeping the rest).
Lazy Frame Never Collected
Run the script and look at what is printed. Is the output a table of data, or
something else? What type does type(result) report?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, 72, 95, 61],
})
result = (
    df.lazy()
    .filter(pl.col("score") >= 80)
    .select(["name", "score"])
)
print(type(result))
print(result)
Explanation:
The bug is calling .lazy() to start a lazy pipeline but never calling .collect()
at the end, so the result is a LazyFrame (a query plan) rather than a DataFrame.
The filter and select have not executed. Shows the difference between eager and
lazy evaluation in Polars and when .collect() is required.
Null Comparison with == None
Run the script. How many rows does it print? How many rows contain a missing score?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, None, 95, None],
})
result = df.filter(pl.col("score") == None)  # noqa: E711
print(f"rows with missing score: {len(result)}")
print(result)
Explanation:
The bug is using == None to test for missing values. In Polars, any comparison
involving null yields null rather than a boolean, so the mask produced by
== None is null for every row and the filter keeps nothing. Shows null
semantics in Polars and how to use .is_null() to correctly select rows where a
value is missing.
Float Cast Truncates Instead of Rounding
Run the script and compare the values in the score column to the expected rounded
values printed below them. What happened to 88.7 and 95.6?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88.7, 72.3, 95.6, 61.9],
})
result = df.with_columns(pl.col("score").cast(pl.Int64))
print(result)
print("expected if rounded:", [round(x) for x in [88.7, 72.3, 95.6, 61.9]])
Explanation:
The bug is that cast(pl.Int64) truncates toward zero rather than rounding, so
88.7 becomes 88 and 95.6 becomes 95 instead of 89 and 96. No error is raised.
Shows that integer casting in Polars is a truncation operation, and how to use
.round(0).cast(pl.Int64) when rounding behavior is intended.
group_by Collapses Rows
Run the script. How many rows does the output have? How many rows did you expect?
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "dept": ["eng", "eng", "hr", "hr", "eng"],
    "salary": [90000, 85000, 70000, 75000, 92000],
})
result = df.group_by("dept").agg(pl.col("salary").mean().alias("dept_mean"))
print(f"input rows: {len(df)}, output rows: {len(result)}")
print(result)
Explanation:
The bug is using group_by().agg() when the goal is to add a per-row column showing
each employee's department mean. group_by().agg() collapses the DataFrame to one row
per group. Shows the difference between aggregation (which reduces rows) and window
functions (which compute a value per row), and how to use .over() inside
.with_columns() to attach group statistics to every row.
Default Inner Join Drops Rows
Run the script and compare the number of input orders to the number of rows in the result. Which order is missing, and why?
import polars as pl
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 20, 30, 99],
    "amount": [50.0, 75.0, 30.0, 20.0],
})
customers = pl.DataFrame({
    "customer_id": [10, 20, 30],
    "name": ["Alice", "Bob", "Carol"],
})
result = orders.join(customers, on="customer_id")
print(f"orders in input : {len(orders)}")
print(f"rows after join : {len(result)}")
print(result)
Explanation:
The bug is using the default join, which is inner. Order 4 has a customer_id of 99
that does not appear in the customers table, so it is silently dropped. Shows the
difference between inner and left joins, how to specify how="left" to retain all
rows from the left table, and how to verify row counts before and after a join.
Missing alias on Computed Column
Run the script and compare the column names and values to the original DataFrame. Which column was overwritten, and which column was supposed to be added?
import polars as pl
df = pl.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [3, 1, 2],
})
result = df.with_columns(pl.col("price") * pl.col("qty"))
print(result)
Explanation:
The bug is omitting .alias() on the expression. Without an explicit name, Polars
assigns the column the name of the left operand ("price"), which silently replaces
the original price column with the product values instead of adding a new total
column. Shows how Polars names unnamed expressions and why .alias() is needed
whenever the result should have a different name from its inputs.
Day and Month Swapped in Date Format
Run the script and look at the parsed date column. Is "03/04/2024" shown as
April 3rd or March 4th? What date was intended?
import polars as pl
df = pl.DataFrame({
    "event": ["conference", "deadline", "review"],
    "date_str": ["03/04/2024", "07/08/2024", "11/12/2024"],
})
result = df.with_columns(
    pl.col("date_str").str.to_date(format="%m/%d/%Y").alias("date")
)
print(result)
Explanation:
The bug is a mismatch between the data order (day/month/year) and the format string
(%m/%d/%Y, month/day/year). Because all day and month values are 12 or below,
every date parses without error, but each one is wrong. Shows how ambiguous numeric
date formats cause silent data corruption, and why checking a few parsed values
against known inputs is necessary to confirm the format string is correct.
explode on a String Column
Run the script and read the error message. What type does Polars report for the
tags column?
import polars as pl
df = pl.DataFrame({
    "id": [1, 2],
    "tags": ["python,data,science", "web,api"],
})
result = df.explode("tags")
print(result)
Explanation:
The bug is calling .explode() on a column that contains plain strings rather than
lists. Polars raises an InvalidOperationError because .explode() requires a
list-type column. Shows how to convert a delimited string column into a list
column with .str.split() before calling .explode(), and how to check column
types with .schema before applying list operations.
Cross-Reference in One with_columns Call
Run the script and read the error message. Which column is reported as not found? Is that column present in the original DataFrame?
import polars as pl
df = pl.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "tax_rate": [0.1, 0.2, 0.1],
})
result = df.with_columns([
    (pl.col("price") * 0.9).alias("discounted_price"),
    (pl.col("discounted_price") * (1 + pl.col("tax_rate"))).alias("total"),
])
print(result)
Explanation:
The bug is referencing discounted_price in the same .with_columns() call where
it is first computed. All expressions in a single .with_columns() call are
evaluated against the original DataFrame, so discounted_price does not yet exist
when total is computed, and Polars raises a ColumnNotFoundError. Shows how
Polars evaluates expressions in parallel within one call and how to chain two
separate .with_columns() calls when one result depends on another.