Polars

select Drops Unlisted Columns

Run the script and look at the columns in the result. How many columns does it have? How many columns did the original DataFrame have?

import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "score": [88, 72, 95],
})

result = df.select([
    (pl.col("score") * 1.1).alias("adjusted"),
])
print(result)
Explanation

The bug is using .select() to add a new column. .select() returns only the columns listed in the call and drops all others, so id and name are lost. Shows the difference between .select() (choose columns) and .with_columns() (add or replace columns while keeping the rest).

LazyFrame Never Collected

Run the script and look at what is printed. Is the output a table of data, or something else? What type does type(result) report?

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, 72, 95, 61],
})

result = (
    df.lazy()
    .filter(pl.col("score") >= 80)
    .select(["name", "score"])
)
print(type(result))
print(result)
Explanation

The bug is calling .lazy() to start a lazy pipeline but never calling .collect() at the end, so the result is a LazyFrame (a query plan) rather than a DataFrame. The filter and select have not executed. Shows the difference between eager and lazy evaluation in Polars and when .collect() is required.

Null Comparison with == None

Run the script. How many rows does it print? How many rows contain a missing score?

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "score": [88, None, 95, None],
})

result = df.filter(pl.col("score") == None)  # noqa: E711
print(f"rows with missing score: {len(result)}")
print(result)
Explanation

The bug is using == None to test for missing values. In Polars, any comparison involving null yields null rather than True or False, and .filter() treats null as false, so no rows are kept. Shows null semantics in Polars and how to use .is_null() to correctly select rows where a value is missing.

Float Cast Truncates Instead of Rounding

Run the script and compare the values in the score column to the expected rounded values printed below them. What happened to 88.7 and 95.6?

import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88.7, 72.3, 95.6, 61.9],
})

result = df.with_columns(pl.col("score").cast(pl.Int64))
print(result)
print("expected if rounded:", [round(x) for x in [88.7, 72.3, 95.6, 61.9]])
Explanation

The bug is that cast(pl.Int64) truncates toward zero rather than rounding, so 88.7 becomes 88 and 95.6 becomes 95 instead of 89 and 96. No error is raised. Shows that integer casting in Polars is a truncation operation, and how to use .round(0).cast(pl.Int64) when rounding behavior is intended.

group_by Collapses Rows

Run the script. How many rows does the output have? How many rows did you expect?

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "dept": ["eng", "eng", "hr", "hr", "eng"],
    "salary": [90000, 85000, 70000, 75000, 92000],
})

result = df.group_by("dept").agg(pl.col("salary").mean().alias("dept_mean"))
print(f"input rows: {len(df)}, output rows: {len(result)}")
print(result)
Explanation

The bug is using group_by().agg() when the goal is to add a per-row column showing each employee's department mean. group_by().agg() collapses the DataFrame to one row per group. Shows the difference between aggregation (which reduces rows) and window functions (which compute a value per row), and how to use .over() inside .with_columns() to attach group statistics to every row.

Default Inner Join Drops Rows

Run the script and compare the number of input orders to the number of rows in the result. Which order is missing, and why?

import polars as pl

orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 20, 30, 99],
    "amount": [50.0, 75.0, 30.0, 20.0],
})

customers = pl.DataFrame({
    "customer_id": [10, 20, 30],
    "name": ["Alice", "Bob", "Carol"],
})

result = orders.join(customers, on="customer_id")
print(f"orders in input : {len(orders)}")
print(f"rows after join : {len(result)}")
print(result)
Explanation

The bug is using the default join, which is inner. Order 4 has a customer_id of 99 that does not appear in the customers table, so it is silently dropped. Shows the difference between inner and left joins, how to specify how="left" to retain all rows from the left table, and how to verify row counts before and after a join.

Missing alias on Computed Column

Run the script and compare the column names and values to the original DataFrame. Which column was overwritten, and which column was supposed to be added?

import polars as pl

df = pl.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [3, 1, 2],
})

result = df.with_columns(pl.col("price") * pl.col("qty"))
print(result)
Explanation

The bug is omitting .alias() on the expression. Without an explicit name, Polars assigns the column the name of the left operand ("price"), which silently replaces the original price column with the product values instead of adding a new total column. Shows how Polars names unnamed expressions and why .alias() is needed whenever the result should have a different name from its inputs.

Day and Month Swapped in Date Format

Run the script and look at the parsed date column. Is "03/04/2024" shown as April 3rd or March 4th? What date was intended?

import polars as pl

df = pl.DataFrame({
    "event": ["conference", "deadline", "review"],
    "date_str": ["03/04/2024", "07/08/2024", "11/12/2024"],
})

result = df.with_columns(
    pl.col("date_str").str.to_date(format="%m/%d/%Y").alias("date")
)
print(result)
Explanation

The bug is a mismatch between the data order (day/month/year) and the format string (%m/%d/%Y, month/day/year). Because all day and month values are 12 or below, every date parses without error, but each one is wrong. Shows how ambiguous numeric date formats cause silent data corruption, and why checking a few parsed values against known inputs is necessary to confirm the format string is correct.

explode on a String Column

Run the script and read the error message. What type does Polars report for the tags column?

import polars as pl

df = pl.DataFrame({
    "id": [1, 2],
    "tags": ["python,data,science", "web,api"],
})

result = df.explode("tags")
print(result)
Explanation

The bug is calling .explode() on a column that contains plain strings rather than lists. Polars raises an InvalidOperationError because .explode() requires a list-type column. Shows how to convert a delimited string column into a list column with .str.split() before calling .explode(), and how to check column types with .schema before applying list operations.

Cross-Reference in One with_columns Call

Run the script and read the error message. Which column is reported as not found? Is that column present in the original DataFrame?

import polars as pl

df = pl.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "tax_rate": [0.1, 0.2, 0.1],
})

result = df.with_columns([
    (pl.col("price") * 0.9).alias("discounted_price"),
    (pl.col("discounted_price") * (1 + pl.col("tax_rate"))).alias("total"),
])
print(result)
Explanation

The bug is referencing discounted_price in the same .with_columns() call where it is first computed. All expressions in a single .with_columns() call are evaluated against the original DataFrame, so discounted_price does not yet exist when total is computed, and Polars raises a ColumnNotFoundError. Shows how Polars evaluates expressions in parallel within one call and how to chain two separate .with_columns() calls when one result depends on another.