How LLMs Work

Tokenization

import anthropic
client = anthropic.Anthropic()
response = client.messages.count_tokens(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "penguin bill length in millimeters"}]
)
print(response.input_tokens)   # number of tokens

The transformer architecture

Training
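Training an LLM means estimating next-token probabilities from a corpus. Real models fit billions of neural-network weights by gradient descent on this "predict the next token" objective; the bigram count model below is the smallest possible version of the same idea, on an invented two-sentence corpus.

```python
from collections import Counter, defaultdict

corpus = "the penguin dived and the penguin surfaced".split()

# Count how often each token follows each other token
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(token):
    """Estimated probability of each next token, given the previous one."""
    total = sum(counts[token].values())
    return {t: c / total for t, c in counts[token].items()}

print(next_token_probs("penguin"))  # 'dived' and 'surfaced' each seen once
```

Even this toy shows the key property: the model can only reproduce statistics of its training data. Tokens it never saw in a given context get probability zero, and frequent continuations dominate.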

Context windows
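A model can only attend to a fixed number of tokens at once, so a long conversation must eventually drop or summarize older turns. Below is a sketch of the simplest policy, dropping oldest turns first. The four-characters-per-token figure is only a rough heuristic for English text (an assumption here); in practice use a real token counter such as the count_tokens call shown earlier.

```python
def rough_token_count(text):
    # Crude heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_to_budget(messages, budget):
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = rough_token_count(msg["content"])
        if used + cost > budget:
            break                           # everything older is dropped too
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens, oldest
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 40},        # ~10 tokens, newest
]
print(len(trim_to_budget(history, budget=120)))   # oldest turn no longer fits
```

Dropping turns silently loses information; production systems often summarize the dropped turns instead, trading tokens for fidelity.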

Temperature and sampling

import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=256,
    temperature=0,  # greedy decoding: always pick the most likely token
    messages=[{"role": "user", "content": "List three penguin species."}]
)
print(response.content[0].text)
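Mathematically, temperature divides the model's logits before the softmax: low temperatures sharpen the distribution toward the most likely token, high temperatures flatten it toward uniform. The sketch below uses invented logits to show the effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.0, 0.5]                 # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print(round(cold[0], 3), round(hot[0], 3))   # the cold distribution is far more peaked
```

Note that temperature=0 in the API cannot be plugged into this formula (division by zero); it is the limiting case, implemented as greedy argmax over the logits.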

Training data cutoff

# LLM may suggest the old Polars API:
df.groupby("species").agg(...)   # deprecated in Polars >= 0.19

# Correct current API:
df.group_by("species").agg(...)

Hallucination

Retrieval-augmented generation
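The retrieval step can be sketched in a few lines: score each document against the question, then prepend the best match to the prompt so the model answers from provided text rather than from memory. Real systems use embedding-based similarity search instead of the word-overlap score below, and the documents here are invented for illustration.

```python
def retrieve(question, documents):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

documents = [
    "Adelie penguins have a mean bill length of about 38.8 mm.",
    "Gentoo penguins are the third largest penguin species.",
]
question = "What is the mean bill length of Adelie penguins?"

context = retrieve(question, documents)
prompt = f"Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
print(context)
```

Grounding the answer in retrieved text reduces hallucination, but only as far as the retriever is good: if the wrong document is retrieved, the model will confidently answer from the wrong context.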

Checking LLM output

# Quick sanity check: does the generated summary statistic match direct computation?
import polars as pl   # df is assumed to be the penguin DataFrame loaded earlier

llm_answer = 43.9   # LLM claimed this was mean bill length for Adelie penguins
actual = df.filter(pl.col("species") == "Adelie")["bill_length_mm"].mean()
assert abs(llm_answer - actual) < 0.1, f"Mismatch: LLM={llm_answer}, actual={actual:.1f}"

Exercises

  1. Use an LLM API to count tokens in the sentence "penguin bill length in millimeters"
    • Then count tokens for the same sentence in another language and compare
  2. Prompt an LLM to describe the Polars group_by function
    • Check its claims against the current Polars documentation and log any discrepancies
  3. Ask the same factual question twice with temperature 0 and then twice with temperature 1
    • Record how often the high-temperature answers differ and what this implies for reproducibility
  4. Prompt an LLM to describe an event that occurred after its training cutoff
    • Record how it signals (or fails to signal) uncertainty
  5. Prompt an LLM to cite three peer-reviewed papers on penguin bill morphology
    • Look up the DOIs and/or titles it provides: do the papers exist? Are the citations accurate?
  6. Prompt an LLM to explain its own limitations regarding training data cutoffs
    • Evaluate whether the explanation is accurate and complete based on what you have learned in this lesson
  7. Prompt an LLM to write a function that computes the mean of a list of numbers
    • Test it on an empty list, a list with one element, and a list where all elements are identical
    • Record which boundary cases (if any) it handled correctly without being told to