Nanochat / SLM Series - Part 3

The Economics of Beating GPT-2 for Under $100

When a GPT-2-class training run becomes cheap enough to repeat, the frontier moves from compute access to task design, data quality, and evaluation discipline.

Part 3 of 4. The nanochat story is not only technical but economic: cheap, repeatable training changes what small teams can afford to test.

[Image: miniature terracotta balance scale with blank tokens, clay blocks, trays, and rails]

The cost curve moved

Karpathy frames the GPT-2 comparison as a cost-curve event. GPT-2's largest 2019 training run is estimated at around $43,000 in cloud TPU cost; nanochat demonstrates GPT-2-level capability, trained from scratch on a single 8xH100 node, for well under $100 in the reported runs.
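
The size of that shift is easy to make concrete. The following is a minimal sketch that uses only the two figures above, the roughly $43,000 estimate and a $100 ceiling. The function names are illustrative, and the hourly rate taken by the second helper is a hypothetical input rather than a quoted price.

```python
def cost_reduction_factor(old_cost_usd: float, new_cost_usd: float) -> float:
    """Return how many times cheaper the new run is than the old one."""
    if old_cost_usd <= 0 or new_cost_usd <= 0:
        raise ValueError("costs must be positive")
    return old_cost_usd / new_cost_usd


def run_cost_usd(node_rate_usd_per_hour: float, wall_clock_hours: float) -> float:
    """Cost of one run on a rented node: hourly rate times wall-clock hours.

    The rate is an assumption supplied by the caller, not a real price.
    """
    return node_rate_usd_per_hour * wall_clock_hours


# $43,000 (2019 estimate) against a $100 ceiling for the reported runs.
factor = cost_reduction_factor(43_000, 100)
print(f"at least {factor:.0f}x cheaper")  # at least 430x cheaper
```

Because the reported runs come in under $100, 430x is a floor on the reduction rather than a point estimate.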

The README gives the practical version: train a GPT-2 capability model, then talk to it through a ChatGPT-like UI. The exact leaderboard number will change, but the strategic point does not: a model capability that used to be an institution-level expense is becoming a repeatable lab exercise.

Frontier signal: when training cost falls this far, the scarce resource shifts from raw compute access to good tasks, good data, and honest evaluation.

Cheap does not mean casual

A low-dollar run can still waste attention if the target is vague. nanochat's speedrun is useful because it defines the target: beat GPT-2's CORE score on a specific hardware setup and track wall-clock training time.

That is the part to copy. If a team wants to train a small model, the first move is not to rent GPUs. It is to define the scoreboard tightly enough that a cheaper training loop becomes a learning loop instead of a novelty expense.
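
One way to pin a scoreboard down before renting anything is to write it as data. This is a rough sketch, not nanochat's own code: the class names, the fields, and every number in the usage lines are hypothetical placeholders, and the explicit time budget is an added assumption (the speedrun itself tracks wall-clock time rather than capping it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scoreboard:
    """A fixed target for a training run. All values are caller-supplied."""
    metric: str                   # name of the benchmark, e.g. "CORE"
    baseline_score: float         # score the run has to beat
    hardware: str                 # agreed hardware setup, e.g. "8xH100"
    max_wall_clock_hours: float   # assumed time budget for the run


@dataclass(frozen=True)
class RunResult:
    """What one finished training run reports."""
    score: float
    hardware: str
    wall_clock_hours: float


def meets_target(board: Scoreboard, run: RunResult) -> bool:
    """True only if the run beats the baseline on the agreed hardware
    and stays within the time budget."""
    return (
        run.hardware == board.hardware
        and run.score > board.baseline_score
        and run.wall_clock_hours <= board.max_wall_clock_hours
    )


# Placeholder numbers for illustration; not real GPT-2 or nanochat scores.
board = Scoreboard("CORE", baseline_score=0.25, hardware="8xH100",
                   max_wall_clock_hours=4.0)
print(meets_target(board, RunResult(0.26, "8xH100", 3.0)))  # True
print(meets_target(board, RunResult(0.26, "8xA100", 3.0)))  # False
```

Fixing the metric, the hardware, and the time limit up front is what makes two runs comparable, so a cheaper loop produces evidence instead of anecdotes.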

[Image: clay trays on wooden rails repeating small blank model cubes and tokens]

Data is part of the economics

The DCLM paper emphasizes controlled dataset experiments, data curation, and broad downstream evaluation. That matters because cheaper training makes more experiments possible, but bad data still compounds into bad models.
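
A controlled dataset experiment, in the sense used here, varies the data and holds everything else fixed. The sketch below shows that idea only. It is not DCLM's tooling, and the config fields and dataset names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Settings for one training run (fields are illustrative)."""
    dataset: str
    model_size: str
    token_budget: int
    seed: int


def dataset_sweep(base: RunConfig, datasets: list[str]) -> list[RunConfig]:
    """Return one config per candidate dataset.

    Every field other than `dataset` is copied unchanged from `base`,
    so any difference in results can be attributed to the data.
    """
    return [replace(base, dataset=name) for name in datasets]


# Hypothetical names and sizes, for illustration only.
base = RunConfig(dataset="baseline-mix", model_size="small",
                 token_budget=1_000_000_000, seed=0)
for cfg in dataset_sweep(base, ["filtered-a", "filtered-b"]):
    print(cfg.dataset, cfg.model_size, cfg.token_budget, cfg.seed)
```

Cheap training is what makes a sweep like this affordable. The fixed fields are what make its results interpretable.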

Phi-3 points in the same direction from another angle: small models can become highly capable when trained on filtered public data, synthetic data, and alignment work. Small does not mean under-designed.

What cheap repeatable training changes

Sources behind the argument