The cost curve moved
Karpathy frames the GPT-2 comparison as a cost-curve event. GPT-2's largest 2019 training run is estimated at around $43,000 in cloud TPU cost; nanochat reaches GPT-2-level capability from scratch on a single 8xH100 node for well under $100 in the reported runs.
The README gives the practical version: train a GPT-2-capability model, then talk to it through a ChatGPT-like UI. The exact leaderboard number will change, but the strategic point does not: a capability that used to require an institution-level budget is becoming a repeatable lab exercise.
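To make the size of the shift concrete, here is a minimal arithmetic sketch using only the two figures quoted above. Because the nanochat figure is a ceiling ("well under $100"), the ratio it produces is a lower bound. The function names are illustrative and do not come from nanochat.

```python
# Figures quoted in the text. The nanochat number is an upper bound,
# so every result below is a lower bound on the real improvement.
GPT2_2019_COST_USD = 43_000
NANOCHAT_COST_CEILING_USD = 100


def cost_reduction_factor(old_cost_usd: float, new_cost_usd: float) -> float:
    """Return how many times cheaper the new run is than the old one."""
    if old_cost_usd <= 0 or new_cost_usd <= 0:
        raise ValueError("costs must be positive")
    return old_cost_usd / new_cost_usd


def runs_per_budget(budget_usd: float, cost_per_run_usd: float) -> int:
    """Return how many whole runs fit inside a fixed budget."""
    if cost_per_run_usd <= 0:
        raise ValueError("cost per run must be positive")
    return int(budget_usd // cost_per_run_usd)


factor = cost_reduction_factor(GPT2_2019_COST_USD, NANOCHAT_COST_CEILING_USD)
print(f"at least {factor:.0f}x cheaper")  # at least 430x cheaper
print(runs_per_budget(GPT2_2019_COST_USD, NANOCHAT_COST_CEILING_USD))  # 430
```

Read the second number as the practical consequence: the budget that once bought a single run now covers hundreds of them.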
Cheap does not mean casual
A low-dollar run can still waste attention if the target is vague. nanochat's speedrun is useful because it defines the target: beat GPT-2's CORE score on a specific hardware setup and track wall-clock training time.
That is the part to copy. If a team wants to train a small model, the first move is not to rent GPUs. It is to define the scoreboard tightly enough that a cheaper training loop becomes a learning loop instead of a novelty expense.
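One way to pin a scoreboard down before renting anything is to write it as data. The sketch below is an assumption about how a team might encode the speedrun idea (a fixed metric, a baseline to beat, fixed hardware, ranking by wall-clock time); it is not nanochat's actual leaderboard code. The class names, run names, and every number in it are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Scoreboard:
    metric: str            # name of the evaluation, e.g. "CORE"
    baseline_score: float  # score a run must exceed (placeholder value below)
    hardware: str          # runs on other hardware are not comparable


@dataclass(frozen=True)
class Run:
    name: str
    hardware: str
    score: float
    wall_clock_hours: float


def qualifies(board: Scoreboard, run: Run) -> bool:
    """A run counts only if it uses the agreed hardware and beats the baseline."""
    return run.hardware == board.hardware and run.score > board.baseline_score


def leaderboard(board: Scoreboard, runs: Iterable[Run]) -> List[Run]:
    """Qualifying runs, fastest wall-clock time first."""
    return sorted(
        (r for r in runs if qualifies(board, r)),
        key=lambda r: r.wall_clock_hours,
    )


# Hypothetical numbers, for illustration only.
board = Scoreboard(metric="CORE", baseline_score=0.25, hardware="8xH100")
runs = [
    Run("run-a", "8xH100", score=0.26, wall_clock_hours=3.1),
    Run("run-b", "8xH100", score=0.24, wall_clock_hours=2.0),  # misses baseline
    Run("run-c", "8xH100", score=0.27, wall_clock_hours=2.8),
    Run("run-d", "4xA100", score=0.30, wall_clock_hours=1.5),  # wrong hardware
]
print([r.name for r in leaderboard(board, runs)])  # ['run-c', 'run-a']
```

The point of the design is that the fastest run (run-d) and the cheapest-looking run (run-b) both fall off the board: neither answers the question the scoreboard asks.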
Data is part of the economics
The DCLM paper emphasizes controlled dataset experiments, data curation, and broad downstream evaluation. That matters because cheaper training makes more experiments possible, but bad data still compounds into bad models.
Phi-3 points in the same direction from another angle: small models can become highly capable when trained on filtered public data, synthetic data, and alignment work. Small does not mean under-designed.
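The controlled-experiment idea can be sketched in a few lines: hold everything except the dataset fixed, then compare datasets on an average across several downstream tasks rather than on one number. This is a simplified illustration of the principle, not DCLM's tooling; the config fields, dataset names, task names, and scores are all hypothetical.

```python
from dataclasses import asdict, dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class TrainConfig:
    dataset: str       # the only field allowed to vary
    model_size: str
    token_budget: int
    seed: int


def is_controlled(configs: List[TrainConfig]) -> bool:
    """True if the configs are identical in every field except `dataset`."""
    if not configs:
        return True

    def without_dataset(cfg: TrainConfig) -> dict:
        fields = asdict(cfg)
        fields.pop("dataset")
        return fields

    first = without_dataset(configs[0])
    return all(without_dataset(c) == first for c in configs[1:])


def rank_datasets(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Dataset names ordered by mean score across tasks, best first."""
    return sorted(results, key=lambda name: mean(results[name].values()), reverse=True)


# Hypothetical setup and scores, for illustration only.
configs = [
    TrainConfig("raw_crawl", "small", token_budget=1_000_000_000, seed=0),
    TrainConfig("filtered_crawl", "small", token_budget=1_000_000_000, seed=0),
]
results = {
    "raw_crawl": {"task_a": 0.40, "task_b": 0.50},
    "filtered_crawl": {"task_a": 0.45, "task_b": 0.55},
}
assert is_controlled(configs)
print(rank_datasets(results))  # ['filtered_crawl', 'raw_crawl']
```

Checking `is_controlled` before comparing results is the cheap guard: if the model size or token budget also changed, a score difference says nothing about the data.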
What cheap repeatable training changes
- More experiments become affordable: teams can test model/data choices instead of debating them abstractly.
- Evaluation becomes the limiter: without a useful scoreboard, cheaper runs only create more artifacts to inspect.
- Data work gets leverage: curation quality can matter more than another round of prompt ceremony.
- Specialists become plausible: a narrow model can earn a place in a workflow when training and rollback are cheap enough.