The primitive is a family, not a model
The sharp idea in nanochat miniseries v1 is that a training system should produce a family of models controlled by one spend dial. That lets the builder reason about scaling behavior before paying for the large run.
In nanochat, that dial is --depth. The README describes depth as the single complexity control that automatically determines related transformer dimensions, training horizon, and other hyperparameters. That matters because it turns small-model work from bespoke tuning into a controlled experiment.
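To make the single-dial idea concrete, here is a minimal sketch of a function that derives a whole configuration from depth alone. It is not nanochat's code: the name config_from_depth, the aspect ratio of 64, the head dimension of 128, the 12 * depth * model_dim^2 parameter estimate, and the 20 tokens-per-parameter horizon are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyMember:
    depth: int
    model_dim: int
    num_heads: int
    approx_params: int
    target_tokens: int


def config_from_depth(depth, aspect_ratio=64, head_dim=128, tokens_per_param=20):
    """Derive every other setting from depth (hypothetical constants)."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # Width grows linearly with depth (assumed aspect ratio).
    model_dim = depth * aspect_ratio
    # Head count follows from width at a fixed per-head dimension (assumed).
    num_heads = max(1, model_dim // head_dim)
    # Common rough estimate of non-embedding transformer parameters.
    approx_params = 12 * depth * model_dim ** 2
    # Training horizon scales with model size (assumed ratio).
    target_tokens = tokens_per_param * approx_params
    return FamilyMember(depth, model_dim, num_heads, approx_params, target_tokens)


if __name__ == "__main__":
    for depth in (10, 12, 16, 20):
        print(config_from_depth(depth))
```

The design point is that a caller cannot set width, heads, or horizon independently, so every member of the family differs along one axis only.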
Why miniseries thinking matters
Karpathy's miniseries write-up describes depth sweeps from d10 through d20, with d12 as a favorite scale for quick experiments and d24 to d26 landing around GPT-2 capability in the current code. The exact numbers matter less than the principle: each depth should be configured to be compute-optimal for its budget.
That is the discipline missing from a lot of small-model experimentation. A single small model can look cute in isolation. A miniseries tells you whether a change to the pipeline, data, architecture, optimizer, or evaluation actually improves the whole curve.
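One way to read "compute-optimal for its budget" is as a split of a fixed FLOPs budget between parameters and tokens. The sketch below uses the widely cited approximation that training costs about 6 * N * D FLOPs for N parameters and D tokens, combined with a fixed tokens-per-parameter ratio. The ratio of 20 is a Chinchilla-style assumption, not a number taken from nanochat, and compute_optimal_split is a hypothetical helper.

```python
import math


def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget into (params, tokens) at a fixed ratio.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    if flops_budget <= 0 or tokens_per_param <= 0:
        raise ValueError("budget and ratio must be positive")
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    for budget in (1e17, 1e18, 1e19):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e}  params~{n:.3g}  tokens~{d:.3g}")
```

Under these assumptions, a tenfold increase in budget raises both the parameter count and the token count by a factor of about 3.16, which is why a depth dial has to move model size and training horizon together.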
Validation loss is not enough
The miniseries write-up also makes a useful move away from comparing only validation loss. For nanochat's GPT-2 goal, the repo leans on the DCLM CORE metric so the comparison is tied to a broader capability score instead of a single loss curve.
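As a rough illustration of why an aggregate score behaves differently from a loss curve, the sketch below averages centered accuracy across tasks, where each task's accuracy is rescaled so that random guessing scores 0 and a perfect result scores 1. This follows the general shape of the CORE idea as I understand it; the task names and numbers are made up, and the helper names are hypothetical rather than the repo's evaluation code.

```python
def centered_accuracy(accuracy, random_baseline):
    """Rescale accuracy so chance maps to 0.0 and perfect maps to 1.0."""
    if not 0.0 <= random_baseline < 1.0:
        raise ValueError("random_baseline must be in [0, 1)")
    return (accuracy - random_baseline) / (1.0 - random_baseline)


def core_style_score(results):
    """Average centered accuracy over tasks.

    results maps task name -> (accuracy, random_baseline).
    """
    if not results:
        raise ValueError("need at least one task")
    scores = [centered_accuracy(acc, base) for acc, base in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up numbers for two hypothetical tasks.
    example = {
        "four_way_choice": (0.40, 0.25),
        "binary_choice": (0.70, 0.50),
    }
    print(core_style_score(example))
```

Centering keeps a task with a high chance baseline from inflating the average, so a model that only guesses contributes nothing to the score.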
DataComp-LM is relevant because it treats data curation, training recipes, and broad downstream evaluation as connected parts of the same system. That is the right model-training lesson: the metric is part of the product architecture.
What this means for practical model training
- Expose one control dial: make size and spend legible to non-research stakeholders.
- Compare families: avoid declaring victory from one lucky checkpoint.
- Use metrics that survive transfer: do not let validation loss become the only scoreboard.
- Require improvements across scale: a change that only helps one depth may be a local trick.
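The last two points can be turned into a simple acceptance check: compare a baseline family against a candidate family at every shared depth and accept the change only if it helps everywhere. The function name, the min_gain threshold, and the scores below are hypothetical, and a real comparison would also need to account for run-to-run noise.

```python
def improves_across_scale(baseline, candidate, min_gain=0.0):
    """Return (accepted, gains) for two families keyed by depth.

    baseline and candidate map depth -> metric (higher is better).
    The change is accepted only if every shared depth gains more
    than min_gain.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("families share no depths to compare")
    gains = {depth: candidate[depth] - baseline[depth] for depth in shared}
    accepted = all(gain > min_gain for gain in gains.values())
    return accepted, gains


if __name__ == "__main__":
    # Made-up scores: the candidate helps at d12 but hurts at d16.
    baseline = {10: 0.10, 12: 0.13, 16: 0.18}
    local_trick = {10: 0.10, 12: 0.16, 16: 0.17}
    accepted, gains = improves_across_scale(baseline, local_trick)
    print(accepted, gains)
```

A change that wins at a single depth fails this check, which is the intended behavior: it flags the result as a possible local trick rather than a real improvement to the curve.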