Nanochat / SLM Series - Part 2

The Depth Dial and the Compute-Optimal Miniseries

nanochat turns small-model training into a family of comparable runs: one depth dial, compute-aware horizons, and metrics that make improvements visible.

Part 2 of 4. nanochat makes the model family visible: a single depth dial, repeatable training, and a miniseries of compute-optimal checkpoints instead of one heroic run.

The primitive is a family, not a model

The sharp idea in nanochat miniseries v1 is that a training system should produce a family of models controlled by one spend dial. That lets the builder reason about scaling behavior before paying for the large run.

In nanochat, that dial is --depth. The README describes depth as the single complexity control that automatically determines related transformer dimensions, training horizon, and other hyperparameters. That matters because it turns small-model work from bespoke tuning into a controlled experiment.
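To make the "one dial" idea concrete, here is a minimal sketch of how a single depth value could fan out into the rest of a run configuration. It is not nanochat's actual code. The names `RunConfig` and `config_from_depth` are hypothetical, and the constants `ASPECT_RATIO`, `HEAD_DIM`, and `TOKENS_PER_PARAM` are illustrative assumptions rather than values taken from the repo. The parameter count uses the common 12 * depth * width^2 approximation for transformer blocks (embeddings ignored), and training compute uses the common 6 * parameters * tokens estimate.

```python
from dataclasses import dataclass

# All three constants are illustrative assumptions, not nanochat's settings.
ASPECT_RATIO = 64       # assumed: model width grows linearly with depth
HEAD_DIM = 128          # assumed: fixed per-head dimension
TOKENS_PER_PARAM = 20   # assumed: fixed tokens-per-parameter training horizon


@dataclass(frozen=True)
class RunConfig:
    depth: int
    model_dim: int
    num_heads: int
    block_params: int   # approximate parameters in transformer blocks only
    train_tokens: int
    train_flops: int


def config_from_depth(depth: int) -> RunConfig:
    """Derive a whole run configuration from the single depth dial."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    model_dim = depth * ASPECT_RATIO
    num_heads = max(1, model_dim // HEAD_DIM)
    # Attention (~4 * d^2) plus a 4x-expansion MLP (~8 * d^2) per layer.
    block_params = 12 * depth * model_dim ** 2
    train_tokens = TOKENS_PER_PARAM * block_params
    # Standard estimate: about 6 FLOPs per parameter per training token.
    train_flops = 6 * block_params * train_tokens
    return RunConfig(depth, model_dim, num_heads, block_params,
                     train_tokens, train_flops)


if __name__ == "__main__":
    for d in (10, 12, 16, 20):
        cfg = config_from_depth(d)
        print(f"d{d}: dim={cfg.model_dim} heads={cfg.num_heads} "
              f"params~{cfg.block_params:,} tokens~{cfg.train_tokens:,}")
```

The design point is that nothing except `depth` is passed in by hand, so two runs at different depths differ only in ways the mapping dictates.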

Builder translation: the goal is not merely to train a smaller model. The goal is to create a repeatable curve that says what more compute buys, where gains flatten, and which changes improve every depth.

Why miniseries thinking matters

Karpathy's miniseries write-up describes depth sweeps from d10 through d20, with d12 as a favorite scale for quick experiments and d24 to d26 landing around GPT-2 capability in the current code. The exact numbers matter less than the principle: each depth should be configured to be compute-optimal for its budget.

That is the discipline missing from a lot of small-model experimentation. A random small model can look cute. A miniseries tells you whether the pipeline, data, architecture, optimizer, and evaluation are improving the curve.
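One way to act on this is to judge a change by the whole sweep rather than by one run. The sketch below assumes you already have one score per depth for a baseline sweep and a candidate sweep; the function name `compare_sweeps` and the dictionary layout are hypothetical, and the numbers in the usage example are placeholders, not measured results.

```python
def compare_sweeps(baseline, candidate, higher_is_better=True):
    """Compare two depth sweeps, each given as {depth: score}.

    Returns per-depth gains and whether the candidate improves
    at every depth the two sweeps have in common.
    """
    depths = sorted(set(baseline) & set(candidate))
    if not depths:
        raise ValueError("the two sweeps share no depths")
    sign = 1.0 if higher_is_better else -1.0
    # A positive gain always means "candidate is better at this depth".
    gains = {d: sign * (candidate[d] - baseline[d]) for d in depths}
    regressions = [d for d in depths if gains[d] <= 0]
    return {
        "gains": gains,
        "regressions": regressions,
        "improves_everywhere": not regressions,
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    baseline = {10: 0.10, 12: 0.13, 16: 0.18}
    candidate = {10: 0.11, 12: 0.12, 16: 0.20}
    report = compare_sweeps(baseline, candidate)
    print(report["improves_everywhere"], report["regressions"])
```

For a metric such as validation loss, pass `higher_is_better=False` so that a lower value counts as a gain.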

Validation loss is not enough

The miniseries write-up also makes a useful move away from comparing only validation loss. For nanochat's GPT-2 goal, the repo leans on the DCLM CORE metric so the comparison is tied to a broader capability score instead of a single loss curve.
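To show why an aggregate capability score behaves differently from a single loss curve, here is a rough sketch of a centered-accuracy average in the spirit of CORE. It assumes each task reports a raw accuracy and a random-guess baseline, rescales each task so chance maps to 0 and a perfect score maps to 1, then averages. The function names and the task names in the example are hypothetical, and this is not the repo's evaluation code.

```python
def centered_accuracy(accuracy, chance):
    """Rescale accuracy so random guessing is 0.0 and perfect is 1.0."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def core_like_score(results):
    """Average centered accuracy over tasks.

    `results` maps a task name to (accuracy, chance_baseline).
    """
    if not results:
        raise ValueError("no task results given")
    centered = [centered_accuracy(acc, chance)
                for acc, chance in results.values()]
    return sum(centered) / len(centered)


if __name__ == "__main__":
    # Hypothetical tasks with placeholder numbers, for illustration only.
    results = {
        "four_way_multiple_choice": (0.40, 0.25),
        "binary_choice": (0.60, 0.50),
        "open_ended": (0.15, 0.00),
    }
    print(round(core_like_score(results), 4))
```

Centering keeps a task with a high chance baseline from dominating the average just because its raw accuracy starts high.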

DataComp-LM is relevant because it treats data curation, training recipes, and broad downstream evaluation as connected parts of the same system. That is the right model-training lesson: the metric is part of the product architecture.

What this means for practical model training

Sources behind the argument