Training Small Model Abilities

Part 4 of 4. Once the economics work, the practical question changes: how do you teach a small model one useful ability without turning the dataset into mush?

Miniature terracotta training bench with sorted fruit shapes, beads, blocks, blank cards, and trays

Abilities are data contracts

The strawberry counting guide is valuable because it makes the ability concrete. The task is not magic reasoning; it is a generated conversation pattern, a target behavior, and a training stage that teaches nanochat to approach a narrow problem in a repeatable way.

The guide uses synthetic conversations, prompt variation, explicit spelling, and a Python double-check to teach a small model a behavior it was bad at. That is the right mental model for small specialists: the ability is a data contract with examples, triggers, target outputs, and failure modes.

Tokenization is not incidental

The guide explicitly breaks words into characters because tokenization hides the thing the model must count. That maps directly to the RPG-state experiment: the labels, delimiters, and status channels are not superficial formatting. They are the model's interface to the world.

When tiny models fail, the failure often looks like intelligence drifting. Underneath, it is frequently an interface problem: fuzzy tokens, overloaded labels, missing negative examples, or a task shape that does not make the right intermediate state visible.

Close view of bead strings, blank tiles, fruit-shaped pieces, and clay trays connected to one small cube

Identity is also trainable behavior

The identity guide shows the same pattern applied to persona and self-description. Karpathy describes generating synthetic multi-turn conversations and mixing them into midtraining and SFT so nanochat learns what it is supposed to know about itself.

That is not just flavor. In applied systems, identity includes tool boundaries, refusal style, domain commitments, and what the assistant should claim or avoid claiming. For small models, those behaviors should be trained and evaluated like any other ability.

How this loops back to the pico model

The RPG experiment is an ability-training problem wearing a game costume. The model is not learning everything about games. It is learning a narrow transition grammar and then revealing where the grammar is under-specified.

Generate targeted examples: cover each action, status, cooldown, and boundary case.
Vary triggers without blurring labels: prompt diversity helps; semantic drift hurts.
Train recovery, not just success: malformed states and contradictory inputs belong in the eval set.
Keep the output contract inspectable: a tiny model is only useful if failures are easy to spot.

Series takeaway: nanochat makes the training loop understandable; SLM research explains why small models matter; the pico-LLM experiment shows how the same ideas become a product-sized specialist.