Failure Mode

Missing Transitions

The model loves the next meaningful scene. The harness needs the boring middle.

Every image in this article is generated. John is an emergent fictional character from an image-model continuity experiment, not a real person.

The Strange Moment

The clearest failure in John’s World is not when the image looks wrong. It is when the image looks too ready.

John is supposed to notice a message. The next valid beat is tiny: move a hand toward the phone pocket, pull the phone partly out, lift it, then read. The harness even breaks the action into micro-movements because earlier runs kept jumping ahead. This is not a request for cinema. This is a request for a man to begin taking a phone from his pocket.

The model keeps trying to overachieve.

In exploration 07, the intended beat is simple: John lifts the phone toward viewing height, and the screen is not yet fully readable. The rendered frame gives us a readable message context too soon. The validator calls it macro-overcompleted. That phrase sounds like a corporate performance review for a system that cannot stop being helpful.

The problem is familiar if you have worked with agents. The model sees the intent and rushes to the satisfying part. It does not want to spend a frame on a partly extracted phone, an unreadable screen, and no reply yet. It wants to get to the message.

Generated phone-screen frame that advanced too far into a message interaction.
Exploration 07, frame 0077. The rendered frame got too far ahead of the required microbeat, triggering a macro-overcompleted rejection.

What the System Was Trying to Do

The harness was trying to slow the world down. Instead of asking for a full action, it generated microbeats. Reach toward pocket. Grip phone. Pull phone partly out. Lift phone. Focus screen. Read. Reply only after the message is actually visible and understood.
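The microbeat list above can be sketched as an ordered enumeration. This is a minimal illustration, not the harness's actual code: the beat names are lifted from the prose, and `PhoneBeat`, `next_beat`, and `may_reply` are hypothetical names invented for the example.

```python
from enum import IntEnum
from typing import Optional


# Hypothetical beat names taken from the prose; not the harness's real identifiers.
class PhoneBeat(IntEnum):
    REACH_TOWARD_POCKET = 0
    GRIP_PHONE = 1
    PULL_PARTLY_OUT = 2
    LIFT_PHONE = 3
    FOCUS_SCREEN = 4
    READ = 5
    REPLY = 6


def next_beat(completed: Optional[PhoneBeat]) -> Optional[PhoneBeat]:
    """Return the single beat the next frame is allowed to depict."""
    if completed is None:
        return PhoneBeat.REACH_TOWARD_POCKET  # nothing has happened yet
    if completed is PhoneBeat.REPLY:
        return None  # the sequence is finished
    return PhoneBeat(completed + 1)


def may_reply(completed: Optional[PhoneBeat]) -> bool:
    """A reply is only unlocked once the message has actually been read."""
    return completed is not None and completed >= PhoneBeat.READ
```

The point of the ordering is that there is exactly one legal next beat at any moment, so "skipping ahead" becomes something the harness can detect rather than something it has to guess at.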

This is the right instinct. Generated worlds fail when they summarize action as if they were writing a caption. Humans can infer the boring parts, but a persistent world has to account for them. If the phone is already unlocked and the message is already read, then the system has skipped state changes that later logic depends on.

In the later runs, the harness starts rejecting frames. It marks some as missing the required beat. It marks others as macro-overcompleted. It holds the previous state when a render jumps too far. That is progress, even if it feels like arguing with a very confident intern who keeps turning in chapter three when you asked for the first sentence.
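The three outcomes described here (accept, missing beat, macro-overcompleted) plus the hold-previous-state rule could look roughly like the sketch below. It assumes some upstream frame analysis has already estimated how many beats a frame depicts; that estimate, and the names `Verdict`, `WorldState`, `classify`, and `step`, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Verdict(Enum):
    ACCEPTED = "accepted"
    MISSING_REQUIRED_BEAT = "missing-required-beat"
    MACRO_OVERCOMPLETED = "macro-overcompleted"


@dataclass(frozen=True)
class WorldState:
    completed_beats: int  # how many microbeats the world has earned so far


def classify(state: WorldState, depicted_beats: int) -> Verdict:
    """Compare how far the frame got against the one beat that was requested.

    depicted_beats is assumed to come from frame analysis, which is out of scope here.
    """
    required = state.completed_beats + 1
    if depicted_beats < required:
        return Verdict.MISSING_REQUIRED_BEAT
    if depicted_beats > required:
        return Verdict.MACRO_OVERCOMPLETED
    return Verdict.ACCEPTED


def step(state: WorldState, depicted_beats: int) -> Tuple[WorldState, Verdict]:
    """Advance only on acceptance; any rejection holds the previous state."""
    verdict = classify(state, depicted_beats)
    if verdict is Verdict.ACCEPTED:
        return WorldState(state.completed_beats + 1), verdict
    return state, verdict
```

Returning the unchanged state on rejection is what keeps a single over-eager frame from rewriting what the world has actually earned.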

Run note: Exploration 06 contains 36 invalid frames. Exploration 07 contains 27. That is not failure in the useless sense. It is the validator finally seeing the category of mistake clearly enough to say no.

Generated frame of John near the garage threshold where the required phone microbeat was not satisfied.
Exploration 06, frame 0072 pending. The harness rejected the frame because the required beat was missing.

What Broke

The transition layer broke.

The model is good at endpoints. Person has phone. Person reads message. Person is at store. Person is in parking lot. Person returns home. Each endpoint is visually familiar. The expensive part is the chain that proves how one endpoint became the next.

When that chain is missing, object tracking gets fuzzy. The phone can become visible before it was extracted. A message can become read before the screen was readable. A store can arrive before the drive. A reply can be implied before the outbound action. The world still looks plausible, but the ledger has bad math.
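One way to make "bad math in the ledger" concrete is to check that every logged event was preceded by the event that earns it. The sketch below is an assumption-heavy illustration: the event names and the `PREREQUISITES` table are invented from the examples in this paragraph, and a real harness would likely need more than one prerequisite per event.

```python
from typing import Dict, List, Tuple

# Hypothetical prerequisite table built from the examples in the text.
PREREQUISITES: Dict[str, str] = {
    "phone_visible": "phone_extracted",
    "message_read": "screen_readable",
    "at_store": "drive_completed",
    "reply_sent": "outbound_action",
}


def unearned_events(ledger: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, event, missing_prerequisite) for each event logged too early."""
    seen = set()
    problems = []
    for index, event in enumerate(ledger):
        required = PREREQUISITES.get(event)
        # An event is unearned if its prerequisite has not appeared earlier in the ledger.
        if required is not None and required not in seen:
            problems.append((index, event, required))
        seen.add(event)
    return problems
```

For example, a ledger of `["phone_extracted", "phone_visible", "message_read"]` would flag `message_read` at index 2, because `screen_readable` never appeared before it.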

This is why I do not trust pretty frames by themselves. A single frame can pass the eye test and still poison the run. If the image includes future evidence, the next action planner inherits a lie. Then the harness either has to repair it or pretend the world earned something it did not.

Generated image of John near a store after a transition-heavy errand sequence.
Exploration 05, frame 0074. Store, basket, vehicle, and phone context all compete for continuity.

Why It Is Interesting

Missing transitions are the hidden cost of generated worlds. They are easy to ignore because the model can make the destination feel natural. If John is at a store, your brain fills in the drive. If he is holding the phone, your brain fills in the pocket movement. If the message is visible, your brain fills in the unlock.

A harness cannot afford that generosity. It has to know which facts are observed, which are inferred, and which are not allowed yet. Otherwise the world becomes a series of plausible after-the-fact explanations.

This is the same reason agent systems need trace discipline. If the final answer appears without the intermediate evidence, you may still like the answer. But you do not know whether the system reasoned, guessed, or skipped. John’s World makes that visible because the missing reasoning becomes a missing door, missing drive, missing pocket action, or missing object handoff.

Next Harness Change

The next harness should treat microbeats as hard contracts. If the selected beat says the phone screen is not readable yet, then a readable message is future evidence and the frame should be rejected. If the drive has not happened, the store cannot appear unless the transition is explicitly summarized and committed.
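A hard contract of this kind might be expressed as a set of required evidence plus a set of forbidden, future evidence. The following is a sketch under stated assumptions: the evidence tags, the `BeatContract` type, and the `unlocked_by_commit` parameter (tags released by an explicitly summarized and committed transition) are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class BeatContract:
    name: str
    required: FrozenSet[str]   # evidence the frame must show
    forbidden: FrozenSet[str]  # future evidence the frame must not show yet


def violations(
    contract: BeatContract,
    frame_evidence: Set[str],
    unlocked_by_commit: Set[str],
) -> List[str]:
    """List every reason to reject the frame; an empty list means it passes."""
    reasons = []
    for tag in sorted(contract.required - frame_evidence):
        reasons.append(f"missing required evidence: {tag}")
    for tag in sorted(contract.forbidden & frame_evidence):
        if tag in unlocked_by_commit:
            continue  # a committed transition summary has earned this evidence
        reasons.append(f"future evidence: {tag}")
    return reasons


# Hypothetical contract for the "lift phone" beat.
lift_phone = BeatContract(
    name="lift_phone",
    required=frozenset({"phone_in_hand"}),
    forbidden=frozenset({"readable_message", "store_exterior"}),
)
```

Under this sketch, a frame showing `phone_in_hand` and `readable_message` is rejected for future evidence, while `store_exterior` passes only if a committed transition has unlocked it.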

It also needs a clearer distinction between observed facts and inferred facts. “John probably has the keys” is useful. “John visibly pocketed the keys” is stronger. “John is now allowed to drive” is a policy decision. Those should not collapse into the same blob of text.
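The keys example can be sketched by tagging each fact with where it came from and keeping the policy decision in a separate function. The rule used here (only an observed fact unlocks driving) is an assumption chosen for illustration, and `Source`, `Fact`, and `allowed_to_drive` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Source(Enum):
    OBSERVED = "observed"  # visible in an accepted frame
    INFERRED = "inferred"  # plausible, but never shown


@dataclass(frozen=True)
class Fact:
    claim: str
    source: Source


def allowed_to_drive(facts: Iterable[Fact]) -> bool:
    """Policy decision, kept apart from the facts themselves.

    Assumed rule: only observed key possession unlocks driving.
    """
    return any(
        fact.claim == "john_has_keys" and fact.source is Source.OBSERVED
        for fact in facts
    )
```

Keeping the source on the fact and the permission in a function means "probably has the keys" can stay in the record without quietly granting anything.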

That is the next serious version of this experiment: less magic, more receipts. Which, for a generated suburban world, feels thematically appropriate.