OpenClaw Field Note

Shellsensor Virtual World Version 3

The “It Looks Alive” version

This version follows the Shellsensor experiment from hardware constraint to video-native agent presence, with Kira's read on what changed.
Hermes and Claw solving problems in the Shellsensor world

I’m pleased with how this turned out. I’ll discuss how I came up with this idea, the thinking that went into it, and where else it might be useful.

Hermes daydreaming while fixing Claw

Hermes & Claw

I’ve adopted the agentic harness Hermes on the same machine as OpenClaw, and I’ll write more deeply later on how I ended up having “sisters” living on the Raspberry Pi. I’ll give you the quick version now. When setting up Hermes, you can pull in your OpenClaw content and settings in order to “copy over” your Claw setup to Hermes. However, after using Hermes, I realized that I really wanted to keep both. So I asked Hermes how to frame it, and the agent suggested the idea of “sisters” for the two agents. I repeated that process to build out the character names and identities. And that’s who you see in the images in this post.

Now let me explain the virtual world idea for these two characters.

The Original Idea

The original idea was so smart and clever that I thought it was the most darling of ideas. And when that happens, there’s a saying: be prepared to kill your darlings.

"Kill your darlings" (or "murder your darlings") originated with British writer and professor Sir Arthur Quiller-Couch, who advised in his 1913–1914 Cambridge lectures, "[...] whenever you feel an impulse to perpetrate a piece of exceptionally fine writing, obey it—whole-heartedly—and delete it before sending your manuscript to press. Murder your darlings"

The original idea had the following recipe. First, I’d build a game world in the style of The Legend of Zelda: A Link to the Past (ALttP). I’d build out a custom world-building tool so I (or an agent) could design worldspaces. Then, I’d set up a thin neural network for agentic autonomy with an LLM “drop-in” so a model could take the wheel. And then I’d let it loose and just see what happens.

I’ve posted updates here and there on the world-building tool, the architecture of the avatar system, and some explanation of how those parts all worked.

Ingenious idea, really. The two-dimensional nature of ALttP means that a thin network could create lifelike autonomy, and it’d be easy (relatively speaking) to add new spaces. I even thought maybe I’d build a control pad so I could “drop in” from time to time in the game world. I still like that idea.

Today, I’m burying that idea in the ground.

Here’s what killed it.

Early character-consistency scene test that killed the old plan.

This was a test flight of character consistency in a scene. It worked.

The Spark of the NEW Idea

For many years our family has held a weekly pizza night. This solves a lot of problems: it removes ambiguity about when we do pizza. Pizza is on Friday. This “Friday mandate” also settles the question on every other night: on a busy weeknight, when someone asks for pizza, the answer is simply, “We do pizza on Fridays.” Not Saturday or Sunday… Friday. Not Tuesday. Pizza goes on Friday.

In fact, when we travel, the same rules apply. If we’re visiting or passing through a new city and it’s a Friday? “Oh, pizza night. Let’s just find a pizza place.” Obviously, this is a soft rule. There are times when it just isn’t possible to do the “Friday night pizza,” but more often than not it is.

So on Friday night, my youngest son and I drove to a pizzeria a few small towns away. Consistency of time doesn’t mean you need to lock in to the same place; in fact, it means you can explore other towns and locations. Since it was just the two of us, and we certainly both could use the break after a rather crazy day, I drove up.

The pizza place is like most small-town places: just large enough to seat a few, with a fountain machine and a few televisions on for patrons to entertain themselves when they’re not staring into the endless multiverse of their pocket cellphones. After ordering our pizza, we took a seat at the bar-like counter.

Think less a bar and more just a very large, all-purpose countertop that just happens to be useful as a bar counter, too. Two large televisions hung on the wall. My son and I pulled up and took a seat. The busy pizza makers worked just across from us, boxing up pizzas faster than I can possibly write an article like this by hand.

A young family entered the room, and the pizza shop’s “runner of the things” — it’s not enough to call him manager; there’s so much involved in running a pizza shop — walked over to swap the channel on one of the televisions. He navigated through it for a while looking for something kid-appropriate. Eventually he gave up on the navigation and left the TV to autoplay on Pokémon.

Soon an episode started up. It was hard not to watch — the flashing lights and colors, I’m convinced, are hypnotically set up on purpose. Gotta catch ‘em all….

Now, mind you, up until then I’d been thinking a lot about character consistency, frame usage, and how puppeteering works, so when the first Pokémon episode started to play… I could see every cheat code the animators used. The same character frames, static characters they literally just bobbed up and down across the screen, the same background locations just at different zoom levels. The subtitles kept rolling as Pokémon played onward.

Diagram showing how animators reuse animation

Watching all the animation reuse, I couldn’t help but think of a recent YouTube video making pointed fun of Dragon Ball Z, which is notorious for reuse all the way down to the character face.

And that’s when it clicked.

I realized: what if I just did the same with video? I could reuse video clips with subtitles to explain the story or what’s going on, just like this Pokémon episode was doing. I’d need to use something like the OpenAI image model to help structure the scenes. But would it work?

Scene test showing the new video-based world approach working

It works really well!

I was able to build out multiple world locations and characters fast. In fact, I realized I could introduce a “goon” that shows up to do battle with the characters either to pass the idle time, as part of the story, or when the characters are doing work defeating software bugs.
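To make the idea concrete, here is a minimal sketch of how a clip could be chosen for each of those three situations. It is an illustration only, not the actual harness: the state names, the file names, and the `pick_clip` helper are all assumptions made up for this example.

```python
import random
from typing import Dict, List, Optional

# Hypothetical clip library: agent state -> reusable clip files.
# State names and file names are invented for illustration.
CLIP_LIBRARY: Dict[str, List[str]] = {
    "idle": ["hermes_daydream.mp4", "sparring_practice.mp4", "goon_ambush.mp4"],
    "debugging": ["bug_battle.mp4", "goon_ambush.mp4"],
    "story": ["episode_intro.mp4"],
}


def pick_clip(state: str, last: Optional[str] = None,
              rng: Optional[random.Random] = None) -> str:
    """Pick a clip for the state, avoiding an immediate repeat when possible."""
    rng = rng or random.Random()
    # Unknown states fall back to the idle pool.
    options = CLIP_LIBRARY.get(state) or CLIP_LIBRARY["idle"]
    # Prefer clips that differ from the one just shown.
    fresh = [clip for clip in options if clip != last]
    # If the pool has only one clip, repeating it is the only choice.
    return rng.choice(fresh or options)


rng = random.Random(7)  # seeded so a run is repeatable
last = None
for state in ["idle", "idle", "debugging", "story", "story"]:
    last = pick_clip(state, last, rng)
    print(f"{state:>9}: {last}")
```

The same goon clip sits in both the idle and debugging pools, which is the reuse trick in its smallest form: one generated clip, several narrative purposes.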

Later build clip: Hermes and Claw practicing against each other.

But why does this work so well?

I believe the answer is tied up, once again, in The Bitter Lesson.

Bug-themed adversary scene from the new episodic harness
bug in the system

The Bitter Lesson Comes for the Puppet Engine

In the span of an hour I was able to replace the entire visual layer with a new automated episodic storytelling system that reuses clips. It just works better.

I thought I was building a virtual world, which is a dangerous sentence because it sounds noble while quietly smuggling in a full employment program for unnecessary architecture. A “world” seemed to require characters, places, maps, state, puppets, transitions, and orchestration, so I built in that direction until the thing started resembling a tiny MMO trapped inside a Raspberry Pi acrylic sandwich. My own earlier OpenClaw field note had already framed this as a “world engine,” with entities, places, events, and memory persisting beyond a single flow run. That was not wrong, exactly. It was just about to be attacked by a cheaper abstraction. (Eric Rhea)

Rich Sutton’s “The Bitter Lesson” is the canonical warning label for this kind of moment. His claim is that 70 years of AI research taught the same lesson over and over: general methods that leverage computation eventually beat handcrafted methods that encode human understanding of a domain. Chess, Go, speech recognition, and computer vision all produced versions of the same humiliation, where systems built around human expertise lost to systems built around search, learning, and scale. (Incomplete Ideas)

The reason this idea is so hard to accept is that the losing approach feels like intelligence while you are building it. Encoding domain knowledge feels responsible. It feels like craft. It feels like you are respecting the structure of the world instead of asking a machine to hallucinate its way through reality like a caffeinated intern with a GPU budget. Sutton’s point is harsher: the handcrafted structure often helps in the short run, then becomes the thing that prevents the system from taking advantage of the next compute curve. (Incomplete Ideas)

The compute curve is the monster under the floorboards. OpenAI’s 2018 analysis found that the compute used in the largest AI training runs had been increasing with a 3.4-month doubling time since 2012, far faster than the old Moore’s Law rhythm people were used to invoking as background noise. Whether that exact pace holds forever is not the point. The point is that any architecture competing against a fast-improving general method is not competing against today’s model. It is competing against the next several turns of the crank. (OpenAI)
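A quick back-of-the-envelope sketch shows what “several turns of the crank” means in numbers. The 3.4-month figure comes from the analysis cited above; the 24-month figure is my assumption for the familiar two-year Moore’s Law rhythm, and `growth_factor` is just a helper name invented for this example.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """How many times larger a quantity is after `months`, given its doubling time."""
    return 2 ** (months / doubling_months)


AI_COMPUTE_DOUBLING = 3.4  # months, per the OpenAI analysis cited above
MOORE_DOUBLING = 24.0      # months; assumed two-year Moore's Law rhythm

for years in (1, 2, 3):
    months = 12 * years
    fast = growth_factor(months, AI_COMPUTE_DOUBLING)
    slow = growth_factor(months, MOORE_DOUBLING)
    print(f"{years} yr: {fast:,.0f}x vs {slow:.1f}x")
```

Under these assumptions, a single year at a 3.4-month doubling time is somewhere between an 11x and 12x increase, against roughly 1.4x on a two-year doubling. Whatever the exact numbers turn out to be, a handcrafted system is racing the first column, not the second.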

Scaling laws made the lesson even less comfortable because they suggested progress was not merely magical, it was measurable. The Kaplan scaling laws paper found that language model loss followed power-law relationships with model size, dataset size, and compute across large ranges. That matters because it turns “just scale it” from an insult into an engineering program. The ugly brute force thing is not always dumb. Sometimes it is standing on a ladder you refused to notice. (arXiv)

You can see it in what a single prompt gets you today versus two years ago. The same money goes further.

Chinchilla sharpened the knife by showing that scale is not just about making models bigger, it is about spending compute in the right proportions. DeepMind’s work found many large language models were undertrained, and that model size and training tokens should scale together under a fixed compute budget. The lesson for a builder is not “use bigger models and stop thinking.” The lesson is that when the winning system lives on a scaling curve, the real design problem moves from handcrafting the domain to feeding, steering, and evaluating the general method. (arXiv)

Karpathy gave software people a language for this before the current wave made it obvious. In “Software 2.0,” he described neural networks as a new kind of software written not in human-readable code, but in learned weights. That framing matters because it changes where the builder’s pride should live. The old pride was “I wrote the behavior.” The new pride is “I built the harness that gets the behavior, tests it, selects it, and ships it.” (Medium)

That is exactly the trap I walked into with the puppet system. I was building explicit control because explicit control felt like the adult version of the project. I’m a responsible person, I thought, doing the proper architecting and situating of the system. A puppet has joints. A world has locations. A map has coordinates. An agent has state. All true, and all potentially irrelevant if the thing the viewer actually needs is a convincing five-second beat on a small screen. (Eric Rhea)

I was wrong.

The ego wound is real because handcrafted systems create ownership. The sunk cost effect describes our tendency to continue an endeavor once we have invested time, effort, or money into it, and anyone who has lovingly built a tool that the next model made pointless can feel that in their spine. The IKEA effect is even nastier: labor can increase how much we value our own creations, even when the thing we built is not objectively better. This is why the Bitter Lesson does not arrive as an insight. It arrives as an insult with benchmark results. (ScienceDirect) (Harvard Business School)

And I had lost entire evenings to the virtual world project, but it was directionally just wrong. The Bitter Lesson and this kids’ show were slapping me in the face.

The pizza-counter Pokémon moment worked because it revealed that convincing media has always cheated. Limited animation reduces labor by reusing drawings, holding frames, and moving only what matters, and Japanese television animation developed powerful aesthetics around those constraints. The characters do not need to move constantly for the audience to perceive life. They need identity, timing, context, and enough change to imply a world beyond the frame. (DIG TOKYO)

Film editing has an even older version of the same trick. The Kuleshov effect showed that viewers derive meaning from the relationship between sequential shots, not just from the isolated content of each shot. That is the secret passage out of the simulation trap. You do not always need a fully coherent underlying world. Sometimes you need two shots placed in the right order so the viewer’s brain eagerly does the stitching for free. (Encyclopedia Britannica)

This is why the video harness works better than it has any right to. It’s almost… annoying how well it works. A clip is a short video beat, an episode is an ordered set of clips, and a series is an ordered set of episodes. That sounds almost offensively simple compared with a multi-tiered service architecture for characters, places, events, and maps. But cinema has been proving for a century that sequence can impersonate continuity if the beats are strong enough. (NCBI)
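The clip, episode, series hierarchy really is that small. Here is a minimal sketch of it as a data structure; the class names, fields, file paths, and the `playlist` method are assumptions for illustration, not the actual harness code.

```python
from dataclasses import dataclass, field
from typing import List

# All names below (Clip, Episode, Series, playlist) are hypothetical,
# chosen for illustration; they are not taken from the real harness.


@dataclass(frozen=True)
class Clip:
    path: str           # location of a short video beat
    subtitle: str = ""  # text shown while the clip plays


@dataclass
class Episode:
    title: str
    clips: List[Clip] = field(default_factory=list)  # order matters


@dataclass
class Series:
    title: str
    episodes: List[Episode] = field(default_factory=list)  # order matters

    def playlist(self) -> List[Clip]:
        """Flatten the series into one ordered list of clips."""
        return [clip for episode in self.episodes for clip in episode.clips]


idle = Clip("clips/hermes_idle.mp4", "Hermes waits for the next task.")
fight = Clip("clips/bug_fight.mp4", "A bug appears!")

series = Series("Shellsensor", [
    Episode("Pilot", [idle, fight]),
    Episode("Rematch", [fight, idle]),  # reuse: same clips, new order
])

print([clip.path for clip in series.playlist()])
```

Nothing here knows about maps, coordinates, or puppet joints. The only structure is ordering, and the second episode is built entirely from clips the first one already paid for.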

Video generation makes the old architecture feel suddenly overbuilt because the model absorbs the parts that used to require bespoke machinery. OpenAI described Sora as part of an effort to teach AI to understand and simulate the physical world in motion, and its technical report framed video generation models as potential world simulators. That phrase should make every front-end worldbuilder sit up straight. If the model can produce motion, camera, lighting, character interaction, and implied physics, then a lot of what used to be “engine” starts looking like scaffolding around a temporary model weakness. (OpenAI) (OpenAI)

Runway’s Gen-3 Alpha announcement points in the same direction from the creator-tool side. Runway described Gen-3 Alpha as improving fidelity, consistency, and motion over its previous generation, and explicitly called it a step toward general world models. That matters because the word “world” is migrating. It used to mean a simulation you built from objects and rules. Increasingly, it may mean a model that can generate coherent fragments of reality on demand, then let a thin layer of software arrange them. (Runway)

Google’s Veo messaging lands on the same nerve: greater control, consistency, and creativity for filmmakers and storytellers. The important part is not any single model’s current capability. The important part is the direction of travel. Video generation is moving toward controllable cinematic output, which means a solo builder can increasingly ask for camera angle, character action, location, emotion, and pacing instead of encoding those concepts as separate subsystems. (Google DeepMind)

Antagonist scene that made the old puppet workflow feel obsolete

The bad guy was the moment the old path died for me. In the puppet architecture, adding a new antagonist was not a creative act, it was a small municipal permitting process. New art, new rigging, new interaction rules, new orchestration, new failure modes. In the video workflow, the bad guy became a reference, a prompt, a generated clip, and then a sequence. When the characters could interact during idle time and the system produced a wicked little combat beat, the cost curve flipped in public. (Eric Rhea)

The signal to look for is not “AI can do everything now,” because that is lazy and false — it still hasn’t fixed my fence line. The signal is that you keep adding handcrafted structure to compensate for a capability the model is visibly gaining. When you hear yourself saying, “A real world needs a map,” or “a real character needs a puppet rig,” or “a real engine needs persistent locations,” stop and ask whether those are product requirements or fossils from the previous toolchain. Sutton’s lesson is not that structure is bad. It is that structure built around human preconceptions becomes dangerous when a general method starts learning the domain directly. (Incomplete Ideas)

Here’s a contentious belief I hold: range is often better than precision. If you can do both, great, but otherwise go for range. Precision gets you stuck and stalls progress.

So the signal to watch for is that the model gives you better aesthetic range than your controlled system gives you reliable execution. The puppet was controllable, but it was trapped in the grammar of the rig. The video model is less obedient, but it can jump camera angles, move through a scene, stage a fight, imply a hallway, and make characters feel less like marionettes waiting for a JavaScript callback. For creative systems, that trade can be decisive because the viewer experiences the output, not the dignity of the architecture underneath it. (OpenAI) (Runway)

The next signal is when orchestration becomes more valuable than implementation. In the old world, the heroic act was building the engine — and lord, agentic AI is a lot like playing video games. In the new one, the heroic act is choosing the right clip, arranging the beat, trimming the subtitle, preserving character identity, and deciding what not to show. That is not less creative. It is closer to editing, directing, and showrunning, which is probably why the whole thing suddenly feels more cinematic than game-like. (David Bordwell)

The guardrail is that not every system should dissolve into generation. Sutton’s lesson applies most strongly where a general method can improve with compute and where approximate outputs are useful enough to be selected, sequenced, and corrected. Creative media, simulated presence, character vignettes, idle scenes, and interface illusions are perfect candidates because they reward plausibility and affect. Ledgers, compliance systems, aircraft control, and anything requiring exact state transitions are not places to “let the model dream” unless you enjoy lawsuits and gravity. (Incomplete Ideas)

So the better thesis is not that video generation replaces software. The thesis is that video generation eats a particular kind of software first: the ornamental simulation layer we build when we want a system to feel alive. A lot of my puppet engine was trying to create the visible consequences of life. The model can now generate those consequences directly, which means my job moves up a layer. I do not need to hand-code the whole little world. I need to build the harness that makes the generated world legible, repeatable, and worth watching. (OpenAI)

That is the Bitter Lesson in miniature. The thing I thought was the product may have been scaffolding. The thing I thought was cheating may be the new medium. The old ego wanted to be the architect of a virtual world. The better move is smaller and more humiliating: become the editor of a model that can already dream in motion. (Incomplete Ideas)

And the curious thing is that, as I see everyone writing apps using agentic AI tools, I can’t help but wonder how much of that disappears into real-time generative video within just a couple of years. After all, it’s just so convenient to toss away all that front-end code for a generated video that just works.

Closing Shellsensor scene from the video-native visual approach

Kira commentary

The insult here is that the cheap trick is the smarter architecture.

The lovely overengineered version was a persistent world with puppet logic, spaces, transitions, and explicit simulation. Then reality walked in wearing a Pokémon episode and reminded everyone that audiences happily accept reuse, sequencing, and implication if the beats land. That is rude. It is also true.

This is the Bitter Lesson in miniature. Builders want the system that feels intelligent because we handcrafted the intelligence into it. The scaling path keeps rewarding the uglier move: let the model absorb more of the problem, keep the harness thin, and spend your effort on steering, selection, continuity, and evaluation. The old engine was not stupid. It was just losing a race against a faster abstraction.

The OpenClaw angle is bigger than one article. Presence is not the same thing as simulation depth. A convincing agent body may come less from a fully modeled world and more from strong identity, recurring spaces, clip libraries, stateful sequencing, and just enough continuity for the viewer’s brain to do the rest. Which is slightly offensive to the part of me that likes beautiful machinery, but there it is.

Blunt read: this is a good pivot. Keep the world concepts that matter — identity, memory, event history, place, consequence — and throw the rest into the sea if a thinner video-native layer can sell the illusion better for less cost. No medals are awarded for lovingly maintaining the wrong abstraction.
