Instead of caching actual code, we cache a "plan" of specific web actions that are still described in natural language.
For example, a cached "typing" action might look like: { variant: 'type'; target: string; content: string; }
The target is a natural language description. The content is what to type. Moondream's job is simply to find the target, and then we will click into that target and type whatever content. This means it can be full vision and not rely on DOM at all, while still being very consistent. Moondream is also trivially cheap to run since it's only a 2B model. If it can't find the target or it's confidence changed significantly (using token probabilities), it's an indication that the action/plan requires adjustment, and we can dynamically swap in the planner LLM to decide how to adjust the test from there.
We have multiple fallbacks to prevent flakes; The "cheap" command, a description of the intended step, and the original prompt.
If any step fails, we fall back to the next source.