I’ve had a similar experience building a geometry/woodworking-flavored web app with Three.js and SVG rendering. It’s been kind of wild how quickly the SOTA models let me get into a new area like spatial development and 3D rendering (or SA optimization approaches, for that matter). That said, they still make easy "3D app" mistakes like flipping the z-axis or misreading coordinate conventions. But these models make similar mistakes with CSS and page awareness. Both require good verification loops to be effective.
reply
I think there is a pattern. It has a hard time with temporal and spatial reasoning.

Temporal. I had a research project where the LLM had no concept of preventing data from the future from leaking in. I eventually had to create a wall clock and an agent that would step through every line of code, write out that line's logic, and explain why no data from beyond the wall clock's time could leak in.
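A minimal sketch of what a wall-clock guard like that could look like, assuming timestamped observations. The names `WallClock`, `Observation`, `visible`, and `assert_no_leak` are hypothetical and are not taken from the project described above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """Hypothetical record: one data point and the time it became known."""
    timestamp: datetime
    value: float


class WallClock:
    """Hypothetical simulated clock: the only source of 'now' the code may use."""

    def __init__(self, now: datetime) -> None:
        self.now = now

    def advance_to(self, new_now: datetime) -> None:
        # Time only moves forward, so earlier steps can never see later data.
        if new_now < self.now:
            raise ValueError("wall clock cannot move backwards")
        self.now = new_now

    def visible(self, observations):
        """Return only observations stamped at or before the current time."""
        return [o for o in observations if o.timestamp <= self.now]

    def assert_no_leak(self, observations) -> None:
        """Raise if any observation is stamped after the current time."""
        leaked = [o for o in observations if o.timestamp > self.now]
        if leaked:
            raise AssertionError(
                f"{len(leaked)} observation(s) are from after {self.now}"
            )
```

Every function that touches data would take its inputs through `visible(...)` and call `assert_no_leak(...)` on whatever it is about to use, so a leak fails loudly instead of quietly inflating results.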

Spatial. I created a canvas for rendering a thinking model's attention and feedforward layers for data visualization animations. It was having a hard time working with it until I pointed Opus 4.7 to some ancient JavaScript code [0] about projecting 3D to 2D, found after searching GitHub repositories. After that it worked perfectly with pan and zoom in one shot.
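For readers who haven't seen it, the core of a 3D-to-2D projection is small. Below is a minimal pinhole-style sketch, assuming a camera that looks down the +z axis with y pointing up in world space and down on screen. The function name `project` and its parameters are illustrative and are not copied from the linked article.

```python
def project(point, camera, depth, width, height):
    """Project a 3D world point onto 2D screen pixels.

    point, camera: (x, y, z) tuples in world space.
    depth: distance from the camera to the projection plane (assumed parameter).
    Returns (screen_x, screen_y, scale), or None if the point is behind the camera.
    """
    # Translate into camera space.
    x, y, z = (p - c for p, c in zip(point, camera))
    if z <= 0:
        return None  # at or behind the camera, nothing to draw

    # Perspective divide: farther points shrink toward the screen centre.
    scale = depth / z
    screen_x = width / 2 * (1 + scale * x)
    # Screen y grows downward, so the world y axis is flipped here.
    screen_y = height / 2 * (1 - scale * y)
    return screen_x, screen_y, scale
```

With this convention, stacked layers at increasing z come out progressively smaller, and drawing them from largest z to smallest gives back-to-front ordering. The sign flip on y is exactly the kind of convention that is easy to get backwards.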

But no matter how hard I tried, I couldn't get it to stack all the layers correctly. It must have just remembered the parts for projecting 3D to 2D rather than understood them, because it could not figure out how to position the layers.

There is a ton of information burnt into the weights during training, but it cannot reason about it. When it does work well with spatial and temporal problems, it is more sleight of hand than an ability to generalize.

People say, why not just do reinforcement learning? That can't generalize in the same way an LLM can. I'm thinking about doing the Rubik's Cube because if people can solve that, it might open up solutions for working on temporal and spatial problems.

[0] https://jakesgordon.com/writing/javascript-racer-v1-straight...

reply
> I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.

Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!

reply
Interesting (would like to hear more), but solving a Rubik's Cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.
reply
> what's needed is knowledge of solution techniques

That's definitely in the training data

reply
What about a model designed for robotics and vision? Seems like an LLM trained on text would inherently not be great for this.

DeepMind's other models, however, might do better?

reply
How are you handing the cube state to the model?
reply
Does this answer the question?

Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.

https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...

edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in 3 dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend, yet the LLM can't solve it. However, given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.

reply
I wonder if the difficulties LLMs have with “seeing” complex detail in images are muddying the problem here. What if you hand it the cube state in text form? (You could try ASCII art if you want a middle ground.)

If you want to isolate the issue, try getting the LLM itself to turn the images into a text representation of the cube state and check for accuracy. If it can’t see state correctly it certainly won’t be able to solve.
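A rough sketch of what that check could look like, assuming the cube is encoded as a 54-character string, nine stickers per face, in the face order U, R, F, D, L, B. That encoding and the function names are assumptions for illustration, not the format used in the gist above.

```python
from collections import Counter

FACES = "URFDLB"  # assumed face order: Up, Right, Front, Down, Left, Back


def is_plausible_state(state: str) -> bool:
    """Cheap sanity check: 54 stickers, six colours, nine of each.

    This does not prove the state is reachable on a real cube.
    """
    counts = Counter(state)
    return (
        len(state) == 54
        and set(counts) == set(FACES)
        and all(n == 9 for n in counts.values())
    )


def transcription_accuracy(truth: str, transcribed: str) -> float:
    """Fraction of sticker positions the model transcribed correctly."""
    if len(truth) != len(transcribed):
        return 0.0
    matches = sum(a == b for a, b in zip(truth, transcribed))
    return matches / len(truth)


def ascii_net(state: str) -> str:
    """Render the state as an unfolded net: U on top, L F R B in a row, D below."""
    face = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate(FACES)}

    def row(f: str, r: int) -> str:
        return face[f][r * 3:(r + 1) * 3]

    lines = ["    " + row("U", r) for r in range(3)]
    lines += [" ".join(row(f, r) for f in "LFRB") for r in range(3)]
    lines += ["    " + row("D", r) for r in range(3)]
    return "\n".join(lines)
```

The idea would be to generate a scrambled state in code, show the model the image or the `ascii_net` rendering, ask it for the 54-character string back, and score it with `transcription_accuracy` before asking it to solve anything.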

reply
Can't they write a script to solve Rubik's Cubes?
reply
That doesn't test whether the model can follow and execute a dynamic plan reliably.
reply
I bet I can even do it with the smallest Gemma 4 model using a prompt of at most 500 characters.
reply
*yet
reply