I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.

The right thing to benchmark against isn't a regular transformer; it's a transformer that writes programs which are then interpreted. They have a little visual demo where their approach looks faster, but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.

I spent the whole article thinking: wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is", which doesn't say anything to me.

Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network. So the interesting part is where they say you can backprop through it. But backprop is for cases where we don't know how to encode a function using strict logic, so why would you try to backprop through a Sudoku solver? It's probably that my imagination is just limited, but I could have used more on that.
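To make the backprop objection concrete, here is a minimal sketch (plain Python, numerical gradients; the function names are mine, not from the article) of why gradients through a hard, rule-based check are useless, while a smooth relaxation of the same check gives a usable learning signal:

```python
import math

def num_grad(f, x, h=1e-5):
    # Central-difference numerical gradient.
    return (f(x + h) - f(x - h)) / (2 * h)

# A "strict logic" check: 1.0 if x rounds to the target digit, else 0.0.
# Piecewise-constant, so its gradient is zero almost everywhere --
# backprop through it gives an optimizer nothing to follow.
def hard_match(x, target=5):
    return 1.0 if round(x) == target else 0.0

# A smooth relaxation of the same check: peaks at the target and
# falls off continuously, so the gradient points toward the answer.
def soft_match(x, target=5):
    return math.exp(-(x - target) ** 2)

print(num_grad(hard_match, 3.0))  # 0.0 -- no learning signal
print(num_grad(soft_match, 3.0))  # positive -- points toward 5
```

A Sudoku solver is all hard_match-style logic, which is exactly why "you can backprop through it" needs more explanation than the article gives.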

reply
Benchmark it against a fast Python interpreter optimized for AI tool calling, like Monty: https://github.com/pydantic/monty
reply
Did you read the post you are responding to? It says:

> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?

The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?"

There are no details about training, nor about the (almost certainly novel) loss function that would be needed to handle partial or imperfect outputs here, so it is extremely hard to believe that any kind of gradient-based training procedure was used to set the weight values.

reply
> There are no details about training

My understanding was that they are not training at all, which would explain that. They are compiling an interpreter down to a VM that has the shape of a transformer.

I.e., they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.
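As a toy illustration of calculating weights rather than training them (my own construction, not taken from the article): a one-layer "network" whose weight matrix is built by hand as a lookup table, so that a matrix multiply over one-hot state vectors executes a fixed operation, here increment-mod-8:

```python
# Hand-constructed weights: W[i][j] = 1 exactly when op(i) == j, so a
# one-hot "state" vector times W executes op -- no training, no loss
# function; the weights are computed directly from the desired machine.
N = 8
op = lambda i: (i + 1) % N  # the operation we "compile" in: increment mod 8

W = [[1.0 if op(i) == j else 0.0 for j in range(N)] for i in range(N)]

def step(state):
    # One "layer": state @ W, written out with plain Python lists.
    return [sum(state[i] * W[i][j] for i in range(N)) for j in range(N)]

s = [0.0] * N
s[3] = 1.0          # one-hot encoding of state 3
s = step(s)
print(s.index(1.0))  # 4 -- the matmul performed the increment
```

Real transformer weights for a full interpreter would obviously be far more involved (attention, residual streams, etc.), but the principle is the same: the weights are derived, not learned.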

reply
This is my interpretation as well.

EDIT: Actually, they do make this clear(ish) at the very end of the article, technically. But there is a huge amount of vagueness and, IMO, outright misleading / deliberately deceptive stuff early on (e.g. about the potential differentiability of their approach, even though they admit later that they aren't sure the differentiable approach can actually work for what they are doing). It is hard to tell what they are actually claiming unless you read it with a lawyer's precision, but that's likely due to a lack of human editing and too much AI assistance.

reply