Diffusion language models seem poised to smash purely autoregressive models. I'm giving it 1-2 years.
reply
One appeal of it is for RL. If it ends up being a lot faster for generation, you'll be able to do a lot more RL.

If people can make RL scalable (make it so that RL isn't just a final phase, but something as big as the supervised stuff), then diffusion models are going to have an advantage.

If not, I think autoregressive models will still be preferred. Diffusion models' outputs become fixed very fast; they can't actually refine them, so we're not talking about some kind of refinement along the lines of: initial idea -> better idea -> something actually sound.

reply
RL suffers both from reward sparsity across a huge number of dimensions and from convergence brittleness: it's extremely difficult to get it to converge.
reply
> If not, I think autoregressive models will still be preferred. Diffusion models' outputs become fixed very fast; they can't actually refine them, so we're not talking about some kind of refinement along the lines of: initial idea -> better idea -> something actually sound.

I'm really curious about this. I'm but a simple client developer, so I don't actually grok some of the differences.

For lack of a better word, there's a "normie" position that "omg diffusion means it can edit!!111! big unlock!" I think that's cute, but I also don't see it as intuitively correct, and I guess I don't even know why I don't see it that way. But regardless, it sounds like I'm correct there.

> If not, I think autoregressive models will still be preferred.

But here I get lost: at least so far, diffusion models seem strictly faster, and on par in quality with autoregressive models of the same parameter count.

If that is the case, why would autoregressive models still be preferred?

Asking this also makes me realize I am treating "diffusion models are better" as a premise, if I'm asserting they're always faster and ~same quality...

reply
Feels like the sodium ion battery vs lithium ion battery thing, where there are theoretical benefits of one but the other has such a head start on commercialization that it'll take a long time to catch up.
reply
Not really. Unlike with physical goods like batteries, the hardware for training a diffusion vs an autoregressive language model is more or less exactly the same.

That said, the lab that did this research (Chris Re and Tri Dao are involved) is run by some of the world's experts in squeezing every last drop of performance out of CUDA and Nvidia hardware.

At the API level, the primary difference will be the addition of text-infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy, or stories where you need to think of the punchline or the ending first!).
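For intuition on what infill buys you: a diffusion LM can condition on text on *both* sides of a gap and fill it in by iteratively unmasking, which is awkward for a strictly left-to-right model. A toy sketch of that loop (the "denoiser" here is a random stand-in, not any real model API):

```python
import random

MASK = "_"

def toy_denoise_step(tokens, vocab=("the", "cat", "sat", "on", "mat")):
    # Stand-in for one denoising step: unmask a single masked position.
    # A real diffusion LM would predict this token; we pick randomly.
    out = list(tokens)
    masked = [i for i, t in enumerate(out) if t == MASK]
    if masked:
        out[random.choice(masked)] = random.choice(vocab)
    return out

def infill(prefix, suffix, n_gap):
    # Both the prefix and the suffix condition the generation from the
    # very first step; the gap starts fully masked and is filled in.
    seq = list(prefix) + [MASK] * n_gap + list(suffix)
    while MASK in seq:
        seq = toy_denoise_step(seq)
    return seq

filled = infill(["the", "cat"], ["the", "mat"], n_gap=2)
print(filled)  # e.g. ['the', 'cat', 'sat', 'on', 'the', 'mat']
```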

reply
Same with digital vs analog
reply
Digital came later but beat analog at almost everything?
reply
Didn't thinking tokens resolve the most problematic part of autoregressive models (the first few tokens set constraints the model can't overcome later), and give them a massive advantage over diffusion models by exposing the thinking trace? I can see diffusion models being used as draft models to quickly predict a bunch of tokens, letting the autoregressive model decide to use them or throw them away quickly, speeding it up considerably while keeping thinking traces available.
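The draft-and-verify idea described above is essentially speculative decoding. A minimal sketch of the accept/reject loop, with toy stand-in "models" over a digit vocabulary (no real LLM APIs, and a deliberately corrupted draft token to show the rejection path):

```python
def draft_propose(context, k):
    # Pretend fast draft model: proposes k tokens in one shot.
    # The third token is deliberately corrupted to simulate a mistake.
    toks = [(context[-1] + i + 1) % 10 for i in range(k)]
    if len(toks) >= 3:
        toks[2] = (toks[2] + 5) % 10  # simulated draft error
    return toks

def target_next(context):
    # Pretend slow autoregressive target model: greedy next token.
    return (context[-1] + 1) % 10

def speculative_step(context, k=4):
    """Accept the draft's tokens while the target agrees; at the first
    disagreement, keep the target's own token and stop."""
    accepted = []
    for tok in draft_propose(context, k):
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)       # verified draft token
        else:
            accepted.append(expected)  # corrected by the target
            break
    return accepted

print(speculative_step([3]))  # -> [4, 5, 6]: two drafts accepted, third corrected
```

The win is that the target model can verify a whole block of draft tokens in one forward pass instead of generating them one at a time, while the final output (including any thinking trace) is still the target's.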
reply
The reason I mentioned "purely autoregressive" is that realistically I expect hybrid diffusion + autoregressive models to be the first popular diffusion models. I could be wrong though. And diffusion models have other tricks like really easy integration with simple classifiers.
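The classifier trick mentioned above is usually called classifier guidance: at each denoising step, bias the model's token distribution with a separate classifier's scores. A toy sketch with made-up logits (no real model; the scale parameter is illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def guided_distribution(model_logits, classifier_logits, scale=2.0):
    # Classifier guidance (toy): add the classifier's per-token scores
    # to the denoiser's logits, steering sampling toward whatever
    # attribute the classifier was trained to detect.
    return softmax([m + scale * c
                    for m, c in zip(model_logits, classifier_logits)])

# The model slightly prefers token 0; the classifier strongly prefers
# token 1, and guidance flips the choice.
p = guided_distribution([1.0, 0.5], [0.0, 1.5], scale=2.0)
print(p.index(max(p)))  # -> 1
```

Because the classifier only needs to score partially denoised text, it can be small and trained separately from the generator, which is what makes the integration cheap.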

Check out this paper where they use diffusion during inference on the autoencoded prediction of an autoregressive model: https://openreview.net/forum?id=c05qIG1Z2B

reply