Still impressive nonetheless given its artificially generated training sets.
Beats nano banana 1 but not yet competitive with 2 or seedance2, grok imagine, etc.
This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.
Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.
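For anyone trying to picture the two-tower split, here is a rough toy sketch in Python. It is not the release's code: `reasoner_tower`, `generator_tower`, the two-number conditioning vector, and the simple interpolation rule standing in for a denoiser are all made up for illustration. It only shows the data flow described above, where the reasoner produces a conditioning signal first and the generator then refines noise step by step under that signal.

```python
import random


def reasoner_tower(observation):
    """Hypothetical stand-in for the VLM reasoner.

    Compresses an observation (a list of floats) into a tiny conditioning
    vector. A real reasoner would emit learned embeddings; this just
    computes two summary statistics.
    """
    mean = sum(observation) / len(observation)
    spread = max(observation) - min(observation)
    return [mean, spread]


def generator_tower(condition, dim=4, steps=10, seed=0):
    """Hypothetical stand-in for the diffusion-based generator.

    Starts from Gaussian noise and iteratively moves the sample toward a
    target derived from the conditioning vector. A real diffusion model
    would use a learned denoiser and a noise schedule; the interpolation
    below is an assumption chosen only to keep the example runnable.
    """
    rng = random.Random(seed)
    # Target the "denoiser" steers toward, derived purely from the condition.
    target = [condition[i % len(condition)] for i in range(dim)]
    # Initial sample is pure noise, as in diffusion sampling.
    sample = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(steps):
        # Step size grows to 1.0 on the final step, so the last update
        # lands on the target.
        alpha = 1.0 / (steps - t)
        sample = [x + alpha * (g - x) for x, g in zip(sample, target)]
    return sample


# Reasoning happens first; generation is conditioned on its output.
condition = reasoner_tower([1.0, 2.0, 3.0])
future = generator_tower(condition)
print(condition, future)
```

The point of the split is visible even in the toy: the generator never sees the raw observation, only what the reasoner chose to pass along.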
This sort of approach (and others I've seen like it) always appeals to my inner engineer: trying to optimize and balance tradeoffs between model architectures and combine two things to yield the best of both worlds.
But based on my understanding of the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), this is precisely the wrong approach in the long term. I'm linking the actual text of the Bitter Lesson because I think it's misunderstood (or I just don't agree with how I've seen it used in discourse). Specifically:
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
This architecture feels specifically like "trying to build knowledge into the agent that will help in the short term" but will plateau long term. That's not to say that there won't be some interesting learnings or things built on top of it, but I doubt that there's a lot of juice to squeeze with this kind of approach IMO.
We can technically reason at pixel- or char-level encodings, but it’s going to be much more expensive generally. Think of the overall technique as a way to get the computer to go faster.
You see it with Qwen talker, most multimodal projectors, etc.
The rest I can't speak to.
Looking forward to trying this out on my $10000+ workstation-grade GPU that I need an equally expensive setup to run.
> Generates future observations and action sequences.
Is that just a complicated way of saying video gen?