Nvidia Cosmos 3

(developer.nvidia.com)

101 points by tosh 3 hours ago | 19 comments

by aabdi 3 hours ago

SOTA open-source model for image and video generation. It beats all others but, at 64B parameters, is too big to run on most people's computers.

Still impressive, given its artificially generated training sets.

It beats Nano Banana 1 but is not yet competitive with Nano Banana 2, Seedance 2, Grok Imagine, etc.

by xnx 2 hours ago

Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.

by mangoman 1 hour ago

  This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers. 
  Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
  Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.
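A toy sketch of the data flow the quote describes, for anyone trying to picture it. This is not Cosmos code: `reasoner` and `generator` are hypothetical names, the "context" is just two numbers, and the refinement loop is a deterministic stand-in for a diffusion sampler, not diffusion proper. The only point it makes is that the generator starts from noise and is steered by what the reasoner produced.

```python
import random


def reasoner(observation: list[float], instruction: str) -> list[float]:
    """Hypothetical stand-in for the reasoner tower: condense the inputs
    into a small context vector (here: mean observation, word count)."""
    mean = sum(observation) / len(observation)
    return [mean, float(len(instruction.split()))]


def generator(context: list[float], steps: int = 50, size: int = 4,
              seed: int = 0) -> list[float]:
    """Hypothetical stand-in for the generator tower: start from random
    noise and refine it step by step toward a target set by the context."""
    rng = random.Random(seed)
    target = context[0]
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        # Each step removes 20% of the remaining gap to the target.
        x = [xi + 0.2 * (target - xi) for xi in x]
    return x


context = reasoner([0.2, 0.4, 0.6], "pick up the red cube")
frames = generator(context)
```

Changing the observation or instruction changes `context`, and with it the generated output; the generator never sees the raw inputs in this sketch.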

This sort of approach (and others I've seen like it) always appeals to my inner engineer: trying to optimize and balance tradeoffs between model architectures, and combining two things to yield the best of both worlds.

But based on my understanding of the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), this is precisely the wrong approach in the long term. I'm linking the actual text of the Bitter Lesson because I think it's misunderstood (or I just don't agree with how I've seen it used in discourse). Specifically:

  The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

This architecture feels specifically like "trying to build knowledge into the agent that will help in the short term" but will plateau in the long term. That's not to say that there won't be some interesting learnings or things built on top of it, but I doubt that there's a lot of juice to squeeze with this kind of approach, IMO.

by aabdi 1 minute ago

This is mostly a decompression step, and it's fairly standard nowadays. The point is to get the data from the internal compressed version into the human-usable version.

We can technically reason over pixel- or character-level encodings, but it's generally going to be much more expensive. Think of the overall technique as a way to make the computer go faster.

You see it with Qwen talker, most multimodal projectors, etc.
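To put rough numbers on the cost argument: self-attention compares every token with every other token, so its cost grows with the square of the sequence length. The sketch below compares one token per pixel against one token per patch for a single image. The 256x256 image size and the 16x16 patch size are illustrative assumptions, not figures from the Cosmos release.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token comparisons in one full self-attention pass."""
    return num_tokens * num_tokens


# Assumed sizes, chosen only for illustration.
height = width = 256
patch = 16

pixel_tokens = height * width                          # one token per pixel
latent_tokens = (height // patch) * (width // patch)   # one token per patch

ratio = attention_pairs(pixel_tokens) // attention_pairs(latent_tokens)
print(pixel_tokens, latent_tokens, ratio)
```

Under these assumptions the pixel-level sequence is 256 times longer, and attention over it needs 65,536 times as many comparisons.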

by 3PS 51 minutes ago

This feels like the opposite to me? The MoT architecture looks like the ideal that the Bitter Lesson alludes to - just take all of your data in all of your formats (audio, image, text, action, video) and dump it all into a single shared latent space. Then let the model sort things out, with just enough structure to handle the different requirements/output formats needed (e.g. autoregressive stuff for sequence modeling/prediction, diffusion stuff for generation).

by samuelknight 13 minutes ago

Except this model has a broader domain than text-only LLMs. Broader than the old omni models too, since it takes video input. The architecture is exotic, but I don't see tuning here that is more extreme than in the open models released every day.

by BugsJustFindMe 1 hour ago

The warehouse safety video example is really funny, because the people don't react at all.

by sqeak 44 minutes ago

The car video is silly as well: the crossing van clearly runs a red light. The big shadow of the light pole in the intersection also makes no sense...

by timschmidt 15 minutes ago

Cars run red lights in real life. Driving defensively requires anticipating it. Anyone expecting them not to is more likely to get in a crash.

The rest I can't speak to.

by darth_avocado 2 hours ago

> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.

Looking forward to trying this out on my $10,000+ workstation-grade GPU, which needs an equally expensive setup to run.

by Gracana 28 minutes ago

I have the GPU but no robot. What’s the minimum viable robot needed to play with this?

by thewebguyd 1 hour ago

Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.

by causal 2 hours ago

I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

by swiftcoder 2 hours ago

As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples purely analyses an existing video; the other predicts (i.e. video gen) a video from a static image.

by heliosAtwork 1 hour ago

It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots to people's homes.

by derac 2 hours ago

Look at the table of supported modalities. It can take image/video/text/actions as input and output image/video/text/actions.

by causal 1 hour ago

That just raises more questions. What kind of "observation or action" image does an input generate? What is an action output if it's not text?

by ainch 2 hours ago

You can fine-tune it so that, given an image and a task description, it generates a corresponding set of actions.

by sosodev 1 hour ago

Most of the examples they've chosen seem... not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.

by overfits-ai 1 hour ago

[flagged]

by kushagra1211 3 hours ago

[flagged]