upvote
"But how to they fully grasp natural language to be able to perform tasks worded unexpectedly, which would be easy to parse, if they understood natural language?"

A Large Language Model. Pardon me for spelling out the full acronym, but it is what it is for a reason.

I think a lot of the whiz-bang applications of LLMs have drowned it out, but LLMs are effectively the solution to the long-standing problem of natural language understanding, and that alone would be enough to make them a ground-breaking technology. Taking English text and translating it with very high fidelity into the vector space these models understand is amazing and I think somewhat underappreciated.

reply
Yes, the newer image and video editing models have an LLM bolted onto them. The rich embeddings from the LLM are fed into a diffusion transformer (DiT) alongside a tokenized version of the input image. These two streams “tell” the model what to do.
reply