They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.
This allows for "variable depth" in practice. Composition takes earlier images as input, and each of those may itself have been generated from scratch or composed from still earlier images.
There's an entire line of work that goes "the brain is trying to approximate backprop with local rules, poorly", with some interesting findings to back it up.
Now, it seems unlikely that the brain has a single neat "loss function" that could account for all of the learning behaviors across it. But that doesn't preclude deep learning either. If the brain's "loss" is an interplay of many local and global objectives of varying complexity, it can still be a deep learning system at its core. Still doing a form of gradient descent, with non-backpropagation credit assignment and all. Just not the kind of deep learning system any sane engineer would design.
Predictive coding is more biologically plausible because it uses local information from neighbouring neurons only.
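To make "local" concrete, here is a toy sketch, not a reference implementation: a scalar chain input -> hidden -> output with two weights, trained to fit y = 2x. The function name, the data, and the hyperparameters (`lr`, `dt`, `steps`) are all my own assumptions for illustration. The point is that each weight update uses only the prediction error at its own layer and the activity directly feeding it; no error signal is passed backwards through the whole network.

```python
def train_predictive_coding(data, epochs=200, lr=0.05, dt=0.1, steps=50):
    """Toy predictive coding on a scalar chain x0 -> x1 -> y (hypothetical setup)."""
    w0, w1 = 0.5, 0.5  # arbitrary starting weights (assumption)
    for _ in range(epochs):
        for x0, y in data:
            # Input and output are clamped; the hidden unit starts at its prediction.
            x1 = w0 * x0
            # Inference: relax the hidden activity to reduce local prediction errors.
            for _ in range(steps):
                e1 = x1 - w0 * x0   # error at the hidden unit
                e2 = y - w1 * x1    # error at the output unit
                x1 += dt * (-e1 + w1 * e2)
            # Learning: each weight sees only its own layer's error
            # and the activity of the unit feeding into it.
            e1 = x1 - w0 * x0
            e2 = y - w1 * x1
            w0 += lr * e1 * x0
            w1 += lr * e2 * x1
    return w0, w1


# Made-up training pairs for y = 2x.
DATA = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]

w0, w1 = train_predictive_coding(DATA)
print("learned slope:", w0 * w1)  # should end up close to 2
```

In this linear toy case the relaxed errors turn out to be scaled versions of what backprop would compute, which is roughly the sense in which these schemes "approximate backprop with local rules".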
It is probably coming. I get the impression, just from following the trend of progress, that internal world models are the hardest part. I was playing with Gemma 4 and it seemed to have a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning, when starting part-way through at the point where it was already at house #2. It figured it out, but it seemed to be working so hard at the concept that it was really a bit comical.
It looks like that issue is solving itself as text & image models start to unify and they get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that pulling an image fully formed from the aether seems to be statistically easier than putting it together piece by piece.
What kind of sadist would wish this on an intelligent entity?