Consistency? So it fails less often?
Based on the released images (especially the one "screenshot" of the Mac desktop), I feel like the best images from this model are so visually flawless that the only way to tell they're fake is by reasoning about the content of the image itself (e.g. "Apple never made a red iPhone 15, so this image is probably fake" or "Costco prices never end in .96, so this image is probably fake").
Especially when it comes to detailed outputs or non-standard prompts.
I do believe it will get even better - not sure it will happen within a year, but I wouldn't be incredibly surprised if it did.
I experimented with procedural generation of Waldo-style scavenger images using Flux models, with rather disappointing (if unsurprising) results.
If you asked me what I expected, since this one has "thinking," it'd be that it would have thought to generate the image without Waldo first, then insert Waldo somewhere into that image as an "edit."
It doesn't reliably give you 10 slices, even if you ask it to number them. None of the frontier models seem to be able to get this right.
That's because you're focusing a little too much on visual fidelity. It's still relatively trivial to create a moderately complex prompt and have it fail miserably.
Even SOTA models only scored 12 out of 15 on my benchmarks, and that was without me deliberately trying to "flex" and break the model.
Here's one I just came up with:
A Mercator projection of Earth where the land and oceans are inverted (i.e. land = ocean, and ocean = land).