Inspired by this, I tried something much simpler. I asked it to draw 12 concentric circles. Across three tries, it drew 10 every time. https://chatgpt.com/share/69e87d08-5a14-83eb-9a3b-3a8eb14692...
It can't get that in one shot. Perhaps, though, it could figure out when it needs to break a problem into individual tasks, delegate those to itself, and assemble them at the end.
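The appeal of this prompt is that the decomposed version is trivial and the result is mechanically checkable. A minimal sketch (the radii and SVG layout are arbitrary choices, not from the original prompt) of "one circle per subtask":

```python
# Emit 12 concentric circles as SVG, one <circle> element per radius.
# Each radius is a self-contained "subtask"; the final count is checkable.
RADII = range(10, 130, 10)  # 12 radii: 10, 20, ..., 120

circles = "\n".join(
    f'  <circle cx="150" cy="150" r="{r}" fill="none" stroke="black"/>'
    for r in RADII
)
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="300">\n'
    f"{circles}\n</svg>"
)

# Verifying the output is just counting elements.
print(svg.count("<circle"))
```

The point isn't that the model should write this exact code, but that a task with a countable success criterion is easy to verify piecewise, which one-shot raster generation apparently isn't.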
Color charcoal drawings do exist, but they're not what's usually meant by a "charcoal drawing".
(source: https://chatgpt.com/share/69e83569-b334-8320-9fbf-01404d18df...)
Artistic oddities aside (why are the 8-bit sprites 16-bit? why do the charcoal drawings have colour? why does the art of specifically the Gen 1 Pokémon look so off?), 271 is Lombre, not Lotad.
I have more subjective prompts for testing reasoning, but they're your-mileage-may-vary. (That said, gpt-2-image has surprisingly been doing much better on the more objective criteria in my test cases.)
We have enough people complaining about Simon Willison's pelican test.
Try things like: "A white capybara with black spots, on a tricycle, with 7 tentacles instead of legs, each tentacle a different color of the rainbow" (paraphrased, not the literal exact prompt I used).
Gemini just glommed a whole mass of tentacles together without any regard for the count.
This example image was generated using the API on high reasoning, not the low-reasoning version. (It's slow and takes about 2 minutes lol.)
The reasoning amount is part of the evaluation, isn't it?