I often have to make very specific edits while keeping the rest of the image intact and haven't yet found a good model. These are typically abstract images for experiments.
I asked GPT-Image-2 to recolor specific scales of your Seedream 4 snake and change the shape of others. It did very poorly.
I don’t know how much work it is for you, but one thing a lot of people do, myself included, is take the original image, make a change to it using something like NB, then paste that as the topmost layer in something like Krita/Pixelmator. After that, we’ll mask and feather in only the parts we actually want to change. It doesn’t always work if the edit shifts the overall color balance or filters out certain hues, and it can be a real pain, but it does the job in some cases.
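A minimal sketch of that mask-and-feather step using Pillow instead of a GUI editor. The solid-color images, the ellipse region, and the blur radius are all stand-ins for illustration, not anyone's actual workflow:

```python
from PIL import Image, ImageDraw, ImageFilter

# Stand-ins for real files; in practice you'd Image.open() the original
# and the NB-edited version instead of creating flat colors.
original = Image.new("RGBA", (400, 300), (40, 40, 40, 255))
edited = Image.new("RGBA", (400, 300), (200, 60, 60, 255))

# Black = keep the original, white = take the edited layer.
mask = Image.new("L", original.size, 0)
draw = ImageDraw.Draw(mask)
draw.ellipse((120, 80, 300, 240), fill=255)  # the region we actually want changed

# Feather the mask edge so the edit blends in instead of showing a hard seam.
mask = mask.filter(ImageFilter.GaussianBlur(radius=12))

result = Image.composite(edited, original, mask)
result.save("composited.png")
```

The feathering is the part that hides a shifted color balance at the boundary; a hard-edged mask makes any hue mismatch obvious.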
The Flux models (like Kontext) are actually surprisingly good at making very minimal changes to the rest of the image, but unfortunately their understanding of complex prompts is much weaker than the closed, proprietary models.
I will say that I’ve found Gemini 3.0 (NB Pro) does a relatively decent job of avoiding unnecessary changes, sometimes even exceeding the more recent NB2, and it scored quite well on comparative image-editing benchmarks.
It can be (slowly) run at home, but it needs 96 GB RTX 6000-level hardware, so it is not very popular.
Here's ZiT, GPT-Image-2, and Hunyuan Image 2 for reference:
https://genai-showdown.specr.net/?models=hy2,g2,zt
Note: it won't show up in some of the newer image comparisons (Angelic Forge, Flat Earth, etc.) because it's been deprecated for a while, but in the tests where it was used (Yarrctic Circle, Not the Bees, etc.) it's pretty rough.
Ring toss: https://i.imgur.com/Zs6UNKj.png (arguably a pass)
9-pointed star: https://i.imgur.com/SpcSsSv.png (star is well-formed but only has 6 points)
Mermaid: https://i.imgur.com/R6MbMPX.png (fail, and I can't get Imgur to host it for some reason even though it's SFW)
Octopus: https://i.imgur.com/JTVH7xy.png (good try, almost a pass, but socks don't cover the ends of all the tentacles)
Above are one-shot attempts with seed 42.
You're killing me, Smalls. This one is a 404. I'm really curious what it actually showed.
That ring toss is definitely leagues better than its predecessor. I’m not going to fault it too much for the star, though; that one is an absolute slate wiper. The only locally hostable model that ever managed it for me was the original Flux, and I’m still not entirely convinced it wasn’t a fluke. Despite getting twice as many attempts, Flux 2, a much larger model, couldn’t pull it off.
For the mermaid, https://i.imgur.com/R6MbMPX.png sometimes loads but not consistently; it's probably tripping a porn filter of some kind. I need to find another free image host, as Imgur has definitely jumped the shark.
The image shows a mermaid, apparently Asian, lying face down on a beach. A dolphin lies on top of her at a 90-degree angle. There is no interaction between them at all, so it's a definite fail.
The template prompt shown in each comparison is adjusted by an LLM guided by a fine-tuned system prompt for rewriting. The goal is to foster greater diversity while preserving intent, so the image model has a better chance of getting the image right.
As for your suggestion to post all the raw prompts: that's actually a great idea, and too bad I didn't think of it until you suggested it. If you multiply it out, though, there are 15 distinct test cases across 22 models at this point, each with an average of about 8 attempts, so we’re talking thousands of prompts, many of which are scattered across my hard drive. I might try to do this as a future follow-up.
Despite their variation, the prompts are still expressed in natural language.
The idea is that if you can rephrase the prompt and still get the desired outcome, the model demonstrates a kind of understanding. However, additional variation attempts are correspondingly penalized; this is treated as a failure of steering, not of raw capability.
An example might help - take the Alexander the Great on a Hippity-Hop test case.
The starter prompt is this: "A historical oil painting of Alexander the Great riding a hippity-hop toy into battle."
If a model fails this a couple of times (multiple seeds), we might swap in a synonym for hippity-hop; it was also known as a space hopper.
Still failing? We might try to describe the basic physical appearance of a hippity-hop.
Thus, something like GPT-Image-2, which required only a single attempt, scored much higher on the compliance component of the test than Z-Image Turbo, which required 14.
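To make the attempt penalty concrete, here's a toy version of that scoring idea. The linear decay and the `max_attempts` cap are my own assumptions for illustration, not the benchmark's actual formula:

```python
def compliance_score(attempts: int, max_attempts: int = 16) -> float:
    """Toy attempt-penalized score: a one-shot pass earns full marks,
    and each extra attempt needed before a pass lowers the score.
    Purely illustrative, not the benchmark's real formula."""
    if attempts < 1 or attempts > max_attempts:
        return 0.0  # never passed within the budget
    return (max_attempts - attempts + 1) / max_attempts

# A pass on the first attempt vs. a pass on the 14th:
print(compliance_score(1))   # 1.0
print(compliance_score(14))  # 0.1875
```

Any monotonically decreasing function of attempt count would capture the same principle: rewording your way to a pass costs you, but far less than never passing at all.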