I do agree, however, that the Flux2 family is the SOTA at the moment. Running it locally via something like Comfy gets incredible results.
But if you want real precision (especially for complex polygonal masks), or if you're concerned about image degradation over multiple edit rounds, you'll slam against the limitations of those approaches.
Even with SOTA proprietary models, repeatedly editing and re-uploading an image is like making a copy of a copy of a VHS tape: you're gonna see subtle color shifts and quality loss steadily accumulate.
At that point, you either need to put in the manual work in something like Photoshop (bringing elements in as layers and masking them properly) or, as you mentioned, use a model or workflow that properly supports masking.
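For what it's worth, here's a rough sketch of what "masking properly" buys you. It's a toy in pure Python, not how any particular model or Comfy node does it: the 6x6 "images", the triangle, and the helper names (`point_in_polygon`, `polygon_mask`, `composite`) are all made up for illustration. The idea is that you only take the edited pixels inside the polygon and keep the original pixels everywhere else, so whatever drift the model introduced outside the mask never makes it into the result.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_mask(width, height, polygon):
    """Boolean mask: True where the pixel centre falls inside the polygon."""
    return [
        [point_in_polygon(x + 0.5, y + 0.5, polygon) for x in range(width)]
        for y in range(height)
    ]


def composite(original, edited, mask):
    """Take edited pixels where the mask is True, original pixels elsewhere."""
    return [
        [e if m else o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, mask)
    ]


# Hypothetical toy data: a flat 6x6 RGB image and an "edited" copy in which
# the whole frame has drifted slightly, as it might after a round trip.
WIDTH, HEIGHT = 6, 6
original = [[(10, 20, 30)] * WIDTH for _ in range(HEIGHT)]
edited = [[(12, 19, 33)] * WIDTH for _ in range(HEIGHT)]

# The region we actually meant to change.
triangle = [(1, 1), (5, 1), (1, 5)]
mask = polygon_mask(WIDTH, HEIGHT, triangle)
result = composite(original, edited, mask)
```

On real images you'd do the same thing with an imaging library (Pillow's `Image.composite` takes two images and a mask, for example) or with a layer mask in Photoshop. Either way, pixels outside the mask stay bit-identical to the original no matter how many rounds you do, which is exactly what the re-upload loop can't give you.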