My rough guess is that they set up a few workflows combining analytical and ML-based image manipulations to generate the training set. For instance, you can get a long way by having a segmentation model identify and mask various objects, then applying simple analytical manipulations to the masked areas (such as changing their color), or diffusing new content into them via masked inpainting with another image diffusion model. In this way, you can create training pairs that your editing model learns to invert. For example, to get a pair for "turn the woman's hair blonde", start with a photo of a blonde-haired woman, mask the hair, and have a diffusion model turn it brown; the brown-haired result and the original image now form an edit the editing model can learn to invert.
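To make the idea concrete, here is a minimal sketch of that kind of pipeline, assuming a text-prompted segmentation model (CLIPSeg) plus an off-the-shelf inpainting model from diffusers. The checkpoint names, prompts, and thresholds are illustrative choices on my part, not anything confirmed about how they actually did it:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import StableDiffusionInpaintPipeline

# 1. Text-prompted segmentation: produce a mask for the region to edit.
seg_processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def mask_region(image: Image.Image, concept: str, threshold: float = 0.4) -> Image.Image:
    inputs = seg_processor(text=[concept], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = seg_model(**inputs).logits  # low-res heatmap for the prompt
    probs = torch.sigmoid(logits).squeeze()
    mask = (probs > threshold).float().numpy() * 255
    return Image.fromarray(mask.astype("uint8")).resize(image.size)

# 2. Masked inpainting: diffuse new content into the masked region only.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# Start from an image of a blonde-haired woman and corrupt the hair region.
original = Image.open("blonde_woman.jpg").convert("RGB").resize((512, 512))
hair_mask = mask_region(original, "hair")
corrupted = inpaint(
    prompt="a woman with brown hair",
    image=original,
    mask_image=hair_mask,
).images[0]

# 3. The training pair inverts the corruption: the editing model learns
#    to map (corrupted image, instruction) -> original image.
training_pair = {
    "input_image": corrupted,
    "instruction": "turn the woman's hair blonde",
    "target_image": original,
}
```

Scale that over many images, concepts, and manipulations (recoloring, removal, inpainted replacements) and you get a large paired dataset without any human editing.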