To illustrate that there aren't any contradictions (other than the final bit about the reflection in the glass), consider a macro shot showing partial hands, partial tweezers, and pocket watch internals. That much is certainly doable. Now imagine the partial left hand holding a half-submerged pocket watch, fingertips of the right hand holding the front half of tweezers that are clasping a tiny gear, positioned above the workpiece with the drop of water falling directly below. Capture the watchmaker's perspective. I could sketch that, so an image model capable of 3D reasoning should have no trouble.
It's precisely the sort of scene you'd use to test a raytracer. One thing I can immediately think to add is nested dielectrics. Perhaps small transparent glass beads sitting at the bottom of the dish of water with the edge of the pocket watch resting on them, make the dish transparent glass, and place the camera level with the top of the dish facing forward?
https://blog.yiningkarlli.com/2019/05/nested-dielectrics.htm...
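To make the nested-dielectrics idea concrete, here is a rough sketch of the priority-based approach described in that post (after Schmidt and Budge): each medium gets a priority and an index of refraction, the ray tracker keeps a list of media it is currently inside, and boundary hits that are "shadowed" by a higher-priority medium are discarded as false intersections. The function name, the `(priority, ior)` tuple encoding, and the specific priorities are my own illustrative choices, not anything from the linked post; the caller is assumed to push/pop media on the stack itself after each true intersection.

```python
# Sketch of priority-based nested dielectrics (after Schmidt & Budge).
# Each medium is a (priority, ior) tuple; higher priority wins overlaps,
# e.g. a glass bead (priority 2) overriding the water (priority 1) it sits in.

def ior_transition(interior_stack, hit_medium, entering):
    """Return (ior_from, ior_to) for a boundary hit, or None if the hit
    is a 'false intersection' that the ray should pass straight through.

    interior_stack: list of (priority, ior) media the ray is currently inside.
                    Air is the implicit default when the stack is empty.
    hit_medium:     (priority, ior) of the boundary being crossed.
    entering:       True if the ray is entering hit_medium's volume.
    The caller updates interior_stack itself after a true intersection.
    """
    AIR = (0, 1.0)  # lowest priority, IOR ~1

    def topmost(stack):
        return max(stack, key=lambda m: m[0]) if stack else AIR

    current = topmost(interior_stack)
    if entering:
        # Already inside something with higher priority (e.g. crossing the
        # water surface while inside a bead): ignore this boundary.
        if current[0] > hit_medium[0]:
            return None
        return (current[1], hit_medium[1])
    else:
        if hit_medium not in interior_stack:
            return None  # exiting a medium we never entered: ignore
        remaining = [m for m in interior_stack if m != hit_medium]
        nxt = topmost(remaining)
        if nxt[0] > hit_medium[0]:
            return None  # still shadowed by a higher-priority medium
        return (hit_medium[1], nxt[1])
```

With water as (1, 1.33) and a bead as (2, 1.5), a ray entering the bead from inside the water correctly refracts from 1.33 to 1.5, while the water boundary crossing the bead's interior is skipped entirely; that is exactly the bookkeeping a bead-in-a-dish-of-water scene exercises.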
A second thing I can think to add is a flame. Perhaps place a tealight candle on the far side of the dish, the flame visible through (and distorted by) the water and glass beads?
Do you want it to actually look like macro photography (neither of the generated images does)? Then you can't have it sharp throughout, and you won't be able to show the (sharp) watchmaker's face in a reflection, because it would be on a different focal plane.
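A quick back-of-envelope calculation shows why: using the standard close-focus approximation DoF ≈ 2·N·c·(m+1)/m², the total depth of field at real macro magnifications is on the order of a millimetre. (The function name and the example f-number are mine; c ≈ 0.03 mm is the conventional full-frame circle of confusion.)

```python
# Close-focus depth-of-field approximation: DoF ≈ 2*N*c*(m+1)/m^2
# N = f-number, m = magnification, c = circle of confusion in mm.

def macro_dof_mm(f_number, magnification, coc_mm=0.03):
    m = magnification
    return 2 * f_number * coc_mm * (m + 1) / (m * m)

# At 1:1 magnification and f/8 the sharp zone is under a millimetre deep,
# so a reflected face sitting effectively metres away cannot also be in focus.
```

At lower magnification (say m = 0.1, i.e. not really macro any more) the depth of field grows by two orders of magnitude, which is exactly the trade the generated images made by quietly dropping the macro requirement.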
Dropping the macro requirement, you can show a lot more. You can show that the watchmaker is actually old, you can show the reflection, etc.
Something has to give in the prompt, on several of the requirements. The generated images are dropping the macro requirement and inventing some interesting hinged watch-glass contraptions to make sense of it.
Sure, there are pocket watches where the movement is visible from the front (you'd still likely service them from the back, but alas). Even if you did service one from the front, where the glass is, you'd still have to remove the glass to drop in a gear.
Anyway, I think that we aren't really talking about the same thing. I'm nitpicking your prompt while you constructed it to mostly see the performance of the model in novel situations and difficult lighting and refraction environments. And that's fair.
How satisfied are you with the generated image results? What would you do differently when shooting this proposed scene yourself?
The prompt I did mostly to see how it does with the gears and the tweezers, and the perspective of the gears (do they... I don't know the opposite word of distort; straighten? But do they seem like they're actually round, could they work?). I think those are really hard things for AI. The glass distortion, reflections, the DoF etc. were just to see how it approached that, and, like the other comment below said, I tried to pick something that wasn't likely to be in training data, so it reasoned about it more.
Nano was able to spit it out consistently; Images 2 really struggles and has yet to complete one I was satisfied with, whereas nano nails it almost every time. The two images I showed originally are each model's first attempt at the prompt. (Here are the 3 other gens from Images 2: https://drive.google.com/drive/folders/1s8gik_x0B-xDZO6rOqoz...)
How would I shoot it? I wouldn't, fixing a watch in water is a dumb idea. ;)