These denoising models, the autoencoders more directly so, work by (lossily) mapping the raw input to a very low dimensional representation. The other part generates the desired image back from the low-d representation.
The problem is that nothing, in the vanilla versions, prevent the the low-d version to be a semantics representation such as, Moon, dark hair etc and the generative part to take cues from the semantic representation to a generated sub-image.
The Samsung phone Moon image was likely a result of deliberate choice / company policy, but these things can happen without explicit intent.
And this is also going to depend on the level of compression being chosen. Obviously, the greater the compression, the lesser the fidelity. The lesser the compression, the greater the fidelity.