Another option would be to generate these large images, split them into a grid, and run inpainting on each "tile" to improve the details. Basically the reverse of the first one.
Both approaches significantly increase costs, but for the second one, using what Images 2.0 can produce as the input could significantly improve the overall coherence.
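To make the tiling step concrete, here is a rough sketch of how the grid could be computed. The function name `tile_boxes`, the 1024-pixel tile size, and the 128-pixel overlap are all assumptions for illustration; the overlap in particular is something I'm adding on the guess that it would help hide seams between tiles, not something any specific tool requires.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, upper, right, lower) boxes that cover a width x height image.

    Hypothetical helper: tile size and overlap are illustrative defaults.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    stride = tile - overlap  # distance between the starts of neighbouring tiles

    def starts(size):
        # One tile is enough if the image is no larger than a tile on this axis.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the far edge
        return positions

    # Row-major order: left to right, then top to bottom.
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(2048, 2048, tile=1024, overlap=0):
        print(box)
```

The boxes use the same (left, upper, right, lower) order as Pillow's `Image.crop`, so each one could be cropped out, sent through whatever inpainting step you use, and pasted back at the same coordinates. The number of tiles is also a quick way to estimate the cost increase: a 4096x4096 image with these defaults means 25 inpainting calls.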