I’ve given it different levels of open-endedness, such as “give this flow chart an aesthetic like this mechanical keyboard” or “generate an SVG of this graphic from a 70s slide show”, but it never looks quite like what I have in mind.
In the end, I think you only use this stuff to generate images if you’re prepared to accept whatever comes out on approximately the first try.
When it does, it's more likely to be something popular and unoriginal, where the data is dense, and less likely to be something inventive and strange.
I wish we could work with these models through something like a simple DSL rather than English prose, so we could describe what we want with real precision.
That will likely happen in the specialized fields. We can already see tools like Figma, Mira, and others that generate functional-ish frontend components in full TypeScript with corresponding styles (which are also selectable and configurable in the interface). They aren't quite as free, since they load their own base framework and components to ensure consistency, sanity and error checking, etc., but even then they do generate usable, modifiable components that you can work on with precision in your normal DSL.
For video, this likely exists, or is being worked on as we speak. All specialized domain tools will go towards this model to allow those domain experts to use the tools with the precision they expect AND the agentic gains we already take for granted.
My experience with AI image generation is similar, although with a higher success rate (depending on how accurate you want the result to be); but indeed, filtering is a major part of the process.
A lot of YouTube content is really just talk, so it was easy to generate Sora videos as the visuals while you talked over them.
However, its failure was that it watermarked everything. WTF? Leonardo didn't do that. Neither did other models. So while video gen was excellent, you always had these ridiculous floating watermarks.