______
/ \
| O O |
| __ |
\______/
||||
/--||--\
/ || \
| || |
| / \ |
\__/ \__/
|| ||
|| ||
/ | | \
/_/ \_\
The issue is the same reason we don't use LLMs for image generation, even though they can nominally do it.
Image generation seems to need some ability to revise the output in place, and it needs a big-picture view to make local decisions. It doesn't lend itself to being emitted pixel by pixel or character by character.
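To make the "revise in place" point concrete, here is a toy sketch in Python. It is only an illustration, not how any image model or LLM actually works, and the helper names (`make_canvas`, `draw_box`, `render`) are made up for this example. The drawing lives on a mutable grid, so a later step can go back and change a row that was already "finished". A strictly left-to-right character stream cannot do that.

```python
# Toy illustration only: a mutable character grid that can be edited
# in any order. All function names here are hypothetical.

def make_canvas(width, height, fill=" "):
    """Return a height x width grid of single characters."""
    return [[fill] * width for _ in range(height)]

def draw_box(canvas, top, left, height, width):
    """Draw a box outline; needs the full size up front (the big-picture view)."""
    bottom, right = top + height - 1, left + width - 1
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            on_row_edge = r in (top, bottom)
            on_col_edge = c in (left, right)
            if on_row_edge and on_col_edge:
                canvas[r][c] = "+"
            elif on_row_edge:
                canvas[r][c] = "-"
            elif on_col_edge:
                canvas[r][c] = "|"

def render(canvas):
    """Join the grid into text, trimming trailing spaces on each row."""
    return "\n".join("".join(row).rstrip() for row in canvas)

canvas = make_canvas(9, 3)
draw_box(canvas, top=0, left=1, height=3, width=7)  # head outline
canvas[1][3] = "O"                                  # left eye
canvas[1][5] = "O"                                  # right eye
canvas[0][4] = "^"  # in-place revision: go back and edit the first row

print(render(canvas))
```

The last assignment is the part a character-by-character generator has no equivalent for: by the time it is writing the eyes, the top row has already been emitted and is fixed.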