What really interests me about architectures that are multimodal from the ground up is that we might start to see applications where the different modalities are "facets" of the same thing. Take a coding agent that sees "code" + "IDE" + "memory mapping" + feedback from different plugins as separate modalities. It gets to output in them as well: text where it needs to, actions (not <action>call_something(params)</action> like we have today), and so on. Being able to "sit still" until one of the modalities triggers is really interesting.
We can do these things today, but they're "bolted on" as afterthoughts, and yet they work remarkably well. I wonder how well they'd work if trained in this combined regime, from the ground up.
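To make the "sit still until one of the modalities triggers" idea a bit more concrete, here is a rough sketch of how the bolted-on version might look today: an outer loop that blocks on several event queues and wakes up when any of them fires. Everything here is hypothetical and just for illustration (the `wait_for_any` helper, the modality names, the `"file saved"` event); a natively multimodal model would presumably do this inside the model rather than in glue code like this.

```python
import asyncio


async def wait_for_any(queues):
    """Sit idle until at least one modality queue produces an event.

    `queues` maps a (hypothetical) modality name to an asyncio.Queue.
    Returns a list of (modality_name, event) pairs for every queue that fired.
    """
    # One pending "get" per modality, remembering which queue it belongs to.
    tasks = {asyncio.create_task(q.get()): name for name, q in queues.items()}
    done, pending = await asyncio.wait(
        tasks.keys(), return_when=asyncio.FIRST_COMPLETED
    )
    # Stop listening on the modalities that stayed quiet.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return [(tasks[task], task.result()) for task in done]


async def demo():
    # Assumed modalities, mirroring the coding-agent example above.
    queues = {name: asyncio.Queue() for name in ("code", "ide", "plugin")}

    async def fire_ide_event():
        await asyncio.sleep(0.01)
        await queues["ide"].put("file saved")

    trigger = asyncio.create_task(fire_ide_event())
    events = await wait_for_any(queues)
    await trigger
    return events
```

Running `asyncio.run(demo())` blocks until the simulated IDE event arrives and then returns which modality woke the agent up. The point of the sketch is that the waiting and routing live outside the model here, which is exactly the "afterthought" part.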