A100 -> H100 was >3x tokens per joule, H100 -> B200 >10x. There are significant low-hanging fruit still available in architectural efficiency, and the vendors are chasing them.
This is the big risk for AI companies that I feel is not being sufficiently priced in. Almost none of the investments they are making are durable, the depreciation schedules for everything but the real estate should be less than 24 months. Until the hardware is stable enough that you only get double-digit % improvements per generation, it should almost be counted as opex.
E.g. grok isn't truly multi-modal, it has a callable tool that is a separate VLM it invokes on image URLs or files (for a long time it was grok-1.5v, but I think they have upgraded now, it was pretty bad).
And then you have the small summarizer models for the CoT/thought traces, the guidable summarizer models for the standard browse tools, etc.
There's a ton of stuff that can use an aging GPU.