So yeah, I think models on local hardware will be quite common soon among the tech-savvy (such as people creating software).
A100 -> H100 was >3x tokens per joule, H100 -> B200 >10x. There is still significant low-hanging fruit available in architectural efficiency, and the vendors are chasing it.
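Taking those two figures at face value, the gains compound across generations. A rough sketch in Python, treating the quoted ">3x" and ">10x" as lower bounds (the function name is made up for illustration, and the numbers are only the ones quoted above, not measurements):

```python
from math import prod

def cumulative_gain(per_generation_gains):
    """Multiply per-generation efficiency gains into one overall factor."""
    return prod(per_generation_gains)

# Lower bounds quoted above: A100 -> H100 (>3x), H100 -> B200 (>10x).
gains = [3.0, 10.0]
total = cumulative_gain(gains)
print(f"A100 -> B200: more than {total:.0f}x tokens per joule")

# Energy per token on the newest part, relative to the oldest.
relative_energy = 1 / total
print(f"Energy per token: under {relative_energy:.1%} of the A100 figure")
```

If the quoted figures hold, that is two generations for a >30x drop in energy per token.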
This is the big risk for AI companies that I feel is not being sufficiently priced in. Almost none of the investments they are making are durable; the depreciation schedules for everything but the real estate should be less than 24 months. Until the hardware is stable enough that you only get double-digit % increases per generation, it should almost be counted as opex.
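To put rough numbers on the depreciation point, here is a minimal straight-line sketch in Python. The $1M purchase and the 60-month comparison schedule are assumptions picked for illustration, not anyone's actual books:

```python
def monthly_depreciation(capex, schedule_months):
    """Straight-line depreciation: an equal charge every month."""
    return capex / schedule_months

def book_value(capex, schedule_months, months_elapsed):
    """Remaining book value after some months, floored at zero."""
    charged = monthly_depreciation(capex, schedule_months) * months_elapsed
    return max(capex - charged, 0.0)

capex = 1_000_000.0  # hypothetical GPU purchase, not a real figure

# 24 months is the schedule argued for above; 60 is an assumed longer one.
for schedule in (24, 60):
    charge = monthly_depreciation(capex, schedule)
    left = book_value(capex, schedule, months_elapsed=24)
    print(f"{schedule}-month schedule: ${charge:,.0f}/month, "
          f"${left:,.0f} still on the books after 24 months")
```

On the longer schedule, more than half the purchase price is still carried as an asset at the point where, by the argument above, the hardware has already been leapfrogged.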
E.g. Grok isn't truly multimodal: it has a callable tool, a separate VLM that it invokes on image URLs or files (for a long time that was grok-1.5v, which was pretty bad, but I think they have upgraded since).
And then you have the small summarizer models for the CoT/thought traces, the guidable summarizer models for the standard browse tools, etc.
There's a ton of stuff that can use an aging GPU.
I do hope you're right that it will get cheaper over time (it should), but right now 32GB of VRAM is not affordable to a lot of people. You're talking ~$4500 just for the GPU, or $800-ish used if you can find one.
It's a tad less efficient and a bit more of a hassle, but still a good experience for only a fraction of the price.
I imagine having multiple providers competing will drive down the price of hosted versions of open-weight models drastically.
Gotta remember inflation here.
$1K in 1995 was roughly equivalent to $2K now and wouldn't have been a particularly "good" machine then.
In 1982 the Commodore 64 started at about $600, also roughly $2K today.
If you outgrew that, beefier machines back then were A LOT. It was easy to find $2k+ towers and (especially) laptops even into the 2000s, and a lot of those would be $5K+ equivalent today.
Especially because the world is likely to persist, at least for a while, in a state where demand for computing hardware drastically exceeds supply, resulting in high hardware prices. So why wouldn't you want to max out utilisation and amortize costs, at least for typical (non-sensitive) use cases?
Certainly the transistors/chip or transistors/$ or flops/$ have not been progressing at the same exponential rate as during 1970-2010. There is still progress, but it's rather slower.
Possibly it's the same price range, allowing for inflation.
> It was only in 2025, as memory prices began an unprecedented surge, that the memory makers started to build new fabs targeted at HBM, all slated to start producing chips in 2027 or 2028.
If you want to argue that this is different from all previous RAM shortages, you can, but the burden of proof is on you to show the difference.
This time demand doesn't stop. There is exponential demand for tokens.
Started with computers around 2009 and later bought an oldish computer (a Pentium 4 PC) for the equivalent of 50 USD. Code::Blocks and Python's IDLE were free at the time (C and Python were the first languages I learned). The barrier to programming has always been low, as the only things you needed were books (the internet made things easier) and access to a PC (I had friends with laptops, and my school had a lab).