think Apple M6 or M7 with some currently unforeseen, denser memory technology and 256 GB of RAM
a couple of algorithmic improvements to inference or caching that use less RAM for context windows and double token speed again
denser open-source models that pack in more experts while keeping the active parameters per token small
it'll still be expensive, but more like $8,000 to $13,000 instead of $450,000 worth of B200s
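A rough way to sanity-check the 256 GB idea: weight memory scales with total parameters (every expert has to be resident), the KV cache scales with context length, and bandwidth-bound decode speed scales inversely with active parameters. This is a back-of-envelope calculator under those assumptions. The 400B-total MoE, the 34B/17B active sizes, the attention dimensions, and the 800 GB/s bandwidth are all hypothetical placeholders, not specs of any real model or chip.

```python
# Back-of-envelope sizing for a local LLM box.
# Every number below is a HYPOTHETICAL placeholder, not the spec of any real model or chip.

GB = 1e9  # decimal gigabytes


def weight_bytes(total_params: float, bits_per_param: float) -> float:
    """All experts must sit in RAM, so weight memory scales with TOTAL parameters."""
    return total_params * bits_per_param / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float) -> float:
    """One key vector and one value vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


def decode_tokens_per_sec(active_params: float, bits_per_param: float,
                          bandwidth_bytes_per_sec: float) -> float:
    """Rough ceiling if decoding is memory-bandwidth bound: each token reads the
    ACTIVE weights once. Ignores KV-cache reads and compute limits."""
    return bandwidth_bytes_per_sec / (active_params * bits_per_param / 8)


if __name__ == "__main__":
    ram_gb = 256
    weights = weight_bytes(400e9, 4)                  # hypothetical 400B-param MoE at 4-bit
    kv_fp16 = kv_cache_bytes(60, 8, 128, 128_000, 2)  # hypothetical 128k context, fp16 cache
    kv_int8 = kv_cache_bytes(60, 8, 128, 128_000, 1)  # same cache quantized to 8-bit
    for label, kv in (("fp16 KV", kv_fp16), ("int8 KV", kv_int8)):
        total = (weights + kv) / GB
        print(f"{label}: {total:.1f} GB of {ram_gb} GB -> fits: {total <= ram_gb}")

    # Halving the active parameters doubles the bandwidth-bound ceiling.
    for active in (34e9, 17e9):
        tps = decode_tokens_per_sec(active, 4, 800 * GB)  # hypothetical 800 GB/s
        print(f"{active / 1e9:.0f}B active: ~{tps:.0f} tokens/s ceiling")
```

With these made-up numbers, 4-bit weights plus a 128k fp16 cache come to about 231 GB, quantizing the cache to 8 bits saves roughly 16 GB, and halving the active parameters doubles the decode ceiling. Real systems add overheads this ignores, so treat it as an order-of-magnitude check only.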
I’m sorry, but I just can’t imagine us running smaller models 5 to 10 years from now than the ones we’re using today.
If a model needs twice as much memory but serves the same number of customers, the cost per customer has to go up to cover the extra hardware and power. Companies are already starting to impose AI usage limits to keep costs under control.
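The cost argument is simple proportionality. Here is a toy model, assuming serving cost (hardware plus power) scales linearly with memory footprint. That is a simplification, since real fleets have fixed overheads, and the per-GB monthly cost is a made-up figure.

```python
# Toy cost-per-customer model. The linear-in-memory assumption and all
# dollar figures are HYPOTHETICAL, for illustration only.

def monthly_cost_per_customer(memory_gb: float, cost_per_gb_month: float,
                              customers: int) -> float:
    """Total serving cost (assumed proportional to memory) split across customers."""
    return memory_gb * cost_per_gb_month / customers


base = monthly_cost_per_customer(memory_gb=640, cost_per_gb_month=5.0, customers=100)
bigger = monthly_cost_per_customer(memory_gb=1280, cost_per_gb_month=5.0, customers=100)
print(f"${base:.2f} -> ${bigger:.2f} per customer per month")  # $32.00 -> $64.00
```

Doubling memory with a fixed customer count doubles the per-customer cost under this assumption; any fixed overhead would soften the increase somewhat.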
Anthropic and OpenAI are rumored to be considering inference price cuts to retain customers as LLMs commoditize and the market races to the bottom. It reminds me of China's bike-share wars, where companies were losing massive amounts of money but kept running promotions and cutting prices to undercut and drive out their competitors. That story ended with a string of major bankruptcies and giant bike graveyards.
Nvidia's hard pivot to "in the near future, everyone will run their AI at home" suggests they also see the market shifting. AI has already ingested everything out there, so the real challenge now is optimizing the algorithms to pack more useful information into less space.
there's a lot of innovation happening