Today's LLMs are able pack much more capabilities into fewer parameters compared to 2023. We might still be at the very rudimentary phase of this technology there are low-hanging efficiency gains to be had left and right. These models consume many orders of magnitude more energy than a human brain, this all seems like room for improvement.
The right question: is there a law in information theory that fundamentally prevents a 70B model of any architecture from being as smart as Opus 4.7?
Or so they say.
If it's true then that just shows how far behind the cloud providers are lagging while wasting investor money.
(There's a huge amount of diminishing returns in increasing parameter counts and the intelligent AI company should be hard at work figuring out the optimal count without overfitting.)
You could run it on a cluster of nodes that each do some mix of fetching parameters from disk and caching them in RAM. Use pipeline parallelism to minimize network bandwidth requirements given the huge size. Then time to first token may be a bit slow, but sustained inference should achieve enough throughput for a single user. That's a costly setup of course, but it doesn't cost $900k.
Not sure this is a MBP either.