To run a 8 bit quantized version of that you need roughly 5TB of RAM.
Today that is around 18 NVidia B300. That's around $900,000, without including the computers to run them in.
It's true that the capability of open source models is improving, but running actual frontier models on your MPB seems a way off.
[1] https://x.com/elonmusk/status/2042123561666855235?s=20 (and Elon has hired enough people out of those labs to have a fair idea)
Today's LLMs are able pack much more capabilities into fewer parameters compared to 2023. We might still be at the very rudimentary phase of this technology there are low-hanging efficiency gains to be had left and right. These models consume many orders of magnitude more energy than a human brain, this all seems like room for improvement.
The right question: is there a law in information theory that fundamentally prevents a 70B model of any architecture from being as smart as Opus 4.7?
Or so they say.
If it's true then that just shows how far behind the cloud providers are lagging while wasting investor money.
(There's a huge amount of diminishing returns in increasing parameter counts and the intelligent AI company should be hard at work figuring out the optimal count without overfitting.)
You could run it on a cluster of nodes that each do some mix of fetching parameters from disk and caching them in RAM. Use pipeline parallelism to minimize network bandwidth requirements given the huge size. Then time to first token may be a bit slow, but sustained inference should achieve enough throughput for a single user. That's a costly setup of course, but it doesn't cost $900k.
Not sure this is a MBP either.
In mid-2028 we have N2E/N2P with around 15% greater transistor density than today's N3P, and by EOY2028 we'll likely have A14 with about 35-40% density improvement.
Meanwhile, we'll be on LPDDR6 by that point, which takes M-series Pros from 307GB/s -> ~400GB/s, and Max's from 614GB/s -> ~800GB/s.
Model improvements obviously will help out, but on the raw hardware front these aren't in the ballpark for frontier model numbers. An H100 has 3TB/s memory bandwidth, fwiw
In practice unless you're doing some kind of deep research thing with the cloud, it'll try to optimize mostly for time and get you a good enough answer rather than spending an hour or two. An hour of cloud searching with huge data stores is not equivalent to an hour of local agentic searching, presumably.
I think that problem will improve a little in the coming years as we kind of create optimized data curation, but the information world will keep growing so the advantage will likely remain with centralized services as long as they offer their complete potential rather than a fraction.