8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.
It's going to be $120K to $150K to build or buy a system to run this.
But hey you could save on heating?
A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.
In the US it's common to get 200A 120/240V split-phase service. We're talking about the wiring inside the house, though.
How do you think everyone here is charging their electric cars at home and running our AC and electric cooktops at the same time if we didn't also have that? :)
You need to derate for constant loads here, and I assume you have to do that in NZ as well.
So, no, not a "uniquely US issue".
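A minimal sketch of the derating point, assuming the common US rule of thumb that a continuous load should stay at or below 80% of the breaker rating. The 600 W per GPU, the 1000 W of system overhead, and the 50 A circuit are illustration values, not figures from this thread.

```python
# Assumption: continuous loads are limited to 80% of the breaker rating.
CONTINUOUS_DERATE = 0.8


def continuous_capacity_watts(breaker_amps, volts, derate=CONTINUOUS_DERATE):
    """Largest continuous load (in watts) the circuit should carry after derating."""
    return breaker_amps * volts * derate


def min_breaker_amps(load_watts, volts, derate=CONTINUOUS_DERATE):
    """Smallest breaker rating (in amps) that keeps a continuous load within the derated limit."""
    return load_watts / (volts * derate)


if __name__ == "__main__":
    gpu_count = 8          # from the thread
    gpu_watts = 600        # hypothetical per-GPU draw
    overhead_watts = 1000  # hypothetical CPU, fans, PSU losses
    load = gpu_count * gpu_watts + overhead_watts

    print(f"Estimated continuous load: {load} W")
    print(f"Minimum breaker at 240 V: {min_breaker_amps(load, 240):.1f} A")
    print(f"Usable capacity of a 50 A / 240 V circuit: {continuous_capacity_watts(50, 240):.0f} W")
```

Local electrical codes differ, so treat the 0.8 factor as a parameter rather than a constant.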
Or even just electricity costs vs token cost
The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 Macbook Pro that beats GPT-4 from 2023 hands-down.
I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.
I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency mean that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and a $100k GPU cluster.
We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.
I can see the next 10 years produce the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn as our existing technologies for storage, memory, etc, don't scale up enough and new technologies become worth developing because there's demand for them.
My productivity profits from the best intelligence available, a decent context size, and a batch size of four.
While my MacBook has 48 GB of RAM, not only do I want the above requirements at a decent speed, but I also need my machine to run the development tools and test suites, ideally without the fans blasting at full load.
For the foreseeable future I will stay with providers rather than local inference, apart from niche use cases.
I'm in Australia, so we're probably not getting access to Fable again. We're learning that a faster model + better harness/framework > smarter model. So being able to run GLM5.2 locally and super-fast would be great.
But the existing tech we're using for 16GB probably isn't going to scale to 16TB at a reasonable price point. And the price point is relatively inelastic - people are used to paying <$5K for their computers, and they're not going to go much above that. You'll get early adopters paying $10K or more for a machine that large, but not the early majority. And even then, obviously, $10K is not going to buy you a 16TB memory machine.
So there's room for a new technology to come in, where there wasn't previously. This is what happened all through the 90's, and we churned through a bunch of standards and technologies to try and keep up with demand.
Are they?
I suspect AI labs are buying stuff not just for their own use, but to make local use too expensive to be an option :-( And they can always make the "best" frontier model even bigger (though only fractionally better) so it's always out of reach of local use, while consumer laptops have nearly the same amount of memory they had a decade ago.
[ASCII chart: model size by year, 2020 to 2026]
[ASCII chart: cheap RAM by year, 2020 to 2026]
Prices aren't going down, and consumer platforms are being shipped with less RAM so we can be sold cloud products. This isn't going to happen.
Can you please explain to me how you're going to fit 700B-1T params in 64GB of RAM? You realize there are memory requirements proportional to model size?
You don't. What they're saying is that today's small models (that fit on consumer hw) are better than yesteryear's top models. GPT4 was reportedly 8x 220B (~1.6T) MoE, and today you can run a 30-120B model that beats it handily in real-world tasks.
Similarly for 4-20B models beating GPT3 (175B) and so on.
There is a sweet spot of "good enough" that the small models can reach, where you get equivalent tasks solved fully locally. They'll never touch SotA, but they'll reach the SotA of 2-4 years earlier, which, depending on the task, can be "good enough".
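A rough sketch of why the parameter counts above matter for RAM: the weights alone take roughly (parameter count) x (bits per parameter) / 8 bytes. The function name and the chosen bit widths are illustrative, and the estimate ignores KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Weights only: KV cache, activations, and runtime overhead are not counted.
    """
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Model sizes mentioned in the thread, at 16-bit and 4-bit weights.
    for params in (30, 120, 700, 1000):
        for bits in (16, 4):
            size = weight_memory_gb(params, bits)
            print(f"{params:>5}B params @ {bits:>2}-bit: ~{size:,.0f} GB")
```

On this estimate a 30B model at 4 bits needs about 15 GB for weights, while a 1T model at 4 bits needs about 500 GB, which is the gap the two comments are arguing about.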
Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.
If that's anywhere near right then it seems like a no brainer.
The input cache hit tokens are incredibly cheap for them (incredibly high margin too, except for DeepSeek).
And input tokens are in the middle. Input tokens can be processed very efficiently.
Also his math is wrong. $100k gets you 22.7B output tokens at $4.4/M which is how much GLM 5.2 costs.
At 500/s, 22.7B tokens take roughly 526 days, or about 1.44 years, which is much less than the life of the hardware.
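The break-even arithmetic above can be checked with a short sketch. It uses only the figures quoted in the comment ($100k budget, $4.4 per million output tokens, 500 tokens/s); the function names are made up for illustration, and electricity, input tokens, and concurrency are deliberately left out.

```python
SECONDS_PER_DAY = 86_400


def tokens_for_budget(budget_usd, price_per_million_usd):
    """Number of output tokens a budget buys at a given API price per million tokens."""
    return budget_usd / price_per_million_usd * 1_000_000


def days_to_generate(tokens, tokens_per_second):
    """Days of continuous generation needed to produce `tokens` locally."""
    return tokens / tokens_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    tokens = tokens_for_budget(100_000, 4.4)
    days = days_to_generate(tokens, 500)
    print(f"Tokens bought for $100k: {tokens / 1e9:.1f}B")
    print(f"Days at 500 tokens/s: {days:.0f}")
    print(f"Years at 500 tokens/s: {days / 365:.2f}")
```

Changing `tokens_per_second` shows how sensitive the payback period is to sustained throughput.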
concurrency
Oil workers buy $100k trucks they do not much with. Why not $100k in computers?
Some, and the market fluctuates a ton.
> corvettes
Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.
Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even so, they appreciate at an incredibly slow rate.
I can't speak for watches but I'd be surprised if it wasn't the same situation.
At least the gpus can create value after you buy them before they are worthless.
I assume (since they claim they are selling the batteries to AI data centers), they’ll produce some sort of EV >= F150 once the bubble pops, and we get a new president.
EV is a separate thing. Vastly overmarketed for the technology as it exists today.
Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?
The compression is almost certainly in part specific knowledge getting fuzzed.
Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.
Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.
It's hardly self-evident, and your counter-example is hardly applicable.
The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".
not to mention that the first 10^50 digits of pi compress to quite a small formula, so not much information there to begin with from a Shannon/Kolmogorov perspective
The memorization of say 100000 world facts through training texts, which enrich model associations all around, is absolutely not the same as rote memorization of 10^50 digits of pi. Not for a human, and even more so, not for an LLM.
An LLM trained with digits of pi and one trained with books and posts, even if they both have the exact same amount of bytes of training input, would not be comparable in any way in utility and reasoning capabilities.
>There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.
Which is irrelevant. Anyway, the amount of information that doesn't form useful logical associations is even larger (e.g. actual human books vs possible permutations of characters and spaces). Just like those (random) possible permutations of characters aren't good for LLM input to get logical associations out of it, pi isn't either (logical associations of the kind we care for and expect, not of the kind related to pi's sequences).
Also it's not only not self-evident, it's also apparently wrong.
You're making the assumption that anything produced by a human necessarily contains more useful information than random noise does. This is false. Even when only considering human intelligence, it's entirely possible to absorb information that makes you stupider, not smarter; learning is only valuable if you actually learn the right things.
I'd say this exchange is a fine example of that :)
We don’t understand AI or natural intelligence well enough to make such statements. As for self evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seem to pretty directly tank your hypothesis.
If you believe this then you don't understand AI or natural intelligence well enough to refute my statements either.
Perhaps you're trying to refer to something specific by "cross-domain" competence, but firstly, humans vastly overestimate the extent to which experts in one domain can be trusted to speak accurately on topics in other domains (this is a form of authority bias), and secondly, real cross-domain expertise is a result of pre-existing metacognitive ability such as keen reasoning ability, intense focus, and learning-how-to-learn. In other words, Leonardo da Vinci was not a genius because he was a polymath; he was a polymath because he was a genius.
Likewise, I see no evidence that "generalist models" have proven anything about their ability over domain-specific ones other than that the big AI firms seem to believe that "generalist models" are their golden ticket to AGI and therefore a quintillion-dollar valuation. It's obvious in the long run that tools built for specialized tasks will outperform generalist tools for specific tasks, in the same way that a multi-axis CNC mill does not outperform your bog-standard lathe for shaping objects with rotational symmetry, or perhaps more pertinently to this conversation, how no LLM will ever outperform Stockfish at chess.
Assuming demand doesn't keep on increasing. Even Google apparently has trouble getting enough capacity.