The point is not to be as good as the multi-trillion parameter model you can host across 72 GPUs (or whatever).

I'm running a 248B model on a paltry amount of hardware and getting plenty of good use out of it.

Sure, the most demanding tasks will demand the best models (and always will). There are still less demanding tasks for other models.

I think some people are fooling themselves that coding, of all tasks, is always going to require the biggest models ever. Again, maybe some coding tasks will, but the majority of business CRUD apps probably don't. Same goes for virtually any other type of task. The biggest models are really only useful for the most complex tasks.

reply
If you wouldn't mind, could you explain a bit what the 248B model is good for, and where it breaks down and you need something better? I hear this take often, but it is always a fleeting remark, so I have no idea at all what 'useful' looks like.
reply
To answer this and my sibling, it's DeepSeek V4 Flash at native FP4 quantization, on two Nvidia DGX Sparks. Which is a bit of kit but still paltry relative to the data centre. ~40 TPS generation, ~2000 TPS prompt processing, which makes it feel approximately as fast as typical APIs.
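
As a rough sketch of what those two rates mean for a single request: the estimate below just adds prompt-processing time to generation time. The prompt and reply sizes are hypothetical examples, not measurements, and the function name is made up for illustration. It also ignores network latency, queueing, and any prompt caching.

```python
# Back-of-envelope latency estimate from the throughput figures quoted above.
# Assumption: prefill and decode run back to back with no overlap or caching.

def response_seconds(prompt_tokens, output_tokens,
                     prefill_tps=2000.0, decode_tps=40.0):
    """Estimated wall-clock seconds for one request."""
    prefill = prompt_tokens / prefill_tps   # time to process the prompt
    decode = output_tokens / decode_tps     # time to generate the reply
    return prefill + decode


if __name__ == "__main__":
    # Hypothetical request sizes, chosen only for illustration.
    for prompt, output in [(2000, 200), (8000, 500), (32000, 1000)]:
        secs = response_seconds(prompt, output)
        print(f"{prompt:>6} prompt + {output:>4} output tokens: {secs:.1f} s")
```

For typical coding requests the generation term dominates, which is why the ~40 TPS figure matters more for how it feels than the prompt-processing rate.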

I primarily use it with my own harness for coding. I'm not going to say it will compete with Opus in the most challenging domains, because it won't, but I will say that there's a reasonable likelihood that Opus is used for tasks that a model like Flash could comfortably handle at 1/100th the cost.

So far I've only seen it struggle at tasks that I myself would struggle with. Tasks that I can describe the shape of the solution for, it has a high success rate at implementing.

Useful is going to be different for everyone. I'm not working on the hardest problems, I don't need the best models.

reply
In my experience they require much more hand holding and more specific directions, with fewer possibilities to interpret a command in several ways. You do the planning, keep an eye on what they're producing, and they do the legwork. It's not that their knowledge of Java or PHP or what have you is lacking, it's the long horizon planning that you have to do yourself. Technically they're good. You just have to do more thinking and more reviewing yourself. YMMV.
reply
Depending on quantization I figure they need at least a p4 and likely a p5 EC2 (or similar instance in another provider) for a model with that many parameters. Maybe they are hosting on bare metal but I imagine not. Those instance types (assuming not using spot) are quite expensive to run.
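
A quick way to see how much quantization moves that requirement is to compute the weight footprint as parameters times bits per weight. This is a minimal sketch: it counts weights only and ignores KV cache, activations, and runtime overhead, so real memory needs are higher. The helper name and the list of precisions are illustrative assumptions.

```python
# Weight-only memory footprint for a model at different precisions.
# Assumption: every parameter is stored at the same bit width.

def weight_gb(params, bits_per_weight):
    """Size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 248e9  # the 248B model mentioned upthread
    for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name}: {weight_gb(params, bits):.0f} GB of weights")
```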
reply
It’s perfectly reasonable to believe that a law of diminishing marginal returns will kick in at some point (if it hasn’t already), and that what at one point looked like an exponential may start looking like an s-curve.

I do not see how being experienced in engineering, or having higher studies in computer science and economics should make that view less common.

reply
If we’re defining on-prem as fitting in a rack - then every frontier model can be hosted on-prem.

Now this might not be the most cost effective (and may require a bit of extra power), but you only need a datacenter for training or cost optimization.

reply
The recent MiMo-V2.5-Pro-UltraSpeed can be served from 8 GPUs, which is certainly within the reach of sophisticated on-prem setups. https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
reply