Yes, but unfortunately a lot of the discussion that people participate in is not from a corporate point of view, but from a normal consumer level.

And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (taking the privacy issues into account).

Everybody wants GPT 5.5/Opus 4.8 Max levels on a budget that simply is not realistic. And GLM fits in at that 4.8 medium/low level.

But then people do not want to be told that running a 750B model in Q2 or Q1 is just going to destroy the model's accuracy. And that is still going to cost them 5k+ for that reduced model.
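For a rough sense of scale, here is a minimal back-of-the-envelope sketch of the weight footprint at different quantization levels. It assumes nominal bit widths (real quant formats usually carry some extra bits per weight for scales) and ignores the KV cache and runtime overhead; the function name is just illustrative.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (nominal bits, no overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal bit widths are an assumption; actual quant formats differ somewhat.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2), ("Q1", 1)]:
    print(f"750B @ {label}: ~{weight_gib(750, bits):.0f} GiB")
```

Even at a nominal 2 bits per weight, that works out to roughly 175 GiB for the weights alone, which is why the reduced model is still not cheap to host.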

The whole local LLM landscape, from a consumer point of view, is just filled with odd people. lol.

Corporations really benefit from those models, because spending $90k on a server is a deductible expense. And they are billed at token prices anyway by all the major providers, so it's an even faster ROI on that hardware.

I am surprised that nobody has figured out how to make a business of selling leftover capacity from corporate LLM installations, because there is easily 12h+ a day just wasted (unless it's a large corp that has people in all timezones).

> GPUs are extremely underutilized if you launch just 1 generation stream

why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?

I have no intuition yet how this works under the hood.

Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. I'm not sure of the exact mechanism, but e.g. llama.cpp's llama-server can do this (you tell it how many slots to have when starting it, then fire HTTP requests at it and it batches them together when it can).

Waiting for the hooman (or tool calls) won't help either, of course.
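To make the "fire HTTP requests at it" part concrete, here is a minimal client-side sketch. It assumes a llama-server already running locally with several slots (e.g. started with `--parallel 4`) and listening on port 8080, and it uses the server's `/completion` endpoint with the `prompt` and `n_predict` fields. The URL, helper names, and worker count are illustrative assumptions, not anything from the thread.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Assumed address of a locally running llama-server; adjust to your setup.
SERVER_URL = "http://localhost:8080/completion"


def build_request(prompt: str, n_predict: int = 64) -> request.Request:
    """Build a POST request carrying one prompt for the /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )


def complete(prompt: str) -> str:
    """Send one prompt and return the generated text."""
    with request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["content"]


def complete_many(prompts, send=complete, workers: int = 4):
    """Issue the prompts concurrently so the server has a chance to batch them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

With a server up, something like `complete_many(["Hello", "Write a haiku", "2+2="])` sends the three prompts at once; whether they actually get batched depends on the number of slots the server was started with.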

The mechanism is that generating tokens (the "decode" phase) in an LLM is limited by the memory bandwidth needed to read the weights, so computing multiple streams at once amortizes that bandwidth across the streams, as long as you can keep all their contexts in RAM. This is most true for dense models and the always-on expert in MoE models, or when you have significantly more streams than experts for MoE models.

In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.
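A toy roofline-style sketch of this, for a dense model only. All the numbers are made-up assumptions (a hypothetical 27B model with 8-bit weights, about 2 FLOPs per parameter per token, 1 TB/s of memory bandwidth, 200 TFLOP/s of compute), and KV-cache reads are ignored, so treat it as intuition rather than a benchmark.

```python
# Hypothetical hardware and model numbers, chosen only for illustration.
WEIGHT_BYTES = 27e9          # 27B parameters at 1 byte each (8-bit weights)
FLOPS_PER_TOKEN = 2 * 27e9   # roughly 2 FLOPs per parameter per token
MEM_BANDWIDTH = 1e12         # bytes per second
PEAK_FLOPS = 200e12          # FLOP per second


def decode_step_seconds(batch: int) -> float:
    """Time for one decode step that produces one token for each stream."""
    # The weights are read once per step, no matter how many streams share it.
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    # The arithmetic grows with the number of streams in the batch.
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)


def tokens_per_second(batch: int) -> float:
    """Aggregate decode throughput across all streams."""
    return batch / decode_step_seconds(batch)


for batch in (1, 8, 64, 200, 400):
    print(f"batch={batch:4d}  ~{tokens_per_second(batch):8.0f} tok/s total")
```

Under these assumptions throughput scales almost linearly with the number of streams until the compute time catches up with the weight-read time (around 100 streams here), then flattens. Prefill behaves like a large batch from the start, since all prompt tokens go through the weights in one pass, which is why it hits the compute limit much sooner.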
