Before they removed it, I was using Groq's Kimi K2 model for a chatbot on a small community site/chat. It was really good: it seemed to have incredibly vast general world knowledge, and the speed (400 tok/s, if I remember right) meant that chat users got a response instantly, which was a much better experience than with other SOTA models at the time.

On the bright side, it looks like Cerebras might be serving Kimi K2.6 at 1000 tok/s soon: https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise

by gpugreg6 hours ago|

Those were amazing times. You could vibe code an entire prototype in seconds (200 tps). With Qwen3.6-35B-A3B and MTP, you can program at that speed on a single GPU at home now, but Kimi K2 is of course much smarter at almost 30 times the size.

I'm also looking forward to the Cerebras Kimi K2.6 release, which should be even better at 1000 tps. It is hard to overstate how important speed is for programming. Instead of waiting a few minutes for a task to finish, you get the result almost instantly, and you don't have to context switch from whatever else you were working on in the meantime.
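To put rough numbers on that, here is a minimal sketch of the arithmetic. The 20,000-token task size and the 50 tok/s baseline are assumptions picked for illustration, not measurements, and the estimate covers pure generation time only: it ignores time to first token, prompt processing, network latency, and tool calls.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate pure decode time for a response of a given length."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical coding task that produces 20,000 output tokens (assumed figure).
TASK_TOKENS = 20_000

# 200, 400 and 1000 tok/s are the speeds quoted in this thread;
# 50 tok/s is an assumed baseline for a slower endpoint.
for tps in (50, 200, 400, 1000):
    seconds = generation_seconds(TASK_TOKENS, tps)
    print(f"{tps:>5} tok/s -> {seconds:6.1f} s")
```

Under these assumptions, the same task drops from several minutes at the slow baseline to well under a minute at the speeds quoted in this thread, which is roughly the difference between context switching and staying on the task.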

I hope they will make it available to regular customers.

by throw12345678915 hours ago|

But too much speed doesn't let you build up context while the LLM is working. It's a double-edged sword.

by trouve_search6 hours ago|

Cerebras are only serving Kimi to dedicated endpoint customers; for that you need a >$5m annual deal with them.

Cerebras also seems to be killing off their regular APIs. They're deprecating models, and GLM is still stuck on GLM 4.7, a whole 2 versions behind.

by tiborsaas5 hours ago|

I was quite baffled that they removed it instead of doubling down on Kimi and serving the latest models.

Thanks for the tip, looks fire.

by throw12345678915 hours ago|

> The largest model they serve now is the relatively minuscule gpt-oss-120b

This model will run on any laptop with 128GB RAM, wow.