Those were amazing times. You could vibe code an entire prototype in seconds (200 tps). With Qwen3.6-35B-A3B and MTP (multi-token prediction), you can now program at that speed on a single GPU at home, although Kimi K2, at almost 30 times the size, is of course much smarter.

I'm also looking forward to the Cerebras Kimi K2.6 release, which should be even better at 1000 tps. It is hard to overstate how important speed is for programming. Instead of waiting a few minutes for a task to finish, it is just done instantly, and you don't have to context switch away from whatever else you were working on while you wait.

I hope they will make it available to regular customers.

But too much speed doesn't give you time to build up context while the LLM is working. It's a double-edged sword.
Cerebras is only serving Kimi to dedicated-endpoint customers, and for that you need an annual deal of more than $5M with them.

Cerebras also seems to be killing off their regular APIs. They're deprecating models, and GLM is still stuck at GLM 4.7, a whole two versions behind.

I was quite baffled that they removed it instead of doubling down on Kimi and serving the latest models.

Thanks for the tip, looks fire.
