But truly, using Cerebras at ~2k tokens/s with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.
Of course the fast models tend not to be the SOTA ones, but if they were - high quality and near-instant thinking - that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.
It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.), and when you tell it to actually do those changes it's pretty much done in half an hour.
Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.
I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.
It doesn't estimate.
It generates tokens that read like estimates associated with the context in its training material.
What would you expect the generator to output instead?
https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...
E.g. occasionally it makes the dumbest mistakes you've ever seen and can't correct them. However, it's fairly rare, and if you know the domain really well, occasionally popping into the code and pushing it towards the correct solution takes like 20 seconds or whatever.
So the speed you can move with flash + high domain knowledge beats opus by a mile in my experience.
I tried to switch back to 4.8 for a bit when it came out. It feels so bad waiting 20 mins for a mediocre solution when I could have had everything complete - with multiple iteration cycles - in flash in like 3-5 mins.
Basically I never have to wait - yes, I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super-fast, high-involvement development cycle.
I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.
I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.
This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.
However, a faster LLM isn't enough. You also need fast compiles and fast tests.
(I should go measure this now, I'm curious)
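A back-of-the-envelope version of that point, as a rough sketch: if one iteration is generate, then compile, then test, speeding up only the generation step leaves the rest of the wall time untouched (basically Amdahl's law applied to a single cycle). The function names and all the timings here are made up for illustration, not measurements.

```python
def cycle_seconds(generate_s: float, compile_s: float, test_s: float,
                  llm_speedup: float = 1.0) -> float:
    """Wall time of one iteration when only generation is sped up."""
    return generate_s / llm_speedup + compile_s + test_s


def overall_speedup(generate_s: float, compile_s: float, test_s: float,
                    llm_speedup: float) -> float:
    """How much faster the whole cycle gets for a given LLM speedup."""
    before = cycle_seconds(generate_s, compile_s, test_s)
    after = cycle_seconds(generate_s, compile_s, test_s, llm_speedup)
    return before / after


# Illustrative (assumed) timings: 60 s generating, 30 s compiling, 90 s testing.
# A 10x faster model only shrinks the 60 s part of the 180 s cycle.
print(f"{overall_speedup(60, 30, 90, llm_speedup=10):.2f}x")
```

With those assumed numbers the cycle goes from 180 s to 126 s, so well under 2x overall even though the model got 10x faster. Plug in your own measured compile and test times to see where your loop actually spends its time.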
So long as AI lives in server farms, humans will be needed for tasks in the physical world.
It's only if we combine AI with robots that things get really dicey.
This is brilliant as it reminded me of a famous Hitchhiker's quote:
"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move." (From The Restaurant at the End of the Universe, book 2.)
Maybe we are stuck in an eternal loop
Basically the entire token-maxxing AI hype train in a nutshell. Lovely!
There can't be many normal use cases where there'd be any cost benefit.
It's a cute toy right now, but you can tell an LLM that it's an HTTP server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.
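For anyone who wants to picture the plumbing, here is a minimal sketch of that idea. To be clear about what is assumed: `fake_llm` is a hypothetical stand-in that fakes a response so the example runs offline, and `SYSTEM_PROMPT`, `build_prompt` and `serve` are names invented for this sketch. A real setup would swap `fake_llm` for an actual model call and send whatever text comes back straight down the socket.

```python
import socket

# Assumed instruction text; a real prompt would need more care.
SYSTEM_PROMPT = (
    "You are an HTTP/1.1 server. Reply to the request below with a complete "
    "raw HTTP response: status line, headers, blank line, then an HTML body."
)


def build_prompt(raw_request: str) -> str:
    """Hand the model the raw request bytes (decoded) after the instruction."""
    return f"{SYSTEM_PROMPT}\n\n{raw_request}"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a raw HTTP response."""
    # The request line looks like "GET /some/path HTTP/1.1".
    request_line = prompt.split("\n\n", 1)[1].splitlines()[0]
    path = request_line.split(" ")[1]
    body = f"<html><body><h1>Page for {path}</h1></body></html>"
    return (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(body.encode('utf-8'))}\r\n"
        "Connection: close\r\n"
        "\r\n"
        f"{body}"
    )


def serve(host: str = "127.0.0.1", port: int = 8080, generate=fake_llm) -> None:
    """Toy loop: one recv per connection, no request validation, no timeouts."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _addr = server.accept()
            with conn:
                raw_request = conn.recv(65536).decode("utf-8", errors="replace")
                if not raw_request:
                    continue
                # Whatever text the generator returns is sent back verbatim.
                conn.sendall(generate(build_prompt(raw_request)).encode("utf-8"))
```

Calling `serve()` and pointing a browser at `http://127.0.0.1:8080/anything` exercises the loop. With a real model behind `generate`, nothing guarantees well-formed headers or a correct `Content-Length`, which is part of why it stays a toy for now.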
I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it were free then sure, faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?
I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more to you than the 3x cost.
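To put a rough number on "pays for itself", here is a small sketch. Only the 10x speed and 3x cost ratios come from the discussion above; the function name, the per-task price and the task duration are made-up inputs. The idea is that the faster tier is worth it once the value of the time you save exceeds the extra spend.

```python
def break_even_value_per_hour(base_cost: float, base_seconds: float,
                              speedup: float = 10.0,
                              cost_multiplier: float = 3.0) -> float:
    """Hourly value of your time above which the faster tier pays for itself."""
    extra_cost = base_cost * (cost_multiplier - 1)              # extra spend per task
    saved_hours = base_seconds * (1 - 1 / speedup) / 3600       # time saved per task
    return extra_cost / saved_hours


# Assumed example: a task that costs $0.50 and takes 10 minutes on the slow tier.
print(f"${break_even_value_per_hour(0.50, 600):.2f}/hour")  # → $6.67/hour
```

Under those assumptions the break-even is low, but it climbs quickly when the slow tier is already fast: the time saved per task shrinks toward zero while the extra cost does not, which is the "faster version of something that is already fast" problem.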
It will go much faster.