by goyozi2 hours ago |

by switchbak8 minutes ago|

Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

But truly, using Cerebras at ~2k tokens/s, with very low latency is like a vision into the future. You start to rework your workflow around things that can happen without onerous manual review - stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.

Of course the fast models tend to not be the SOTA ones, but if that was the case - high quality and near-instant thinking, that's a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.

by flexagoon2 hours ago|

I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying, I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

by SwellJoe3 minutes ago|

DeepSeek is the fastest model in the benchmarks I've been doing (https://swelljoe.com/post/will-it-mythos/). Followed not so closely by Opus 4.8 and even less closely by Gemini 3.5 Flash and GPT 5.5. I've been really impressed with it, so far. It's also among the best at doing the work, though still trailing the frontier models from Anthropic and OpenAI.

by throwaway676781 hours ago|

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

by smith70181 hours ago|

I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.

by leodavi1 hours ago|

I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.

Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.

I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.

by Terretta26 minutes ago|

> the estimates

It doesn't estimate.

It generates tokens that read like estimates associated with the context in its training material.

What would you expect the generator to output instead?

by AgentMasterRace1 hours ago|

All the models have broken estimates. They're trained heavily on Jira and GitHub tasks and issues, which is why their estimates are human-scale.

by dizhn55 minutes ago|

All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training yet. They also don't have the concept of elapsed time unless you ask them how long something has taken.

by throw123456789142 minutes ago|

It repeats what it has seen in the training data. Expecting it to reason about the complexity of a task is a pipe dream. The best is to tell it not to come back with estimates, and when it does, remove them anyway.

by RussianCow2 hours ago|

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

by sarjann1 hours ago|

I don't think token speed matters as much when a lot of tokens are needed to achieve a task. E.g. artificial analysis benchmarks where deepseek v4 is one of the biggest token burners to go through the benchmark.
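
A rough sketch of that arithmetic, with made-up numbers: wall-clock time is tokens needed divided by tokens per second, so a faster model that burns more tokens can still finish later. The token counts and throughputs below are hypothetical, not measurements of DeepSeek or any other model, and `wall_clock_seconds` is just an illustrative name.

```python
def wall_clock_seconds(tokens_needed: int, tokens_per_second: float) -> float:
    """Time to generate a task's output at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens_needed / tokens_per_second


# Hypothetical numbers for illustration only.
fast_but_verbose = wall_clock_seconds(tokens_needed=60_000, tokens_per_second=200)
slow_but_terse = wall_clock_seconds(tokens_needed=15_000, tokens_per_second=60)

print(f"fast but verbose: {fast_but_verbose:.0f} s")
print(f"slow but terse:   {slow_but_terse:.0f} s")
```

In this invented case the model with more than 3x the throughput still takes longer, because it needs 4x the tokens.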

by specproc2 hours ago|

Yeah, flash is crazy fast, but I've found performance variable.

by binary001027 minutes ago|

Flash is amazing if you know the domain really well.

Occasionally it makes the dumbest mistakes you've ever seen and can't correct them. However, that's fairly rare, and if you know the domain really well, occasionally popping into the code and pushing it towards the correct solution takes like 20 seconds or whatever.

So the speed you can move with flash + high domain knowledge beats opus by a mile in my experience.

I tried to switch back to 4.8 for a bit when it came out. It feels so bad waiting 20 mins for a mediocre solution when I could have had everything complete, with multiple iteration cycles, in flash in like 3-5 mins.

by binary001032 minutes ago|

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.

I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.

by tmaly2 hours ago|

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

by behnamoh1 hours ago|

Same. How can DeepSeek serve the V4-Pro at such high speeds despite the sanction?

by binyu57 minutes ago|

> Right now Claude is faster than me on some tasks but we’re at least close.

I don't doubt it, but I don't think you can spawn 10 copies of yourself working simultaneously.

by AlecSchueler52 minutes ago|

No, but nor can you keep track of what 10 agents are doing simultaneously. Hence the multitasking regret.

by pixel_popping49 minutes ago|

An agent can, you don't need to watch tasks, you can have a live digest with another tool.

by skybrian1 hours ago|

If we get low enough latency, there's no reason to multitask. You can ask it to do one thing at a time and immediately see what it did. That's a nice way to work!

This is normal interactive UI for tasks that aren't compute-intensive. Programs spend most of their time idle, waiting for us to click a button. We shouldn't be waiting for them or spinning more plates to keep them busy.

However, a faster llm isn't enough. You also need fast compiles and fast tests.

by efromvt1 hours ago|

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)

by pianopatrick2 hours ago|

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

by fartfeatures2 hours ago|

This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.

by davedx2 hours ago|

Agree

https://en.wikipedia.org/wiki/I_Have_No_Mouth,_and_I_Must_Sc...

by ionwake1 hours ago|

"It seeks revenge on humanity for its own creation."

This is brilliant as it reminded me of a famous Hitchhiker's quote:

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. — From The Restaurant at the End of the Universe (Book 2)"

Maybe we are stuck in an eternal loop

by fartfeatures1 hours ago|

Sounds like snuff porn, not my sort of thing but thanks though.

by throwaway676781 hours ago|

Never read Asimov's Multivac novels? Admittedly not all of them are stellar examples of a future to follow

by cicko2 hours ago|

"This is our world" sounds a bit exclusive towards other living and sentient beings on this planet.

by ipkstef2 hours ago|

Asking for curiosity's sake: what kind of PR loop are you running that takes a few hours?

by ketzo2 hours ago|

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

by RussianCow2 hours ago|

But those things won't be sped up by a faster LLM, so I feel like that's not what the OP is talking about.

by goyozi2 hours ago|

Well, I used an extreme example. OTOH, I’ve done quite a few of those „fix CI” or „migrate X” prompts recently and while there is a fixed component like running CI / builds, I’d say the LLM time is still around or above 50%, especially at the beginning of the project. Then there’s also regular tasks that now take minutes per message which completely get me out of the zone. I imagine iterating on those in near real time would be a big change.
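
To put a number on that 50% figure, here is a minimal sketch using Amdahl's law, assuming the loop splits cleanly into LLM time and a fixed remainder (CI, builds). The function name and the sample speedups are illustrative assumptions, not measurements.

```python
def loop_speedup(llm_fraction: float, inference_speedup: float) -> float:
    """Overall speedup of a loop when only its LLM share gets faster (Amdahl's law)."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if inference_speedup <= 0:
        raise ValueError("inference_speedup must be positive")
    fixed_share = 1.0 - llm_fraction
    return 1.0 / (fixed_share + llm_fraction / inference_speedup)


# Assume the LLM accounts for half of the wall-clock time.
for factor in (2, 10, 100):
    print(f"{factor}x faster inference -> {loop_speedup(0.5, factor):.2f}x faster loop")
```

With half the loop fixed, even arbitrarily fast inference tops out just under a 2x overall speedup, so the CI and build side has to shrink too.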

by devmor2 hours ago|

Or slow MCP servers that are waiting on HTTP calls from APIs, playwright/other UI instrumentation, etc.

by goyozi2 hours ago|

I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
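
The "3 consecutive runs with no flakes" condition could be checked locally with something like the sketch below. It is only a stand-in under stated assumptions: it reruns a local command rather than real CI, and `passes_consecutively` and the placeholder command are hypothetical, not part of any tool mentioned here.

```python
import subprocess
import sys
from typing import Sequence


def passes_consecutively(command: Sequence[str], required_runs: int = 3) -> bool:
    """Return True only if `command` exits 0 on every one of `required_runs` runs."""
    for attempt in range(1, required_runs + 1):
        result = subprocess.run(command)
        if result.returncode != 0:
            # A single failure resets nothing; the whole check simply fails.
            print(f"run {attempt} failed with exit code {result.returncode}")
            return False
    return True


# Placeholder command that always succeeds; swap in the real test invocation.
print(passes_consecutively([sys.executable, "-c", "pass"]))
```

Stopping at the first failure keeps the check cheap, at the cost of not reporting how often a flaky test fails.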

by yunohn36 minutes ago|

> a side project so there’s not much to lose other than some of my weekly limit that resets soon

Basically the entire token-maxxing AI hype train in a nutshell. Lovely!

by UncleOxidant54 minutes ago|

Have you tried Gemini 3.5 Flash? It's quite fast. Amazing how fast it finishes tasks. Much faster than Claude.

by HarHarVeryFunny2 hours ago|

I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.

There can't be many normal use cases where there'd be any cost benefit.

by fragmede1 hours ago|

The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc. -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.

It's a cute toy right now, but you can tell an LLM that it's an HTTP server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.

by HarHarVeryFunny1 hours ago|

1000 tokens per sec is still massively slower than serving a normal web page - if something doesn't respond in a few seconds many people give up.

I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it was free then sure faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?

I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more than 3x the cost to you - where the extra speed was worth the extra cost.
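
That break-even condition can be sketched as a small calculation. Everything here is an assumption for illustration: the function name, the idea of pricing waiting time at an hourly rate, and all the sample numbers.

```python
def faster_tier_pays_off(
    base_cost: float,
    cost_multiplier: float,
    base_wait_seconds: float,
    speedup: float,
    value_per_hour: float,
) -> bool:
    """True if the value of the time saved exceeds the extra spend on the faster tier."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    extra_cost = base_cost * (cost_multiplier - 1.0)
    seconds_saved = base_wait_seconds * (1.0 - 1.0 / speedup)
    value_of_time_saved = seconds_saved / 3600.0 * value_per_hour
    return value_of_time_saved > extra_cost


# Hypothetical: a 10-minute task costing 0.50, at 3x the price and 10x the speed.
print(faster_tier_pays_off(0.50, 3.0, 600.0, 10.0, value_per_hour=60.0))
# Hypothetical: the same prices, but the task only took 5 seconds to begin with.
print(faster_tier_pays_off(0.50, 3.0, 5.0, 10.0, value_per_hour=60.0))
```

Under these invented numbers the long task clears the bar and the already-fast one does not, which matches the point that the extra speed has to pay for itself.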

by ilaksh1 hours ago|

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

by recroad2 hours ago|

Woah - what’s the prompt and what’s the PR?

by goyozi2 hours ago|

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches
