Raw pre-training data includes plenty of conversations between professional builders, and some of those include estimates.
I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.
That said, it'll often say "2 days of work" and then complete the coding in 30 minutes. That's amusing, but afterwards I'll need to manually test, or send it to other people for review, or realize the agent only actually did half the work and I need to do a second pass (or a third, etc.), and then getting the feature in often does genuinely take two days.
It doesn't estimate.
It generates tokens that read like estimates associated with the context in its training material.
What would you expect the generator to output instead?
Sure, it cannot think like a human, but given its input, it should give a good statistical answer (approximating not how long the work actually takes, but how long a human would say it takes).
Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.
Logistics for getting to the car wash next door?
In the meantime, alas, no: we can see from actual prompts, sent directly or through sub-agents, and from the actual replies, that estimates remain LLM-generated.
Though this discussion could change that, because there is indeed a lot of special-casing and context stuffing going on, one of the oldest examples being today's date.
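To make the date example concrete, here is a minimal sketch of that kind of context stuffing in Python. The helper name `build_system_prompt` and the prompt wording are my own assumptions for illustration, not taken from any particular tool's actual prompt.

```python
from datetime import date
from typing import Optional


def build_system_prompt(base_instructions: str, today: Optional[date] = None) -> str:
    """Hypothetical helper: prepend the current date to the instructions.

    The model has no clock of its own, so the harness writes the date
    into the context before every request.
    """
    if today is None:
        today = date.today()
    # ISO format (YYYY-MM-DD) avoids day/month ambiguity.
    return f"Today's date is {today.isoformat()}.\n\n{base_instructions}"


if __name__ == "__main__":
    print(build_system_prompt("You are a coding assistant."))
```

The same pattern would apply to estimates: anything the harness wants to be reliable has to be written into the context like this, otherwise the model just generates it.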
• • •
I did read the Claude Code leak, and use pi, etc. So I disagree with your premise rather strongly. Today's "systems" remain, roughly, piles of markdown and context engineering wrapped in UI affordances, and behave very similarly today to how they did in 2024 for those already engineering context and delegating.
https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b...