> Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.

There was always some of this in the tech world, long before LLMs came along.

I've sat in so many meetings when decisions were made based on "that's what _slightly more prestigious company_ does" rather than objective measurable criteria. (And the evidence that the thing in question wasn't universally followed by _slightly more prestigious company_ carried surprisingly little weight).

reply
Absolutely I agree there has always been some cargo culting going on; that's true of all process-oriented businesses.

But people are now individually acting this way at their desks on an hour-by-hour basis. LLMs make cargo-culting inevitable because they are inscrutable and opaque.

There is always this sense in the LLM-proponent world that LLMs are at any moment as bad as they are ever going to be; line goes up.

But it seems clear that the gap between perceived and measurable productivity is still likely spent in poking entrails with a stick.

We are so used to probabilistic tools that have significant setup time before they become valuable and save us loads of time that we're at risk of repeatedly writing off that setup time without seeing the rewards, believing that one day it will actually work out that way.

(Which is most recognisable from the early JS frontend frameworks era.)

Meantime here we have an article that shows that a thing (longer context windows) that people thought would functionally solve a problem so we would get the value from all that setup does not, in fact, very meaningfully kick it down the road, and the comments are still full of entrails-and-stick work.

reply
The arbitrary and non-deterministic nature of LLM workflows gives me full on ick. As an old embedded/systems guy I have always prioritized determinism and repeatability in my workflows.

But damn, agents are amazing and I'm enjoying being a "thought process designer". I'm not going back. Even if AI development stops today my career will never be the same.

reply
I felt the same way about the non-determinism but realized it can be really beneficial to have a machine that can fairly reliably turn non-determinism into determinism.

I’m working on a tiny agent harness at home to learn and the process of taking human speech and turning it into agent tool calls that output something generally deterministic depending on how the tool is defined is so interesting.

One of the big takeaways is you really only have to rely on the non-determinism<->determinism translation layer once when you switch between the two domains. You can obviously rely on it more if you want, and that’s probably faster because determinism is hard, but you don’t need to do that.

reply
That sounds very cool. It’s sometimes baffling that LLMs can’t use tools reliably. Serena and Semble both require some arcane instructions to coerce Claude Code into compliance. Just stop trying to pipe nonsense commands into each other, man!
reply
I think it makes sense when you dig into why that non-determinism conversion is so hard.

For voice-related things you have a lot of turn-of-phrase scenarios that make no sense unless you already know them. Phrasing like “Put Larry on the horn” makes sense to someone familiar with old lingo for phone calls. Someone else might think of a war horn, someone else a music class.

All of those are wildly different situations. It’s not hard to see how one oops between two non-deterministic things can quickly go off the rails.

The fact we can get away with so much non-determinism->non-determinism recursion is frankly amazing when you realize how easy it is to imprecisely describe what it is you’re thinking.

reply
The vagaries of speech and its meaning are surely hard to parse. But! How many ways must a model invent to run `tsc`?

    npx tsc
    bash tsc
    bash npx tsc
    npm run build
    …
I’m not an expert at all on the subject matter, but is it impossible to train a model that calls tools in a (quasi-)deterministic way?
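
One way a harness can sidestep this, sketched under assumptions: instead of letting the model write shell strings, expose a named tool whose command line is fixed in code, so the model only picks a name. The names `TOOLS` and `resolve_tool_call` are hypothetical, and `npx tsc --noEmit` is just one plausible choice of canonical command, not something any particular agent actually does.

```python
import shlex

# Hypothetical registry: each tool name maps to exactly one fixed argv.
# The model only picks a name; it never composes the command itself.
TOOLS = {
    "typecheck": ["npx", "tsc", "--noEmit"],
    "build": ["npm", "run", "build"],
}

def resolve_tool_call(name: str) -> list[str]:
    """Return the canonical argv for a tool name, or raise on anything else."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}; allowed: {sorted(TOOLS)}")
    return list(TOOLS[name])  # copy so callers cannot mutate the registry

if __name__ == "__main__":
    # Whatever phrasing the model used, the harness runs the same command.
    print(shlex.join(resolve_tool_call("typecheck")))
```

This does not make the model deterministic; it only narrows the non-deterministic part to choosing among a few names, and anything outside the list is rejected rather than run.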
reply
It's like working with humans.

Can't help but feel like a lot of people who are deep in IT made it there because they hated working with humans.

reply
This has always been a thing with IT advice, though - the more complex a system and the outcome, the harder it is to clearly define "better" or "worse". Add in the fact that LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice.

Heck, even the 'benchmarks' are mostly somebody's attempt to crystallize their vibes with varying amounts of success.

reply
> LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice

Have you ever tried doing evals on moderately complex but bounded tasks?

I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc. as well as customizing my Pi tools. What I found interesting was that despite LLMs being non-deterministic, for a given toolset and prompt the results were highly consistent for a given eval, across multiple models. (I tested at the time using GPT 5.4 mini, 5.5, 5.3 codex, and Gemini 3 flash, initially running sets of 5 evals on each task but, once I realized how consistent the results were, dropping to sets of 3.)

Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific toolcalls went down but the number of model turns and overall context use went up.
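
A rough sketch of the kind of repeated-run eval described above. It assumes a hypothetical `run_task` callable that performs one attempt and returns pass or fail; the stand-in task below is a seeded random function, not a real model call, and `repeat_eval` is an illustrative name.

```python
import random
from collections import Counter
from typing import Callable

def repeat_eval(run_task: Callable[[], bool], runs: int = 5) -> dict:
    """Run one eval task several times and summarise how consistent it was."""
    outcomes = [run_task() for _ in range(runs)]
    counts = Counter(outcomes)
    passes = counts[True]
    # Agreement: share of runs that match the most common outcome.
    agreement = max(counts.values()) / runs
    return {
        "runs": runs,
        "passes": passes,
        "pass_rate": passes / runs,
        "agreement": agreement,
    }

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the stand-in is repeatable
    # Hypothetical stand-in for "call the model, then check the result".
    flaky_task = lambda: rng.random() < 0.9
    print(repeat_eval(flaky_task, runs=5))
```

If agreement stays near 1.0 across sets, dropping from 5 runs to 3 loses little; if it does not, the small sets are hiding variance.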

reply
Gardening advice. Better analogy.
reply
> Any shared sense of rigour is just completely torpedoed by the LLM world

Consider that this shared sense of rigour you have in mind is illusory, and LLMs and their context struggles are simply revealing this. I see precious little rigour in any of the 'tech' world I've lived in for decades. The tools proliferate, paradigms emerge and die and reemerge, and whatever stick you consider using to measure any of it has competitors with different units. Past the physics of power and signaling, and the prevailing cost of a silicon wafer, we are almost all, relative to a small number of much older disciplines, muddlers of various degrees of skill.

I've found dealing with context limits relatively easy: specify and confine. LLMs need clear specifications and strong guidance to produce good work.

But that's just my current muddling take on the practice. Perhaps, 90 days from now, even this burden will be gone, and a simple prompt will generate world class operating systems, programming languages and a formal basis in mathematics for both.

reply
Yep, if anything LLMs revealed how little rigour there was to begin with. If you want a more obvious example: think of documentation.
reply
I feel your frustration for sure and agree to a large extent. Any attempts I’ve made to formalize LLM-based workflows have left me dismayed, again, that no one seems to have any real idea of how or why certain things work or don’t work. So I just go back to /plan and “write this down in a markdown document for posterity before we iterate on the implementation”, hoping that maybe next month there might be something a little more rigorous with some kind of rational backing.

> Have you tried cleaning your context with dawn dish soap

I don’t do the glue stick thing at all because I don’t need to, but Dawn really seems to do a good job at getting my Bambu build plate working again. I didn’t seek it out specifically, I already had some for doing dishes. IPA hadn’t worked so I tried Dawn and it has gotten me back having prints stick multiple times now. Not quite up to N=30 yet.

reply
What sense of rigour is there going to be in a field (LLM usage as a user) where models, context sizes, tooling and, broadly, "rules" (scary quotes) change every few weeks? There is literally no chance to take a scientific approach to anything; the churn is too high. There are papers about model XYZ v 12345 from a few months ago that are already old because there is model ABC on version 54321 that addresses half of the issues shown in the paper and adds 3 new problems.
reply
With benchmarks, you can re-run them after a change. A measurement in a paper will go out of date quickly unless turned into a benchmark.
reply
If you want my best guess: I think large context windows cannot be trained properly. There's not enough material, nor computing power, to train such large networks (to the same degree as small windows).
reply
I feel this is a sort of inverse inspection paradox (the paradox that if you sample waiting time in a process, you’re more likely to sample a larger value).

The LLM providers fine tune the models with some kind of information retrieval tasks, but to do so you must provide some non relevant context to bootstrap the session for the long context tasks.

It would be very easy to do this in ways that train the sequence model to treat early history as noisier than it really is, or to weaken its relationship to late context.

You’re also probably stacking more contexts together with long contexts (start with task A, then detour to solving B and C before you can complete A).

The number of training sequences probably decays superlinearly with length, creating far fewer samples at long lengths during training.

reply
It's not just you! Here's a lovely quote from an influential paper, "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." I think people went through a similar phase with steam engines. Lots of practical engineering and heuristics to explain what works, before the emergence of a solid theoretical foundation (thermodynamics) to explain why.

https://arxiv.org/pdf/2002.05202

reply
This lack of rigour feels a lot like “did you try restarting the computer? Most of the time, others tried restarting the computer and it works”
reply
first of all, LLM-assisted coding is less than 3 years old. 3 years ago all we had was GPT-4 with 8192 token context, which wasn't enough for most things.

and second of all...

>Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.

what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.

reply
>what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.

I don't think OP is claiming that prior to LLM coding everything in the software development world was super rigorous (I assume that's effectively what you mean with the "rose-tinted glasses" comment). But rigor was actually possible and in a deterministic way too, which is fundamentally impossible with LLMs. You can build all kinds of guardrails and processes around LLMs that make it somewhat approach rigor again, but it's still fundamentally based on a bunch of statistical probabilities instead of deterministic, repeatable results.

All of the methods I see to mitigate the fundamental and inherent issues of LLMs seem roughly equivalent to the kind of crap you see in astrology groups or palm reading etc. You need Venus and Mercury to be in alignment while Mars is retrograde if you want to be able to get the right results from your token predictor.

reply
Astrology? And I thought I was being overly harsh with the 3D printing comparison ;-)
reply
Aren’t human coders non-deterministic? There’s no guarantee two people with otherwise identical levels of experience will always write identical code.

Any software engineering practice that had enough review and feedback to work with humans should work more or less the same with AI coders.

It’s when someone tries replacing an entire team or an entire process with a single prompt that they get in trouble.

reply
>Aren’t human coders non-deterministic?

Sure, but LLMs are non-deterministic in ways that no sane human ever would be. See the "Is it better to drive or walk to the carwash" scenario from a few months ago as one of many, many examples. Or a personal example I encountered just a week ago: I asked Claude (Opus 4.8 in case any of the "you aren't using the latest model that totally fixes that issue" types are interested) to convert a bunch of DB calls that currently use raw ADO.NET calls to use Dapper instead.

The projects in this repo were on .NET 4.8.1 and were still using the older format for the .csproj file instead of the newer (and far better) "SDK-style" format that Microsoft introduced a few years ago. It tried to use the dotnet CLI to add references to Dapper, even though the older format of .csproj doesn't work with that. The dotnet CLI returned errors about trying to add the package references for Dapper, which Claude completely ignored while continuing to try and convert the ADO.NET calls to Dapper. And at the end it tried building the project, which of course failed, and then it confidently informed me that the conversion had been completed successfully and that the build completed successfully and all tests were passing successfully, even though the output from the build it had done immediately prior clearly told the LLM otherwise.

A real human, despite being non-deterministic, would have caught the issue at multiple stages. They would have seen the error when trying to add the reference. If they ignored that then they would have seen the red squiggly lines all over the (deterministic) IDE telling them there was something wrong, along with autocomplete for Dapper calls not working. And if they continued to ignore those and managed to keep going anyways, they would have clearly seen that the build failed, with tons of errors specifically about references to Dapper failing to resolve. An LLM keeps going on its merry way in ways that effectively 0 humans would.
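
For what it's worth, the deterministic half of that check is cheap to put in a harness. This is a minimal sketch, assuming each step is a plain command whose exit code can be trusted; `run_steps` is a hypothetical name and the two commands are stand-ins for "add package reference" and "build", not real dotnet invocations.

```python
import subprocess
import sys

def run_steps(steps: list[list[str]]) -> tuple[bool, list[str]]:
    """Run commands in order; stop at the first non-zero exit code."""
    log = []
    for argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        log.append(f"{' '.join(argv)} -> exit {result.returncode}")
        if result.returncode != 0:
            # Do not continue, and do not let anything report success.
            return False, log
    return True, log

if __name__ == "__main__":
    # Stand-ins for the real steps; the first one fails on purpose.
    fake_add_reference = [sys.executable, "-c", "import sys; sys.exit(1)"]
    fake_build = [sys.executable, "-c", "print('build ok')"]
    ok, log = run_steps([fake_add_reference, fake_build])
    print(ok, log)
```

The success flag comes from the exit codes, not from the model's summary, so a failed step cannot be reported as a completed conversion.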

reply
TBD on if the calculator can properly review and participate in the feedback loop with itself.

They also don't learn, so they never get less unpredictable. You can't give the senior robot the production keys and expect it won't delete prod.

reply
Programming has already become this way. Opinions about different languages and architectures are taste, or sometimes even just vibes. Few try to actually ask “can I quantify whether microservices or monoliths are better in terms of either maintainability or scaling?”

A lot of this is a result of systems having long ago exceeded the complexity threshold of things people can hold in their heads. There are too many layers, subsystems, languages, APIs, all glued together. Attempts at radical simplification fail because each of those layers and subsystems has features or behaviors someone needs, and a lot of it isn’t even documented.

AI takes this to the extreme. I’ve already learned that certain models have “personalities.” Some are more likely to go with you on magical journeys into hallucination while others are more critical. Some are better at detail while others seem better at abstraction but fall over on detail. Some are better instruction followers. All their quirks are complex and the systems themselves are impossible to understand.

Computer systems are becoming organic, biological.

reply
"Feeping creaturism" has always been a problem, for sure.

But those technologies are layers, and there are reliable things that sometimes bubble across the boundaries — type hints, better code patterns to trigger compiler optimisation, interesting tricks with key column selection — and someone with expertise from that layer below can explain why, and their advice will always work in situations that are sufficiently similar.

You are right about AI personalities. Obvious even with the open weights models. Gemma and Qwen write code and documentation like people from different cultures. Because I guess they are a bit like that.

reply
They're almost literally "from different cultures" - because of how post-training does things.

All "personality traits" within an LLM are entangled. So when you mid-train or post-train on ESL texts, or run RLHF using people from a given culture, you risk bleeding some of the related cultural traits into the LLM itself. A lot of the resulting "personality" is downstream from different AI teams picking different datasets and training signals.

RLAF is more of a "funhouse mirror" distortion - it takes existing traits and twists them, sometimes amplifies them to comical extremes. Weird can become weirder. A verbal tic can become a style signature. That's part of the reason why AI writing has changed so dramatically from the GPT-4 era to now.

reply
It's in the hype train's interest to keep the actual value unknowable. If you quantify what you're paying for then the FOMO is greatly reduced.
reply
> But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.

It will always be this way going forward. Everyone thinks differently about problems. In the past we had experts and only they could do the work at a high level. But now we have many people that are cranking out expert level solutions without much knowledge. Worrying about the minutiae is a dying trend.

Edit: I see I touched a nerve. But that is how it is now. You can't fight reality.

reply
Your argument is that superstition is the way of the future and technical rigor no longer applies.

Because that's what OP is talking about. Superstition presented as factual advice instead of the technically rigorous and scientific fact that already exists.

You're being downvoted because you don't understand this fact, or indeed understand what you're saying at all.

I'll spell it out for you: technically and scientifically rigorous facts do actually exist, even in regards to LLMs. We can, in fact, obtain scientific and objective facts about how LLMs perform. It can be rigorously proven that certain context habits affect certain tasks positively or negatively. Your argument is that none of this matters more than superstition. And you're surprised that arguing to a room full of engineers and scientists that science is dead and superstition is the one true way forward gets you a negative response.

reply
There aren't any good facts that exist regarding LLMs. It's a black box. Also, do not presume to know what I understand or don't understand from one comment.

> I'll spell it out for you

You are a rude and crude individual. I am not interested in discussing anything further with you.

reply
It's a black box, but you can run tests to quantify the behaviour and establish, for example, that a certain model is X% more likely to give a certain behaviour.
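
As a sketch of what "quantify the behaviour" can look like in practice: count how often each model shows the behaviour over repeated trials and put an interval around the rate. This uses the standard Wilson score interval; the function name and the counts in the usage example are made up for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

if __name__ == "__main__":
    # Made-up counts: how often each model showed the behaviour in 50 trials.
    for name, hits in [("model A", 42), ("model B", 30)]:
        low, high = wilson_interval(hits, 50)
        print(f"{name}: {hits}/50, interval {low:.2f} to {high:.2f}")
```

If the two intervals overlap heavily, the sample is too small to say one model is more likely to show the behaviour than the other.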
reply
At some level, we've always delegated worrying about the minutiae to someone who builds the tool that is one or two levels below.

I usually don't have to worry about compiler optimisations because compiler experts do that; sometimes they appear in a thread about code and say "compiler guy here — if you write your code like this the compiler can optimise it".

And that person will be provably right (or wrong), in that context. And it'll be the same each time you run the test!

I just… ehh. You make a good point and I worry you are not wrong. It's all so different.

I like my 3D printing analogy much more than I wish I did.

reply