We expect computers to be consistent despite running programs that are not designed to be consistent.
This is despite the fact that we have plenty of experience with programs running on computers that produce wildly inconsistent outputs.
But for some reason, some people choose to assume LLMs should act like calculators rather than like any of those programs.
The average user has very little such experience. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.
If you train two different LLMs and change the data they "see" in batch n, that doesn't affect the data they see in batch n+1 or in any later batch. You can introduce "noise" into an LLM's training process, but that noise doesn't really compound.
Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until the twins are two completely different people, trained on completely different "data mixtures".
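To make the compounding argument concrete, here is a toy sketch in Python. It is not a model of real LLM training or of human development. It assumes a single number stands in for a learner's state, that a "fixed curriculum" learner is pulled toward a predetermined batch at each step, and that a "self-selected" learner seeks out more of whatever it already leans toward. The function name, the sine-wave batches, and the factor of 2 are all made up for illustration.

```python
import math


def final_gap(steps, lr, self_selected, nudge=0.01):
    """Gap between two learners that differ only by one small initial nudge.

    Toy assumption: each learner's state is a single float.
    """
    a, b = 0.0, nudge
    for t in range(steps):
        if self_selected:
            # Each learner picks its next experience based on its own state,
            # so the next "batch" differs between the two learners.
            target_a, target_b = 2 * a, 2 * b
        else:
            # Both learners see the same predetermined batch at step t.
            target_a = target_b = math.sin(t)
        a += lr * (target_a - a)
        b += lr * (target_b - b)
    return abs(a - b)


fixed = final_gap(steps=50, lr=0.1, self_selected=False)
chosen = final_gap(steps=50, lr=0.1, self_selected=True)
print(f"fixed curriculum gap: {fixed:.6f}")
print(f"self-selected gap:    {chosen:.6f}")
```

Under these assumptions the gap shrinks by a factor of (1 - lr) per step when the data is fixed and grows by a factor of (1 + lr) per step when each learner chooses its own data, which is the difference the comment above is pointing at.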
Far worse would be different humans having the same weights.
But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. This scenario assumes that people have some static knowledge of the topic.
If you want a statistical treatment of how people answer tests, and of how we assess knowledge and other traits through surveys, you can read about item response theory and Rasch analysis.
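For a rough idea of what the Rasch model says, here is a minimal Python sketch. In the standard formulation, the probability that a person answers an item correctly depends only on the difference between the person's ability and the item's difficulty, passed through a logistic function. The function name and the example values are illustrative only.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model.

    ability and difficulty are on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A person whose ability matches the item's difficulty has even odds.
print(rasch_probability(0.0, 0.0))

# The same person is less likely to get a harder item right.
print(rasch_probability(0.0, 1.5) < rasch_probability(0.0, 0.0))
```

Note that even with a fixed ability the model gives a probability, not a guaranteed answer, so some variation between two sittings of the same test is expected.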