Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?
reply
I think it is often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable to agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc. It's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.
reply
It may seem trite, but the point is that if separate humans were assigned the same task the LLM was given here, the results would be similarly non-deterministic.
reply
Indeed: LLMs do tasks that would otherwise be assigned to humans. So when pointing out deficiencies in LLM performance they should be compared to the alternative, which also isn't perfect.
reply
On the other hand, we expect computers to be consistent. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs run on computers and should be fairly consistent too.
reply
And this lies at the heart of the problem.

We expect computers to be consistent despite running programs that are not designed to be consistent.

This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
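
To make that concrete, here is a rough sketch of why a sampling program is inconsistent by design. It is a toy, not how any particular LLM is implemented: the `logits` values, the function names, and the temperature of 1.0 are all made up for illustration. Always picking the top-scoring option gives the same answer every run, while sampling from the same scores does not, unless you also pin the random seed.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Calculator-like behavior: always return the index of the top score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_pick(logits, temperature, rng):
    """Sampling behavior: draw an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate answers (illustrative only).
logits = [2.0, 1.5, 0.5]

# Greedy selection yields one distinct answer across 1000 runs.
greedy_runs = {greedy_pick(logits) for _ in range(1000)}

# Sampling from the same scores yields several distinct answers.
rng = random.Random()
sampled_runs = {sampled_pick(logits, 1.0, rng) for _ in range(1000)}

# Fixing the seed makes the sampled sequence repeatable again.
seq_a = [sampled_pick(logits, 1.0, random.Random(7)) for _ in range(5)]
seq_b = [sampled_pick(logits, 1.0, random.Random(7)) for _ in range(5)]

print("greedy answers:", greedy_runs)
print("sampled answers:", sampled_runs)
print("same seed, same sequence:", seq_a == seq_b)
```

The variation comes from the sampling step, which is a deliberate design choice rather than a hardware fault.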

reply
> This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.

reply
The average user is familiar with games.
reply
Clocks too.
reply
Yeah, but daily tools have lots of complexity which appears as non-determinism (if we are thinking only about UX, not actual determinism). For example, try moving an image in a Word doc. I have been using MS Word my entire life, it seems, and I still don't know what the rules are lol.
reply
You're using a mouse? I have no problem getting reliable output from reliable input - through keyboard.
reply
What's even worse, different humans have different weights.

If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".

reply
> What's even worse, different humans have different weights.

Far worse would be different humans having the same weights.

reply
The same person is not going to give you three different answers within the span of minutes, especially when nothing fundamental has changed. People might or might not update their views depending on their biases.
reply
I'm pretty sure personality tests are designed the way they are precisely because a single person can hold fundamentally different (or conflicting) beliefs about himself within a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie", and both cannot be true for an average person.
reply
Test-retest reliability is a thing in psychometrics.
reply
There is evidence that children will oscillate between understanding and not understanding while learning topics. Philip Sadler at Harvard published about this, but I can't find the paper I'm thinking of on his Google Scholar. Too many papers!

But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. It is assumed that people have some static knowledge of a topic in this scenario.

If you want to consider a statistical examination of how people answer tests, and how we assess knowledge and other things in people through surveying, you can read about item response theory and Rasch analysis.

reply
A studied example is sampling judicial decisions before lunch and after lunch: judges are more lenient on a full stomach.
reply
That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803
reply
How did they account for sampling bias? A judge might leave easier cases for after lunch. People with control over their schedules usually ease themselves back into work after breaks.
reply