[-]

I think it's often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable to agents in the same context?" Human randomness in resume screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc. It's unclear to me how effective those would be for agents (honestly, it was unclear how effective they were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.

by castlecrasher2 1 day ago |

[-]

It may seem trite, but the point is that if separate humans were assigned the same task the LLM was given here, the results would be similarly non-deterministic.

by spwa4 1 day ago |

[-]

Indeed: LLMs do tasks that would otherwise be assigned to humans. So when pointing out deficiencies in LLM performance, that performance should be compared to the alternative, which also isn't perfect.

by smusamashah 1 day ago |

[-]

On the other hand, we expect computers to be consistent. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs run on computers and should be fairly consistent too.

by vidarh 1 day ago |

[-]

And this lies at the heart of the problem.

We expect computers to be consistent despite running programs that are not designed to be consistent.

This despite the fact that we have lots of experience with programs running on computers that produce wildly inconsistent outputs.

But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
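To make the distinction concrete, here is a minimal sketch of why a sampling program is inconsistent by design. It assumes a toy three-token vocabulary with made-up scores; `softmax`, `pick_token` and `toy_logits` are hypothetical names for illustration, not any real LLM API. Greedy selection (temperature 0) behaves like the calculator, while sampling at a positive temperature only repeats itself if the random seed is fixed.

```python
import math
import random


def softmax(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Return the index of the chosen token.

    temperature == 0 means greedy: always take the highest score.
    temperature > 0 means draw at random from the softmax distribution.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for a toy three-token vocabulary (an assumption).
toy_logits = [2.0, 1.5, 0.5]

# Greedy: the same input gives the same output on every run.
greedy_picks = {pick_token(toy_logits, 0, random.Random()) for _ in range(100)}

# Sampling: repeatable only because both generators share a seed.
run_a = [pick_token(toy_logits, 1.0, random.Random(seed)) for seed in range(20)]
run_b = [pick_token(toy_logits, 1.0, random.Random(seed)) for seed in range(20)]
```

Here `greedy_picks` contains a single index, and `run_a` equals `run_b` only because the seeds match; with unseeded generators the sampled picks would vary from run to run.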

[-]

> This despite the fact that we have lots of experience of programs running on computers that produces wildly inconsistent outputs.

The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.

by vidarh 1 day ago |

[-]

The average user is familiar with games.

[-]

Clocks too.

by newswasboring 1 day ago |

[-]

Yeah, but daily tools have lots of complexity which appears as non-determinism (if we are thinking only about UX, not actual determinism). For example, try moving an image in a Word doc. I have been using MS Word my entire life, it seems, and still don't know what the rules are lol.

[-]

You're using a mouse? I have no problem getting reliable output from reliable input through the keyboard.

by miki123211 1 day ago |

[-]

What's even worse, different humans have different weights.

If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until the twins are two completely different people, trained on completely different "data mixtures".

[-]

> What's even worse, different humans have different weights.

Far worse would be different humans having the same weights.

by thisisit 1 day ago |

[-]

The same person is not going to give you three different answers within a span of minutes, especially when nothing has fundamentally changed. People might or might not update their views depending on their biases.

by rkuodys 1 day ago |

[-]

I'm pretty sure personality tests are designed specifically because a single person can hold fundamentally different (or conflicting) beliefs about himself within a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie" - and both cannot be true for an average person.

by mnky9800n 1 day ago |

[-]

Test-retest reliability is a thing in psychometrics.

by spwa4 1 day ago |

[-]

[flagged]

by mnky9800n 1 day ago |

[-]

There is evidence that children will oscillate between understanding and not understanding while learning topics. Philip Sadler at Harvard published about this, but I can't find the paper I'm thinking of on his Google Scholar. Too many papers!

But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. It is assumed that people have some static knowledge of a topic in this scenario.

If you want to consider a statistical examination of how people answer tests and how we assess knowledge and other things in people through surveying, you can read about item response theory and Rasch analysis.
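As a rough illustration of the two ideas above, here is a minimal sketch assuming the simplest versions of each: test-retest reliability estimated as the Pearson correlation between scores from two administrations of the same test, and the one-parameter Rasch model for the probability that a person answers an item correctly. The function names and the example scores are made up for illustration.

```python
from math import exp, sqrt
from statistics import mean


def pearson(first_scores, second_scores):
    """Pearson correlation between two equally long lists of scores."""
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_1, mean_2 = mean(first_scores), mean(second_scores)
    covariance = sum(
        (a - mean_1) * (b - mean_2) for a, b in zip(first_scores, second_scores)
    )
    spread_1 = sum((a - mean_1) ** 2 for a in first_scores)
    spread_2 = sum((b - mean_2) ** 2 for b in second_scores)
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("scores must not all be identical")
    return covariance / sqrt(spread_1 * spread_2)


def rasch_probability(ability, difficulty):
    """Rasch model: chance of a correct answer given ability and item difficulty.

    Both values are on the same logit scale; when they are equal the
    probability is exactly 0.5.
    """
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


# Hypothetical scores for five people who took the same test a month apart.
scores_month_1 = [12, 15, 9, 20, 17]
scores_month_2 = [13, 14, 10, 19, 18]
reliability = pearson(scores_month_1, scores_month_2)
```

A `reliability` value close to 1 would suggest people answer consistently across the two sittings; a value near 0 would suggest the test (or the people) are not stable over time.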

by cyanydeez 1 day ago |

[-]

a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.

by 1 days ago|

[-]

deleted

by ThrowawayR2 1 day ago |

[-]

That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803

by WhrRTheBaboons 1 day ago |

[-]

how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks.

by 1 days ago|