I mostly agree if "a human" is just any person we pluck off the street. What I still see with some regularity is the models (right now, primarily Opus 4.6 through Claude Code) making mistakes that humans:
- working in the same field/area as me (nothing particularly exotic, subfield of CS, not theory)
- with even a fraction of the LLM's declarative knowledge about the field
- with even a fraction of the frontier-LLM abilities suggested by their performance in mathematical/informatics Olympiads
would never make. Basically, these are errors I'd never expect to see from a human coworker (or from myself). I don't yet consider myself an expert in my subfield, and I'll almost certainly never be a top expert in it. Often the errors present to me as just "really atrocious intuition." If the LLM ran with some of them, they would cause huge problems.
In many regards the models are clearly superhuman already.
I wasn't talking about the average person there, but rather about those who could also craft the high-undergrad-to-low-grad-level explanations I referred to.
> This has not been a remotely credible claim for at least the past six months
Well, it's happened to me within the past six months (actually within the past month), so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (you can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me, yet it contained mistakes you simply could not make if you had a working mental model of the subject at the level of the explanation provided (IMO, obviously). Imagine a toddler spewing a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip-up but a fundamental lack of comprehension. Like I said, it's really weird, and I'm not sure what to make of it or how to properly articulate the details.
I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.