> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT-4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).
edit: It looks like OP attached the wrong link to the paper!
The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352
But the link in OP's post points to (what seems to be) a completely unrelated study.
> All evaluations were done in March - August 2025.
Agreed - if I were a reviewer for LLM papers, not listing the versions and prompts used would be an instant rejection.
(Personally I think the lack of reproducibility comes back mostly to peer reviewers who haven't thought carefully enough about the steps they'd need to take to reproduce the work, and instead focus on the results...)
This points to (and everyone knows this) an incentive misalignment between the funders of research and the public. Researchers are caught in the middle.
There needs to be more public naming and shaming in science social media and in conference talks, but especially when there are social gatherings at conferences and people are able to gossip. There was a bit of this with Google's various papers, as they got away with figurative murder on lack of reproducibility for commercial purposes. But eventually Google did share more.
Most journals have standards for depositing expensive datasets, but that's a clear yes/no answer. Reproducibility is a very subjective question in comparison to data deposition, and must be subjectively evaluated by peer reviewers. I'd like to see more peer review guidelines with explicit check boxes for various aspects of reproducibility.
While this is sadly true, it's especially true when talking about things that are stochastic in nature.
LLM outputs, for example, are notoriously unreproducible.
Only in the same way that an individual in a medical study cannot be "reproduced" for the next study. However, the overall statistical outcomes of studying a specific LLM can be reproduced.
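That distinction can be made concrete with a minimal sketch: treat each evaluation as repeated stochastic pass/fail trials, and check that two independent runs agree within sampling error even though no individual output repeats. (The pass rate, seeds, and sample size below are purely illustrative, and the "model" is simulated rather than a real LLM call.)

```python
import random

def eval_run(true_pass_rate: float, n: int, seed: int) -> float:
    """Simulate one evaluation: n stochastic pass/fail judgments from a
    hypothetical model whose underlying pass rate is true_pass_rate."""
    rng = random.Random(seed)
    passes = sum(rng.random() < true_pass_rate for _ in range(n))
    return passes / n

def ci95_halfwidth(p: float, n: int) -> float:
    # Normal-approximation 95% half-width for a binomial proportion.
    return 1.96 * (p * (1 - p) / n) ** 0.5

n = 2000
run_a = eval_run(0.70, n, seed=1)  # "original study"
run_b = eval_run(0.70, n, seed=2)  # independent replication
# Different seeds -> different individual outcomes, but the aggregate
# pass rates land within each other's confidence intervals.
tolerance = ci95_halfwidth(run_a, n) + ci95_halfwidth(run_b, n)
print(run_a, run_b, abs(run_a - run_b) <= tolerance)
```

The same logic applies to a real LLM benchmark: individual generations vary, but a reported pass rate with a confidence interval is a reproducible statistical claim, provided the model version and prompts are pinned down.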
Does this happen?
I can remember this room-temperature-superconductor guy whose experiments were replicated, but this seems rare?
This study, although it has been produced by a computer science department, belongs more to the field of sociology or media studies than it does to computer science.
This is a study about the way in which human beings consume a particular media product - a consumer AI chatbot - not a study about the technological limitations or capabilities of LLMs.
The social impact of particular pieces of software is a legitimate field of study and I can see the argument that it belongs in the broadly defined field of computer science. But this sort of question is much more similar to ‘how does the adoption of spreadsheet software in finance impact the ease of committing fraud’ or ‘how does the use of presentation software to condense ideas down to bullet points impact organizational decision making’. Software has a social dimension and it needs to be examined.
But the question of which models were used is of much less relevance to such a study than the fact that they used ‘whatever capability is currently offered to consumers who commonly use chat software’. Just as in a media studies investigation into how viewing cop dramas impacts jury verdicts, the question of which cop dramas they picked to study matters less, so long as the ones they picked were representative of what typical viewers see.
I wonder if that is left over from testing people. I have major version numbers and my minor version number changes daily, often as a surprise. Sometimes several times a day. So testing people is a bit tricky. But AIs do have stable version numbers and can be specifically compared.
I do think it's a clear weakness. Capabilities are extremely different than they were twelve months ago.
> What should they do, publish sub-standard results more quickly?
Ideally, publish quality results more quickly.
I'm quite open to competing viewpoints here, but it's my impression that the academic publishing cycle isn't really contributing to the AI discussion in a substantive way. The landscape is just moving too quickly.
It's certainly possible some of the new advances (chain-of-thought, some kind of agentic architecture) could lessen or remove this effect. But that's not what the paper was studying! And if you feel strongly about it, you could try to further the discussion with results instead of handwavingly dismissing others' work.
I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up, and I wonder if these are the ones most people are using?
I keep seeing this claim, yet in my experience it doesn't hold water. I pay for the models, most people I know pay for the models, and we see all of the exact same issues.
I have Claude and ChatGPT both bullshit and lick my ass on the regular. The ass licking will occur regardless of instruction.