> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT-4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).
edit: It looks like OP attached the wrong link to the paper!
The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352
But the link in OP's post points to (what seems to be) a completely unrelated study.
> All evaluations were done in March - August 2025.
Agreed - if I were a reviewer for LLM papers, not listing the versions and prompts used would be an instant rejection.
(Personally I think the lack of reproducibility comes back mostly to peer reviewers who haven't thought carefully enough about the steps they'd need to take to reproduce the work, and instead focus on the results...)
This points to (and everyone knows this) an incentive misalignment between the funders of research and the public. Researchers are caught in the middle.
There needs to be more public naming and shaming in science social media and in conference talks, but especially when there are social gatherings at conferences and people are able to gossip. There was a bit of this with Google's various papers, as they got away with figurative murder on lack of reproducibility for commercial purposes. But eventually Google did share more.
Most journals have standards for depositing expensive datasets, but that's a clear yes/no answer. Reproducibility is a very subjective question in comparison to data deposition, and must be subjectively evaluated by peer reviewers. I'd like to see more peer review guidelines with explicit check boxes for various aspects of reproducibility.
While this is sadly true, it's especially true when talking about things that are stochastic in nature.
LLM outputs, for example, are notoriously unreproducible.
Only in the same way that an individual in a medical study cannot be "reproduced" for the next study. However, the overall statistical outcomes of studying a specific LLM can be reproduced.
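That distinction can be made concrete with a minimal sketch: treat each evaluation as repeated stochastic pass/fail trials, and check that two independent runs agree within sampling error even though no individual output repeats. (The pass rate, seeds, and sample size below are purely illustrative, and the "model" is simulated rather than a real LLM call.)

```python
import random

def eval_run(true_pass_rate: float, n: int, seed: int) -> float:
    """Simulate one evaluation: n stochastic pass/fail judgments from a
    hypothetical model whose underlying pass rate is true_pass_rate."""
    rng = random.Random(seed)
    passes = sum(rng.random() < true_pass_rate for _ in range(n))
    return passes / n

def ci95_halfwidth(p: float, n: int) -> float:
    # Normal-approximation 95% half-width for a binomial proportion.
    return 1.96 * (p * (1 - p) / n) ** 0.5

n = 2000
run_a = eval_run(0.70, n, seed=1)  # "original study"
run_b = eval_run(0.70, n, seed=2)  # independent replication
# Different seeds -> different individual outcomes, but the aggregate
# pass rates land within each other's confidence intervals.
tolerance = ci95_halfwidth(run_a, n) + ci95_halfwidth(run_b, n)
print(run_a, run_b, abs(run_a - run_b) <= tolerance)
```

The same logic applies to a real LLM benchmark: individual generations vary, but a reported pass rate with a confidence interval is a reproducible statistical claim, provided the model version and prompts are pinned down.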
Does this happen?
I can remember this room-temperature-superconductor guy whose experiments were replicated, but this seems rare?
This study, although it has been produced by a computer science department, belongs more to the field of sociology or media studies than it does to computer science.
This is a study about the way in which human beings consume a particular media product - a consumer AI chatbot - not a study about the technological limitations or capabilities of LLMs.
The social impact of particular pieces of software is a legitimate field of study and I can see the argument that it belongs in the broadly defined field of computer science. But this sort of question is much more similar to ‘how does the adoption of spreadsheet software in finance impact the ease of committing fraud’ or ‘how does the use of presentation software to condense ideas down to bullet points impact organizational decision making’. Software has a social dimension and it needs to be examined.
But the question of which models were used is of much less relevance to such a study than the fact that they used ‘whatever capability is currently offered to consumers who commonly use chat software’. Just as in a media studies investigation into how viewing cop dramas impacts jury verdicts, the question of which cop dramas they picked to study matters less, so long as the ones they picked were representative of what typical viewers see.
I wonder if that is left over from testing people. I have major version numbers and my minor version number changes daily, often as a surprise. Sometimes several times a day. So testing people is a bit tricky. But AIs do have stable version numbers and can be specifically compared.
I do think it's a clear weakness. Capabilities are extremely different than they were twelve months ago.
> What should they do, publish sub-standard results more quickly?
Ideally, publish quality results more quickly.
I'm quite open to competing viewpoints here, but it's my impression that the academic publishing cycle isn't really contributing to the AI discussion in a substantive way. The landscape is just moving too quickly.
It's certainly possible some of the new advances (chain-of-thought, some kind of agentic architecture) could lessen or remove this effect. But that's not what the paper was studying! And if you feel strongly about it, you could try to further the discussion with results instead of handwavingly dismissing others' work.
I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up, and I wonder if these are the ones most people are using?
I keep seeing this claim, yet in my experience it doesn't hold water. I pay for the models, most people I know pay for the models, and we see all of the exact same issues.
I have Claude and ChatGPT both bullshit and lick my ass on the regular. The ass licking will occur regardless of instruction.