GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

(arrowtsx.dev)

441 points

by oshrimpton1 days ago |

214 comments

by gcanyon 2 minutes ago

> it is clear that actual intelligence has plateaued significantly

N=1, but I disagree strongly. I'm writing a hard-science science fiction story, and the physics of it is at (and frankly, beyond) my skillset. The story's plot has had to change over a dozen times as I realized errors in my application of physics in the story.

Throughout, I've been reviewing the physics with LLMs, mainly Gemini 3.1 Pro Preview, but also with Claude and OpenAI. Often I have the LLMs debate each other -- "My friend [another model] said XYZ about the physics, is that right or wrong?" In almost all cases, Gemini explains why the other models are wrong, and when I send its explanation to them, they concede it is right and they are wrong.

As I said, I did the above checks literally dozens of times as I wrote the story. And everything was dialed in: no further issues claimed by anyone, me or the LLMs.

Not with Fable. I managed to get it to review the story while it was running, and it listed out something like ten issues: some minor, some general knowledge-based, and two that were impressive:

1. It pointed out where Gemini (and I, and other LLMs) had missed a , resulting in values about 152 times larger than they should have been. I sent that to Gemini and it fully conceded that it had been wrong all along.

2. It pointed out a simple inconsistency in the application of special relativity (I thought I had that at least dialed in, but no :-/ ) that affected a very specific plot point. The story is novella-length, about 28,000 words long, and this is a point that was mentioned in the first two pages, and then not again until the very last page. And it's obvious, once you realize it. And I missed it. Gemini missed it. Claude and ChatGPT missed it.

Only Fable found it. Again, N=1, but that was a remarkable run I got out of it in the couple days it was available.

by wolttam 6 hours ago

> it is clear that actual intelligence has plateaued significantly.

> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse

These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.

Edit: My mention of data comes from this quote:

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more high-quality data in the distribution you want to unlock capabilities for.