And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.
You can’t benchmaxx an eval that comes after your model release.
Consider also that benchmaxxing makes no sense given the incentive structure: the quality of these models is directly tied to how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.
Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.
Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.
This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow climb in performance on, say, Humanity's Last Exam, to take one benchmark as an example? You could trivially get those numbers close to 100% if you wanted to.
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.
"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."
* https://www.anthropic.com/engineering/eval-awareness-browsec...
"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"
* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...
"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"
* https://www.edtechinnovationhub.com/news/anthropic-says-clau...
To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.
You are making a technical point; I am pointing out that while for _some_ benchmarks this is _technically_ possible, it's not true for plenty of benchmarks that all agree with the others.
> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking
yes this is incredibly common. I'm not talking about hypothetical scenarios.
> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.
Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
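A minimal sketch of that detection idea, assuming you already have a model's score on each benchmark plus the date each benchmark went public. The class name, field names and function name are made up for illustration, and a gap is only a hint: newer benchmarks also tend to be harder on purpose.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class BenchmarkResult:
    name: str        # hypothetical benchmark name
    published: date  # when the benchmark became public
    score: float     # the model's score, normalized to [0, 1]


def contamination_gap(results, model_release):
    """Mean score on benchmarks published before the model release minus
    mean score on benchmarks published on or after it. A large positive
    gap is a hint (not proof) of overfitting to established benchmarks."""
    old = [r.score for r in results if r.published < model_release]
    new = [r.score for r in results if r.published >= model_release]
    if not old or not new:
        raise ValueError("need at least one benchmark on each side of the release date")
    return mean(old) - mean(new)
```

Tracking this gap across several models from different labs would show whether one of them stands out, which is the "not subtle" signal described above.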
Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.
These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.
throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
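Roughly what that could look like, as a sketch. The models here are plain callables standing in for real API clients, and `daily_prompt`, `compare_models` and the `passed` check are hypothetical names, not any existing tool.

```python
import random
from datetime import date


def daily_prompt(templates, day):
    """Pick a prompt variant deterministically from the date, so every
    model sees the same prompt on a given day but the choice is
    re-drawn each day."""
    rng = random.Random(day.toordinal())
    return rng.choice(templates)


def compare_models(models, templates, day, passed):
    """Run the same daily prompt through every model.

    models: dict of name -> callable(prompt) -> output (stand-ins for
            real API clients)
    passed: your own check, callable(prompt, output) -> bool
    """
    prompt = daily_prompt(templates, day)
    return {name: passed(prompt, call(prompt)) for name, call in models.items()}
```

Seeding from the date keeps the comparison apples to apples on any given day while still rotating the prompt over time.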
That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.
This starts to break down in college, when the professors are often at best only slightly ahead. (They have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration.) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)
> This starts to break down in college, when the professors are often at best only slightly ahead. (They have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration.) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??
That is what benchmarks and intelligence tests are, which are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.
But point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.
Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
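The blinding and ordering questions at least have a mechanical answer. A small sketch, with made-up function names, that hides which model produced which output and randomizes the order you see them in:

```python
import random


def blind(outputs, seed=None):
    """outputs: dict of model_name -> output text.

    Returns (anonymized, key). `anonymized` is a list of (label, text)
    pairs in random order; `key` maps each label back to its model name
    and should stay hidden until scoring is done."""
    rng = random.Random(seed)
    names = list(outputs)
    rng.shuffle(names)
    anonymized, key = [], {}
    for i, name in enumerate(names):
        label = f"candidate-{i + 1}"
        anonymized.append((label, outputs[name]))
        key[label] = name
    return anonymized, key


def unblind(scores, key):
    """scores: dict of label -> score given by the rater."""
    return {key[label]: score for label, score in scores.items()}
```

It does nothing for the harder questions (how many tasks, what counts as success), which is the point being made above.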
Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?
They:
- hallucinate constantly
- can't follow basic instructions
- think they're Claude for some reason ;)
no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.
Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.
"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.
zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.
Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.
You mean benchmarks about the programming language that produce the fastest code?
That is not really the same.
So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...
Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should be specific about what you want rather than as broad as "do not make mistakes". It just does not work that way.
I know how stupid that sounds but it's true.
Well what do they say... "If it sounds stupid but it works, then it's not stupid!"
I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.
ymmv
My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".
AV professionals always say "timecode" - timestamp is a programming term.
Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".
1. It's exponentially better
2. yet, somehow, hand coding still isn't dead, at least for me
I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though
And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.
There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.
Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.
You can also do benchmarks but how do you measure the output of those?
The easiest way is just to use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than the others, but I wouldn't write backend code with it.
I don't think this is that subjective or vague.
There are a couple of crisp metrics that can be used to evaluate a model:
- given a prompt, does it finish a task (times X tasks)
- how much did it cost to finish the task
- how long did it take?
If all models are able to handle a class of tasks, they perform equally well.
If a model costs much more to finish a task, it is worse than other models.
If a model takes longer to finish a task, it is worse than other models.
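Those three metrics are easy to compute once each run is logged. A sketch, assuming one record per attempt and a `finished` flag set by whatever definition of done you use; the `Run` class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    finished: bool   # did the output meet your definition of done?
    cost_usd: float
    seconds: float


def summarize(runs):
    """Per model: completion rate, total spend divided by finished tasks,
    and mean wall-clock time of the finished runs."""
    summary = {}
    for model in sorted({r.model for r in runs}):
        mine = [r for r in runs if r.model == model]
        done = [r for r in mine if r.finished]
        summary[model] = {
            "completion_rate": len(done) / len(mine),
            # Failed attempts still cost money, so they are charged to the finished tasks.
            "cost_per_finished_task": sum(r.cost_usd for r in mine) / len(done) if done else None,
            "mean_seconds_when_finished": sum(r.seconds for r in done) / len(done) if done else None,
        }
    return summary
```

Charging failed attempts to the finished tasks is a design choice; averaging cost over finished runs only would flatter a model that fails often.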
The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt... That's it. Even those are UX improvements, not huge breakthroughs.
Or just that it's so much cheaper that the cost/benefit ratio is better?
Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?
I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.
And that says it all.
> Or just that it's so much cheaper that the cost/benefit ratio is better?
That too is another definition of quality, isn't it?
If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.
> Also "finish a task" is also subjective.
No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.
> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?
Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".
I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.
And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.
"Don't make mistakes" does seem dumb. It's not guidance.
https://simonwillison.net/about/#disclosures
"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."
But I'm totally unbiased on my gut-feeling posts, trust me bro.
-- AI influencers.
I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"
I was explicitly pointing out that their failure to give me early access had not, in this case, led to me reviewing their model poorly.
I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.
Fable just got announced and I rushed out an article because people are curious. I released the post mere hours afterwards, and it takes time to create the output, slice it into videos, and make a WordPress article, on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links, my previous articles have all the juicy stuff you are criticising me for not having here, where I had little time to prepare.
How is a side by side direct comparison NOT precise?
[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.
[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advanced warning.
I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.
I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.
Your entire site is obsessed and fixated on the impressive-looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.
In my opinion, if one cannot express themselves civilly, they should refrain from commenting.
AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.
It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.
I don't think it was "a huge waste of time" or needed your rant.
You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.
What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.
I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.
I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.
The issue here, I feel, is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.
> if one cannot express themselves civilly
It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?
> they have permission to insult someone's competence and work
If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.
> this post gets me irrationally irritated and makes me want to shake you and shout
Yes, criticism of my work would not generally be a personal insult.
However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.
>> this post gets me irrationally irritated and makes me want to shake you and shout
Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?
> However, if you were to call my work 'slop'
And, as previously established, if you use AI, it's not your work.
> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level
...and those are still criticisms of your work, not yourself.
The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.
You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just that a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.
Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.
I'm sure its weaknesses will become apparent in time. But, wow, this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.
I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard-of algorithm for collaborative editing, and I’ve had a long-term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out-of-the-box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.
https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...
https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...
Worth noting, the decision to eschew CRDTs predates my time here, and I've pushed for a CRDT rewrite quite a bit since I believe it could be done. The other main concern they had was memory usage, but it seems like EG Walker would solve that. Our system uses a "Commit DAG", (an Event DAG by another name), and does a three-way merge using a common ancestor of the diverged documents, and so a lot of the bones of EG Walker are there, and I'm exploring ways in which we could gradually move to it.
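For anyone unfamiliar with the three-way merge part, here is a toy sketch over a flat key/value document. It is not the system described above and not EG Walker; the function name and the "keep ours on conflict" policy are assumptions made purely for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge two diverged versions of a flat key/value document using
    their common ancestor `base`. Returns (merged, conflicts)."""
    missing = object()  # sentinel meaning "key absent in this version"
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:      # both sides agree (including both deleted)
            value = o
        elif o == b:    # only theirs changed relative to the ancestor
            value = t
        elif t == b:    # only ours changed relative to the ancestor
            value = o
        else:           # both changed, and differently
            conflicts.append(key)
            value = o   # arbitrary policy: keep ours, report the conflict
        if value is not missing:
            merged[key] = value
    return merged, conflicts
```

The common ancestor is what lets the merge tell "they deleted this" apart from "we added this", which a plain two-way diff cannot do.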
I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.
So far at least - and it's been less than a day - Fable seems better at this.
I think I also do my CRDTs differently from others. I've grown to like the pure-oplog approach after making eg-walker. LLMs are much worse at this!
For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.
I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.
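For readers who haven't fuzzed a CRDT before, a toy sketch of the idea. It uses a state-based grow-only counter as a stand-in (nothing like the oplog approach mentioned above), and all function names are made up. It checks two properties: replicas converge regardless of merge order, and no increment is lost or double-counted.

```python
import random


def merge(a, b):
    """State-based G-Counter merge: per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def fuzz_once(rng, replicas=3, steps=50):
    states = [{} for _ in range(replicas)]
    total = 0
    for _ in range(steps):
        i = rng.randrange(replicas)
        if rng.random() < 0.5:
            states[i][i] = states[i].get(i, 0) + 1   # local increment on replica i
            total += 1
        else:
            j = rng.randrange(replicas)
            states[i] = merge(states[i], states[j])  # one-way sync from j to i
    # Final sync: each replica merges every state in its own random order.
    final = []
    for i in range(replicas):
        order = list(range(replicas))
        rng.shuffle(order)
        s = dict(states[i])
        for j in order:
            s = merge(s, states[j])
        final.append(s)
    assert all(s == final[0] for s in final), "replicas diverged"
    assert sum(final[0].values()) == total, "increments lost or duplicated"
    return total


def fuzz(runs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(runs):
        fuzz_once(rng)
    return runs
```

Swapping `max` for `+` in `merge` makes the second assertion fail quickly, which is the kind of bug fuzzing tends to catch. It still only samples the state space, so it is evidence rather than a proof.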
I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.
Of course not. That's why they are so rare. But I thought we live in an AI era now where this kind of stuff can be done by a machine.
Damn you must be good, I've been feeling this for around 2 years now
"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.
AI is just another tool, learn to use it.
Like it did everything:
- this is not a Linux system (true, it was macOS)
- it is not an available command
- the binary is corrupted
- node/js is more precise
- V8 JavaScript is faster than bash (true technically??? But not in this context lol)
- JavaScript is more versatile
I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)
> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.
Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.
I’m having a really hard time believing some weak reason for a 30 day retention policy.
e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.
How weird.
https://www.wired.com/story/openai-anthropic-letter-ai-biolo...
Question is if there will be any competition in this area...
Or Fable’s arch is different enough that the clusters of compute were allocated targeting a date, and here we are, ready or not.
Or…
I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!
I am quite happy that opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!
At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.
The results were near useless.
The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.
Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.
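One cheap way to catch that kind of silent edit is to compare the set of lines before and after the move. A rough sketch with a made-up helper name; it assumes the snippet keeps its indentation, so a move into a different nesting level would be flagged too.

```python
from collections import Counter


def is_pure_move(before, after):
    """True if `after` contains exactly the same lines as `before`,
    ignoring order, blank lines and trailing whitespace, i.e. nothing
    was edited in transit."""
    def lines(text):
        return Counter(line.rstrip() for line in text.splitlines() if line.strip())
    return lines(before) == lines(after)
```

Run it on the whole file before and after the change: if the model only moved code around, the result is `True`.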
Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.
I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!
Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.
I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.
My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.
The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.
A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.
I can't remember if it was chatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.
(Joking aside, see sibling threads.)
Did you add "make no mistake" to your prompt?
Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.
But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!
I am pretty sure this time I am catching the sarcasm here. Kudos you had me in the first half.
These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.
A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...
Being a theory builder more than a problem solver I am excited for the future.
Also excited for fully formalised mathematics to hit main stream!
No idea what's going on here but agent tested a bunch of stuff. Then I asked to build a wheel so I can run the command you noted above and it appears to pass
For those who are curious...
https://github.com/bamggm/micropython-wasm/commit/5ddebae592...
https://github.com/bamggm/micropython-wasm/commit/8b362fba1f...
Soon the times of AI for $20/$200 a month will be long gone.
Forcing developers to pay for models that were built on code they scraped scot-free
A tax to do their job that developers are jumping at the chance to pay
Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards
Yes this makes me sad beyond explanation. Especially when I see open source developers happily using these tools. These companies stole your free, hard work and charge you a subscription!! Not to speak of them torrenting books and (most likely) training on private repos.
This and devs paying a subscription to use a tool that is marketed as trying to replace them.
I had a $150 monthly budget that I used for various open source projects and I've cut that entirely.
In case you weren't aware, Anthropic, OpenAI and GitHub Copilot all have programs that provide access to open source maintainers for free:
GitHub: https://docs.github.com/en/copilot/how-tos/copilot-on-github...
Anthropic: https://claude.com/contact-sales/claude-for-oss
OpenAI: https://developers.openai.com/community/codex-for-oss
> Six months of ChatGPT Pro with Codex for day-to-day coding, triage, review, and maintainer workflows
Those are free trials pending their approval in hopes of more paying customers, nothing more.
Then you say you had money that you used to donate(?) to OS and have cut that because of the frustration?
Open source just means sharing the source code for people to learn off or have the ability to customize on their own. I don't think there is any need to be frustrated about that (now if it was copyright/private of course).
Yes people, not corporations. The point is there are licenses to be respected that weren't.
We could fix that, but it requires a political will to change the law.
> To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.
That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.
It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.
Most of us don't need a model that can prove the Riemann hypothesis or Goldbach's conjecture in order to get work done.
Not everyone needs a Ferrari to go for a weekly shopping.
Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to (and, amusingly, those are the ones with strong opinions) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep costs down.
You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.
You might want to ask the guy who said it first what he meant; I was just pointing out that your work isn't particularly Anthropic-biased, in my experience.
Words apparently don't mean anything anymore.
I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.
It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.
My prompt? I’m not a prompt wizard or anything but it was literally:
> Please review the uncommitted code in this repo for bugs/issues/code smells.
I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).
All in all, a pretty crappy first experience.
uvx agentsview usage daily
Then edit the config file to add Fable pricing as described here: https://til.simonwillison.net/llms/agentsview-custom-model-p...
And run the command again. I get $126.89 for yesterday.
DATE        INPUT   OUTPUT  CACHE_CR  CACHE_RD  COST    MODELS
----        -----   ------  --------  --------  ----    ------
2026-06-09  142015  85315   321224    6880110   $10.96  claude-fable-5, gpt-5.5, claude-haiku-4-5-20251001
I tried to filter down to just fable (or 5.5 so I could deduct it) but the `--agent` flag doesn't seem to work how I'd expect... I think the $10.96 is coming from gpt-5.5 since I switched to it once I exhausted all my usage on CC. ccusage reports completely different numbers so I don't know which one of those is right.
Thanks for trying, for yesterday ccusage says "$92.02" for claude, which I assumed was the Fable usage.
uvx agentsview serve
You'll get a localhost web application which makes it much easier to filter by model.
Unfortunately it's not telling the whole story. The last message from the _only_ Fable session it monitored was:
> The data layer looks clean — <REDACTED>. Now waiting on the 11-angle workflow — verification and the gap sweep run after the finders; I'll compile the full ranked findings list when it completes.
And my memory jibes with that, I could see in the footer that it had spun up 11 agents (though agentsview says it used 0 subagents, don't know if it was "actually" workflows that it spun up?). It's like it didn't record the sub-sessions/sub-agents info?
I'm still shocked that my prompt (which I now can see thanks to this tool) of:
> Please review all the uncommitted work in this repo and identify any issues.
was able to burn so much, so quickly, and, most frustratingly, without actually doing anything useful because killing it was my only option lest it spend even more of "extra usage".
Overview of usage: https://cs.joshstrange.com/RjGzWVXy
Stats for that 1 session: https://cs.joshstrange.com/Fj5qv1wl
I've been watching my usage quota bars drop as I use the model, so I don't think I have a weird quota issue going on here.
To be clear, the jump from Opus to Fable was like the jump from pre o3 -> o3 for me. Very sharp improvement, not incremental. But that could be explained by dummy long thinking times.
It one shot a task that Opus burned hundreds of dollars on to get nowhere. Very tricky semantic refactor, got it right. Granted, again, the semantics Opus and I fleshed out 3 months prior, but Opus couldn't execute on the vision. Fable could.
Then I discussed some philosophy and it was actually both pleasant (GPT constantly "corrected" you for the sake of correction without clarification, also still often just wrong; it's like it refused to think critically about philosophy) and accurate, and actually helped resolve some deep but subtle misconceptions I had around representationalism. When talking with GPT I felt like I was talking with someone who either was sycophantic or "anything that is not absolute truth is relativism" - Fable actually discussed.
It's both exciting and kind of depressing. I can definitely see why people are getting hyped about AGI again. All the models were extremely strong technically, but I felt like they couldn't match the developer's tacit state - Fable definitely did, and that's a basic quality to be considered "usefully intelligent" IMO.
Shame that it's going away in 2 weeks and probably going to be nerfed if/when it's re-released.
But technological serfdom is waiting just around the corner. Well, to be fair, I think that societal forces would've pushed us to it anyways, no AI needed, but AI is a visceral, immediate, fast-moving instantiation of it.
So AI is only interesting to you / your org / humans if it can do things that you cannot achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?
If we cannot rely on the correctness of the result, it is not usable at all. AI must always create reliable and correct results. That was a very fundamental requirement for computing. This problem has not been solved.
AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.
Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).
What if an LLM overall starts to make fewer mistakes than an average developer, costs less than that developer's salary and is 100x faster? For sure, the companies that leverage these with just a few senior devs doing prompting, testing and requirements analysis will outcompete other organizations.
AI agents do that, perhaps not always, but still do. Now the question: would I trust AI without verifying its output?
Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors don't they?
What matters is the error rate. Past some threshold and they're better than senior devs who you don't supervise closely.
I will say that there are hardly any mis-steps in its chain of reasoning, but some odd approaches to problems and a fair bit of redundancy. Probably the most impressive part was spontaneously coming up with non-obvious issues to test, but this came with a fair handful of tests for obvious non-issues (like whether pip can extract a nested zip from a wheel without corrupting it).
As an additional check, I just submitted it to Fable, and it eviscerated it. Tons of inconsistencies found, issues skimmed over or ignored, too optimistic assumptions, math that doesn't really add up if you look at it in context. And as far as I can tell, all of these issues are entirely valid. I now feel embarrassed I'd already sent it to a few people for review. This clearly needs more work.
I might be missing something important but that doesn't seem to be an impressive task.
On a surface level it sounds like the task requires gathering calls to MicroPython-specific libs, assessing which ones are not compatible with Python, and determining how to replace the ones that are incompatible.
From that first iteration, the rest would boil down to troubleshooting the issues missed on the first shot.
I would be extremely surprised if the likes of GPT-4.1 weren't already capable of handling that task.
So, beyond Claude Fable finishing a task, what exactly is the differentiating factor?
It feels like you can give it a big chunky problem and leave it alone and it gets it done, with fewer questions and fewer design decisions that I wouldn't have made.
In reviewing its code I'm finding less to complain about than Opus. But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.
> But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.
https://generative-ai.review/2026/06/claude-fable-rush-test-...
I get them to make a 3D explainer animation. You can clearly see Fable is much improved on both Opus 4.8 and ChatGPT 5.5.
Better textures. A nifty camera follow. Humans rendered better... see for yourselves.
Fable just did it, clean code, one timeout with a hanging bash script, fixed a couple very old very structural bugs in the codebase
I am not sure it's perfect, and it will need further validation
This morning I looked at code samples & checked if all unit/integration and e2e pass & performance tests pass
I also generated a postgres schema diagram.
Aka I did probably 2 hours of work, rest was not me
The opus try was last month
Which has a full build of python to WASM with a bunch of static libs built in already.
I will say I built this pre-Fable, and Opus pretty much nailed the first build of the interpreter to WASM; CPython has had secondary support for WASM as a target since like 3.9 or something, and it just pulled from that.
I’ve been meaning to write up a blog post about this sometime, building this has been pretty interesting, including using opus to run a full auto research like loop for days to hyper optimize its performance.
I’m hoping to use fable to power some even crazier WASM adventures tho.
I wanna press it, but I don't have that kind of mad, generational wealth to put a prompt through on that setting.
It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.
> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events.
Update: looks like I've spent $82.92 in Fable 5 API priced tokens so far today (still all included in my subscription.)
Here's a TIL on how I'm calculating spending using AgentsView: https://til.simonwillison.net/llms/agentsview-custom-model-p...
* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.
* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.
It's been discussed at length (on this site, on other sites, on like every blog ever, etc) that, eventually, those subsidies will end, much as the $5-10 Ubers/Lyfts I used to take from the far north end of Chicago into the Loop in 2016 would eventually end once those companies had a footing and didn't need to hook folks.
So - yeah, I mean, a v5 model launching in a year where Anthropic has a rather deeply established market and in a year where AI costs are rising from nearly all providers (sometimes for multiple reasons) seems like exactly the thing I'd expect them to pull the subsidy plug on after a launch teaser.
(Even the open-weight models sometimes do this: for example, OpenCode Zen/Go has a rotating door of free models at any given time that eventually leave the free tier and move into the paid tier once the launch day hype/marketing dies down)
Also, a fun website: https://isaiprofitable.com/ (the numbers are probably made up)
That site doesn't list the dozens of companies doing pure inference, and making a profit while doing so.
Are the finances public for any of these companies? I'd love to take a look at them.
Compared to what?
(You may not realize it but simonw is one of the co-creators of Django, the Python web framework. If they find a Python problem difficult, it probably is.)
Web development is not a domain I would consider noteworthy of making a framework given how much development there has been in that area.
It's frustrating that superfluous tokens are burning up our quotas:
key insight, crucially this, real engineering deltas, net assessment, definitive picture, acid tests, real limits, sharp boundary, proper patch, real root cause, big progress, actually wrong, path finagling, the catch, root cause pinned, everything passes cleanly.
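If you want to put a rough number on that, here is a minimal sketch that counts how often a few of those phrases show up in a transcript. The phrase list is a subset of the one above; `count_fillers` and the sample text are hypothetical. Note that it counts phrase occurrences, not tokens, so turning this into an actual quota cost would need the model's tokenizer, which is not assumed here.

```python
import re
from collections import Counter

# A subset of the phrases listed above.
FILLER_PHRASES = [
    "key insight",
    "net assessment",
    "real root cause",
    "the catch",
    "everything passes cleanly",
]

def count_fillers(text, phrases=FILLER_PHRASES):
    """Count case-insensitive, word-bounded occurrences of each phrase."""
    counts = Counter()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[phrase] = hits
    return counts

# Made-up sample text for illustration.
sample = (
    "Key insight: the real root cause is path handling. "
    "The catch is the cache. Everything passes cleanly."
)
print(count_fillers(sample))
```

Running it over an exported session log would at least show whether the filler is a handful of phrases or a meaningful share of the output.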
Though that's also what makes humans so good at solving problems as well, it turns out.
Also, slight tangent: but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.
Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.
That may very well be true now. And in fact, this was true of more rudimentary calculations early on in computing history, where humans were definitely more efficient, particularly for more abstract mathematics. But Moore's Law comes at you fast. Even without more efficient compute, it's rather wild how much more efficient models are becoming these days just from algorithmic and training improvements.
So, maybe for now, certainly. Are you confident that will be the case in 5-10 years? And is that really your barometer for success?
>And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces.
That is certainly a limitation for now, but plenty of academic research is being done on how to address that in a more individualized way. That said, the models also have the advantage of synthesizing learnings from user interactivity back into a future release and essentially applying that globally, which is pretty neat.
There's also some cool techniques to sort of bridge the gap today, like compound engineering.
>Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.
But that's the thing: it's becoming pretty clear that the "plagiarism machine" can probably take that same problem in a prompt, having never been trained on my code, and still solve it.
In that case...maybe it doesn't feel great to have someone copy my idea. But that is certainly not plagiarism in the way you mean it. And when you put ideas out into the world, you can't be certain that someone else won't copy and remix it into something new. That's kind of how the world works already, but we're just seeing the barrier to entry decline.
Yes, I am. I am very confident that general purpose digital computers will never be more efficient than human minds in generating moderately complex code.
Why am I so confident... Well, it has been over 10 years since AlphaGo beat top go player Lee Sedol. AlphaGo was able to beat a world-class go player by doing several thousand orders of magnitude more computations than Lee Sedol, and it did so by spending several orders of magnitude more energy than the top human go player. Today, over 10 years later, the top go machines are able to beat world-class go players much more easily, but still do so using the exact same strategy of outcomputing the humans with thousands of orders of magnitude more computations, and spending orders of magnitude more energy.
Things did not change in the past 10 years, I see no reason why it should change 10 years from now.
Has it not? Why do you say that?
Also, do we still require a Deep Blue sized supercomputer for chess? :)
But regardless, compute will get to a point where human-level intelligence is close to as efficient as we are. You could argue it already is today, when you factor in the resources that the average person in the west already uses in terms of their overall impact on the planet.
I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.
There are limitations to digital computers just as there are limitations to internal combustion engines. Our brains are not digital computers. When we use our brains we don’t just do a bunch of linear algebra.
This is a silly comparison. There is a certain quantity of energy stored in oil, so we know what peak efficiency looks like. We don't actually know what amount of energy is required to solve certain problems. We quite literally have models with quite a bit of capability that can run locally on a phone today, right alongside Stockfish, for example.
And this is to say nothing of work happening now on new hardware approaches, such as Normal Computing's work on thermodynamic matrix math: https://www.normalcomputing.com/blog/a-first-demonstration-o...
That said, this feels like a strange tangent: I'm not sure it's that important that the models be as energy efficient as a human brain. We don't avoid cars because they're less energy efficient than our legs. ;)
This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitude less efficiently. Cars can at least take us distances we would never be able to cover using our muscles. In comparison, if I need to compile CPython into a WASM binary I can simply download a library that does it, or copy-paste code in a few seconds, for a million billionth of the energy it takes an LLM to do the same. Except when I download the library or copy-paste the code I (hopefully) attribute the original author and give them credit for their work.
I'm suggesting that while LLMs are bounded by physical reality, that you actually don't know what that bound is. Just a few years ago we would have thought it a fantasy to have a conversational model run on a phone.
Even if you could compute it now, that would still be tied to current architectures. With appropriate incentives, we'll continue developing hardware to make these models more efficient to execute. It's very likely that you'll be able to run a Fable caliber coding model on your phone in the next five years.
>This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitude less efficiently. Cars can at least take us distances we would never be able to using our muscles.
But that's largely not true of cars. The majority of trips are five miles or less and could easily be replaced with a bicycle. While I might personally use a bicycle, the majority choose a car to save a bit of time and effort.
So, please continue to enjoy your car, and I will continue to enjoy ready access to an LLM for a variety of other tasks. My inference energy costs are almost certainly less than your vehicle usage. ;)
OK then - do it, faster.
> You can take comfort in the fact that a few months later some[...] developer can [solve] the same problem [using your work]
Isn't that what collaboration and sharing software is supposed to be all about?
On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"
"AI slop" caught on but "clanker" did not.
It caught on, sure, but not exactly in the way I expected. The wild popularity of "slop" as a term for AI eventually gave way to the genericization of the word "slop" to mean "content of low quality, regardless of source", and is seemingly being used as just a derogatory term for anything that people dislike (particularly by folks in left leaning communities). For example, I've seen people refer to (clearly human written) commentary from some political commentators as "slop".
Your comment kind of reinforces the idea by the fact that you now have to say "AI slop" specifically to disambiguate it. It's kind of a fascinating little turn.
The earliest OED2 citation of "slop" for the sense "figurative. Nonsense, rubbish; insolence" is 1952. Slop was slop long before "AI slop" was coined, and AI slop is slop from an AI.
We're a society built by thought and good-will engagement. We won't get out of our "rules for thee" with less thought and less good-will engagement.