by simonw 5 days ago

by teiferer 5 days ago

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.

by zylepe 4 days ago

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

by aspenmartin 4 days ago

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models depends directly on how well the lab can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from Llama 4? The entire org was gutted and the CEO spent billions hiring better people. Every lab takes evaluations extremely seriously.

by ElevenLathe 4 days ago

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

by aspenmartin 4 days ago

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, to take one benchmark as an example? You could trivially get those numbers close to 100% if you wanted to.

by andai 4 days ago

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

by aspenmartin 4 days ago

What does this have to do with what I’m saying? Of course the models are updated. I’m saying a new benchmark isn’t public, and the model wouldn’t know it is being evaluated on a new benchmark.

Not to mention: thinking that the API behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

by Eisenstein 4 days ago

Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...

by aspenmartin 3 days ago

Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...

by Eisenstein 3 days ago

I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.

by aspenmartin 3 days ago

Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWEBench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment, but I don't know alignment research well enough to say whether this is a big deal or not.

by ElevenLathe 4 days ago

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

by aspenmartin 3 days ago

It's not a limit scenario, is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, and benchmarks are not always static either; they are often living benchmarks that update over time.

You are making a technical point, and I am pointing out that while this is _technically_ possible for _some_ benchmarks, it's not possible for plenty of others, and those agree with the rest.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
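A minimal sketch of the check described above, assuming you already have per-benchmark scores and know which benchmarks were published before the model's release. The function name and the dictionary layout are hypothetical, not taken from any lab's tooling:

```python
from statistics import mean
from typing import Dict


def old_vs_new_gap(scores: Dict[str, float],
                   published_before_release: Dict[str, bool]) -> float:
    """Mean score on pre-release benchmarks minus mean score on post-release ones.

    Both dictionaries are keyed by benchmark name (hypothetical layout).
    A large positive gap is a hint of overfitting, not proof of it.
    """
    old = [s for name, s in scores.items() if published_before_release[name]]
    new = [s for name, s in scores.items() if not published_before_release[name]]
    if not old or not new:
        raise ValueError("need at least one benchmark on each side of the release date")
    return mean(old) - mean(new)
```

Benchmarks differ in difficulty, so a single gap says little on its own; comparing the gap across several models is probably more informative than reading one number in isolation.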

by teiferer 4 days ago

> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.

by aspenmartin 3 days ago

They don't have control over measurement. Consider also that it's easy to figure this out and it creates a scandal. Like I said, consider Llama 4, which a lot of people pointed out used a custom model in LMArena to inflate its scores; it was never clear what the true underlying story was, but regardless, that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.

by bcrosby95 4 days ago

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

by aspenmartin 4 days ago

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

by joquarky 4 days ago

Imagine unironically starting your comment with "Um" in 2026.

by jaapz 4 days ago

As opposed to your incredibly useful contribution to this thread, thanks!

by aspenmartin 4 days ago

You don't have to imagine!

by naikrovek 4 days ago

ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
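A rough sketch of that procedure, assuming each model is wrapped in a plain function that takes a prompt and returns text. The `ModelFn` wrappers, the prompt pool, and the `passes` check are all hypothetical placeholders for whatever clients and grading you actually use:

```python
import datetime
import hashlib
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model client: prompt in, text out.
ModelFn = Callable[[str], str]


def pick_daily_prompt(prompts: List[str], day: datetime.date) -> str:
    """Deterministically pick one prompt per calendar day from a pool."""
    digest = hashlib.sha256(day.isoformat().encode()).hexdigest()
    return prompts[int(digest, 16) % len(prompts)]


def run_side_by_side(models: Dict[str, ModelFn],
                     prompt: str,
                     passes: Callable[[str], bool]) -> Dict[str, bool]:
    """Send the same prompt to every model and grade each output the same way."""
    return {name: passes(fn(prompt)) for name, fn in models.items()}
```

Hashing the date keeps the choice reproducible for everyone running the comparison on the same day. A fixed pool is still finite, though, so a pool that keeps growing would fit the "can't be optimized for that one prompt" goal better.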

by aspenmartin 4 days ago

You are literally describing a benchmark

by nahrin 4 days ago

100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!

by p-e-w 4 days ago

Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

by bluGill 4 days ago

> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

by JadeNB 4 days ago

> This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)

by cycomanic 4 days ago

It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

by camdenreslink 4 days ago

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

by bluGill 4 days ago

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

by JadeNB 4 days ago

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

by cycomanic 4 days ago

Ah apologies, that's what I get for skim reading and kneejerk replying. I completely agree with you, undergrads are highly unlikely to know more about a subject than their professor (obviously there can always be exceptions).

by teiferer 4 days ago

A grad student is evaluated on how well they can follow scientific procedures and communicate their results, and on whether they have a sufficiently broad knowledge foundation. All that can easily be verified by a professor in a related field since they are very experienced in all those things. They don't actually need to be experts in the specific narrow topic the student has become the world expert in.

by aspenmartin 4 days ago

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

by Jensson 4 days ago

> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, which are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.

But the point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.

by aspenmartin 4 days ago

Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
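Those questions (blinding, presentation order, how success is scored) are easy to sketch. The following is one possible shape for a blind pairwise comparison, not a standard tool: `judge` is a hypothetical callback, which could be a human at a prompt or a script, and it only ever sees two unlabeled texts.

```python
import random
from collections import Counter
from typing import Callable, List


def blind_pairwise(outputs_a: List[str],
                   outputs_b: List[str],
                   judge: Callable[[str, str], int],
                   seed: int = 0) -> Counter:
    """Tally wins for model A and model B over paired outputs.

    judge(first, second) returns 0 if it prefers the first text and 1 if it
    prefers the second. Presentation order is randomized per pair so the
    judge cannot learn which position belongs to which model.
    """
    rng = random.Random(seed)
    wins: Counter = Counter()
    for a, b in zip(outputs_a, outputs_b):
        flipped = rng.random() < 0.5
        first, second = (b, a) if flipped else (a, b)
        choice = judge(first, second)
        # Map the positional choice back to the model that produced the text.
        picked_a = (choice == 0) != flipped
        wins["A" if picked_a else "B"] += 1
    return wins
```

Even this small version forces decisions about task count, the judging rule, and ordering, which is the point being made: once those are pinned down, it is a benchmark.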

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?

by andai 4 days ago

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)

by ishurand4 4 days ago

The only one I see that thinks it is claude other than claude itself is the GLM series.

by throw10920 4 days ago

I have screenshots of Deepseek V4 doing this too - in a non-Claude-Code harness.

by andai 3 days ago

Also MiMo...

by Wowfunhappy 4 days ago

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

by naikrovek 4 days ago

> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

by Wowfunhappy 3 days ago

Don't you think this applies to LLMs too?

by tsss 4 days ago

> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

by lukan 4 days ago

So .. where can we read about the results?

by karunamurti 4 days ago

ugghh, benchmarks?

by lukan 3 days ago

Benchmarks about the superior programming language?

You mean benchmarks about the programming language that produces the fastest code?

That is not really the same.

by Certhas 5 days ago

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

by johnisgood 5 days ago

Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose projects matter to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

by lanstin 4 days ago

"Check your work for mistakes after the first draft" maybe :)

by hardwaregeek 4 days ago

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

by AlecSchueler 4 days ago

No, relative performance between Python and Java can absolutely be measured.

by skywhopper 4 days ago

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

by andai 4 days ago

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

by bfrog 4 days ago

How do you measure the performance of people? This is subjective and biased every time.

by stray 4 days ago

I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

ymmv

by theshrike79 4 days ago

Yes, words matter.

My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

AV professionals always say "timecode" - timestamp is a programming term.

Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".

by contextfree 5 days ago

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

by contextfree 3 days ago

Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

1. It's exponentially better

2. yet, somehow, hand coding still isn't dead, at least for me

by thewhitetulip 5 days ago

How many $ do you guys spend when your session runs for 30min? What's the total budget?

by contextfree 3 days ago

I just have a regular Claude subscription and keep within its usage limits

by thewhitetulip 3 days ago

But isn't running Claude models for 30min expensive? Or is Claude Code not expensive?

I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though

by contextfree 2 days ago

It's included free with subscription plans until June 22. I get about 2 hours a day of usage through Claude Code until I hit my usage limit. I just use it for 2 hours then wait for the next day.

by solumunus 4 days ago

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

by ElFitz 5 days ago

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

by farley13 4 days ago

I think (related to the threads below) properly running evals on the state-of-the-art models is likely outside the budget for most individuals. It's undoubtedly the right thing to do.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

by theshrike79 4 days ago

IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get the feels of which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than others, but I wouldn't write backend code with it.

by locknitpicker 4 days ago

> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
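As a rough illustration of those three metrics, here is a sketch that aggregates them per model. The `Run` record and its field names are assumptions made for the example; how `finished` gets decided (the definition of done) is left to whoever records the runs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Run:
    """One attempt at one task (hypothetical record layout)."""
    model: str
    finished: bool
    cost_usd: float
    seconds: float


def summarize(runs: List[Run]) -> Dict[str, Dict[str, Optional[float]]]:
    """Per model: completion rate, plus mean cost and time over finished runs."""
    by_model: Dict[str, List[Run]] = {}
    for r in runs:
        by_model.setdefault(r.model, []).append(r)

    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for model, rs in by_model.items():
        done = [r for r in rs if r.finished]
        summary[model] = {
            "completion_rate": len(done) / len(rs),
            # Cost and time are averaged over finished runs only, so a model
            # that fails fast and cheaply does not look artificially good.
            "mean_cost_usd": mean(r.cost_usd for r in done) if done else None,
            "mean_seconds": mean(r.seconds for r in done) if done else None,
        }
    return summary
```

Averaging cost and time only over finished runs is a design choice for this sketch; counting failed runs as well would answer a different question (total spend per success).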

The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.

by theshrike79 3 days ago

"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

by locknitpicker 3 days ago

> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.

by vonneumannstan 4 days ago

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

by ivanovm 4 days ago

The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins

by torginus 4 days ago

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

by lqstuart 4 days ago

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

by kmacdough 4 days ago

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

by alecco 4 days ago

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.

by simonw 4 days ago

Anthropic didn't give me early access to this model, shouldn't that bias me against it?

by deagle50 4 days ago

You kinda proved the point...

by simonw 4 days ago

How?

by deagle50 4 days ago

If you're that easily biased then why trust your assessment?

by simonw 4 days ago

Where did I say I was biased?

by deagle50 4 days ago

the hypothetical you presented above

by simonw 4 days ago

It was a hypothetical. How does presenting a hypothetical equate to proving anyone's point here?

by deagle50 4 days ago

you implied that not being given early access could bias you in the other direction. Which in my opinion would demonstrate that you are easily biased. Which would then call into question any opinion you share about the subject.

by simonw 4 days ago

Someone accused me of being biased in favor of model providers who give me early access, after I praised Fable's performance.

I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"

I was explicitly pointing out that their failure to give me early access had not, in this case, led to me reviewing their model poorly.

I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.

by munksbeer 4 days ago

Don't feed the trolls Simon.

reply

upvote

by alias_neo4 days ago|

[-]

This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison

reply

upvote

by bigboggerlogins4 days ago|

[-]

[dead]

reply

upvote

by tezza5 days ago|

[-]

[flagged]

reply

upvote

by 1dom5 days ago|

[-]

[flagged]

reply

upvote

by tezza5 days ago|

[-]

Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.

Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.

reply

upvote

by 1dom5 days ago|

[-]

I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.

reply

upvote

by tezza5 days ago|

[-]

There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast still possible.

reply

upvote

by ksec3 days ago|

[-]

My good lord Tezza. You still have a calm and composed response after that sort of insults being thrown at you. Haven't seen one this bad for quite some time on HN. I hope you have a great day.

reply

upvote

by lionkor4 days ago|

[-]

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

reply

upvote

by user439284 days ago|

[-]

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.

reply

upvote

by 1dom4 days ago|

[-]

I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.

AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.

It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.

reply

upvote

by user439284 days ago|

[-]

I found the website you ranted about interesting, comparing the quality of the visualization between the different models.

I don't think it was "a huge waste of time" or needed your rant.

You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.

What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.

reply

upvote

by 1dom4 days ago|

[-]

This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.

I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.

I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.

The issue here I feel is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.

reply

upvote

by jgilias4 days ago|

[-]

Oh boy. I see this so much.

reply

upvote

by bigboggerlogins4 days ago|

[-]

[dead]

reply

upvote

by throw109204 days ago|

[-]

> It reads like an unhinged rant about AI

> if one cannot express themselves civilly

It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?

> they have permission to insult someone's competence and work

If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.

reply

upvote

by user439283 days ago|

[-]

You think it was civil when the comment started with:

> this post gets me irrationally irritated and makes me want to shake you and shout

Yes, criticism of my work would not generally be a personal insult.

However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.

reply

upvote

by throw109203 days ago|

[-]

> You think it was civil when the comment started with:

>> this post gets me irrationally irritated and makes me want to shake you and shout

Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?

> However, if you were to call my work 'slop'

And, as previously established, if you use AI, it's not your work.

> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level

...and those are still criticisms of your work, not yourself.

The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.

reply

upvote

by leodavi4 days ago|

[-]

How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?

reply

upvote

by 1dom4 days ago|

[-]

simonw's pelicans probably wouldn't get posted in response to a request for a more quantitative analysis.

You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just that a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.

reply

upvote

by bigboggerlogins4 days ago|

[-]

[dead]

reply

upvote

by thewhitetulip5 days ago|

[-]

It feels like hand written software will now be "bespoke"

reply

upvote

by disgruntledphd24 days ago|

[-]

artisanal, hand-crafted software.

reply

upvote

by kansface5 days ago|

[-]

Yes, exactly this. If I didn't care about price at all, I'd exclusively use this model. It functions more like an actual engineer. I'm in the midst of a DB migration, and e.g. 5.5 continually suggests stuff like "use DB X instead of DB Y for task Z because it's 30% faster" which is an impossibility of reality, given we are migrating DBs. Fable jumped in, reduced allocs by literally 46x, found multiple bugs 4.8 and 5.5 created (max file system usage, correctness issues, etc), and continually suggested awesome improvements unprompted. As in, it would finish a task and then suggest we tackle this other existing problem I didn't know about in a very specific manner... this is the first model that feels like it's coming for my job.

reply

upvote

by josephg5 days ago|

[-]

I'm having the same experience. I'm in the process of implementing a new CRDT for realtime collaborative editing. There just aren't a lot of implementations of CRDTs kicking around online for opus or any of the other models to have good design instincts.

Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.

I'm sure its weaknesses will become apparent in time. But, wow this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.
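
For anyone who hasn't seen the technique, here is a minimal sketch of what "fuzzing a CRDT for convergence" can look like. It is an illustration only, not the fuzzer Fable wrote and not the new CRDT discussed here: it assumes a toy grow-only set, and `GSet` and `fuzz_convergence` are hypothetical names made up for this example.

```python
import random


class GSet:
    """Toy grow-only set CRDT (illustrative): merge is plain set union."""

    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)

    def merge(self, other):
        self.items |= other.items


def fuzz_convergence(seed, n_replicas=3, n_steps=200):
    """Apply random adds and merges, then sync, and report whether replicas agree."""
    rng = random.Random(seed)  # seeded so any failure can be replayed
    replicas = [GSet() for _ in range(n_replicas)]

    # Random interleaving of local edits and pairwise merges.
    for _ in range(n_steps):
        replica = rng.choice(replicas)
        if rng.random() < 0.5:
            replica.add(rng.randrange(50))
        else:
            replica.merge(rng.choice(replicas))

    # Final sync: every replica merges every other replica, in random order.
    pairs = [(a, b) for a in range(n_replicas) for b in range(n_replicas) if a != b]
    rng.shuffle(pairs)
    for a, b in pairs:
        replicas[a].merge(replicas[b])

    # Convergence property: after syncing, all replicas hold the same state.
    states = [frozenset(r.items) for r in replicas]
    return all(state == states[0] for state in states)


if __name__ == "__main__":
    for seed in range(100):
        assert fuzz_convergence(seed), f"replicas diverged for seed {seed}"
```

The useful part is the seeded generator: when a run fails, the seed reproduces the exact sequence of operations. A real fuzzer for a text CRDT would also generate deletes and concurrent inserts at the same position, which is where most bugs tend to hide.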

reply

upvote

by infinitebit5 days ago|

[-]

I was about to ask where you work that you’re implementing new CRDTs and then I noticed your username! Thanks for all that you do!

I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.

reply

upvote

by aquariusDue4 days ago|

[-]

Long shot here because I'm not knowledgeable enough about CRDTs but maybe something like DSON would help? I saw a talk about it a while ago and it might be useful.

https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...

https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...

reply

upvote

by infinitebit4 days ago|

[-]

Ty, checking this out!

reply

upvote

by josephg5 days ago|

[-]

I’d be fascinated to hear more if you’re willing to share. What is special about your document model which makes existing tools like automerge a bad fit?

reply

upvote

by infinitebit4 days ago|

[-]

We have cross-field invariants that merging at the data structure level can't ensure (in an obvious way, at least), and "lose the semantic meaning of a conflict". The main idea behind their approach is that certain parts of the model can have custom "mergers" that are able to run business logic to maintain these invariants.

Worth noting, the decision to eschew CRDTs predates my time here, and I've pushed for a CRDT rewrite quite a bit since I believe it could be done. The other main concern they had was memory usage, but it seems like EG Walker would solve that. Our system uses a "Commit DAG", (an Event DAG by another name), and does a three-way merge using a common ancestor of the diverged documents, and so a lot of the bones of EG Walker are there, and I'm exploring ways in which we could gradually move to it.
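
To make the "custom mergers on top of a three-way merge" idea concrete, here is a rough sketch. It is not our actual system: it assumes documents are flat dicts, treats a missing field as `None`, and `three_way_merge` and the `mergers` mapping are hypothetical names for illustration. It also only covers per-field conflicts, not the cross-field invariants mentioned above.

```python
def three_way_merge(base, ours, theirs, mergers=None):
    """Merge two dict documents that both diverged from `base`.

    `mergers` maps a field name to a function (base, ours, theirs) -> value,
    called only when both sides changed that field in different ways.
    """
    mergers = mergers or {}
    merged = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree (or neither touched the field)
        elif o == b:
            value = t  # only their side changed it
        elif t == b:
            value = o  # only our side changed it
        elif key in mergers:
            value = mergers[key](b, o, t)  # business logic resolves the conflict
        else:
            raise ValueError(f"unresolved conflict on field {key!r}")
        if value is not None:  # None means the field is absent or was deleted
            merged[key] = value
    return merged


# Example: both sides changed `qty`, so the custom merger adds up both deltas.
base = {"title": "Draft", "qty": 10}
ours = {"title": "Final", "qty": 12}
theirs = {"title": "Draft", "qty": 15}
result = three_way_merge(
    base, ours, theirs,
    mergers={"qty": lambda b, o, t: b + (o - b) + (t - b)},
)
print(result)  # {'qty': 17, 'title': 'Final'}
```

A generic data-structure merge would have to pick 12 or 15 for `qty`; the custom merger is what lets the domain decide that the right answer is 17.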

reply

upvote

by hnewsdaniel5 days ago|

[-]

Hello joseph,

I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.

reply

upvote

by josephg4 days ago|

[-]

Yeah, you've certainly been able to get Opus to write a CRDT. It just needs a lot of hand-holding to make it correct. Opus always seems pretty bad at coming up with invariants and using them to make a piece of software correct. Without invariants, you end up with lots of hacky workarounds to avoidable problems.

So far at least - and it's been less than a day - Fable seems better at this.

I think I also do my CRDTs differently from others. I've grown to like the pure-oplog approach after making eg-walker. LLMs are much worse at this!

reply

upvote

by hnewsdaniel4 days ago|

[-]

[flagged]

reply

upvote

by teiferer5 days ago|

[-]

> wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.

reply

upvote

by josephg5 days ago|

[-]

I’ll ask it for a formal proof when I get home and see how it goes.

I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.

reply

upvote

by teiferer3 days ago|

[-]

Oh I actually mean machine checked. Indeed, formal pen-and-paper proofs can have flaws, since they are essentially code without test coverage.

reply

upvote

by noduerme5 days ago|

[-]

In the real world, many of us don't have the time to create formal proofs. But our instinct in testing where edge cases may exist in code that we wrote is a type of refactoring that happens in our brains during the coding process. Hand the coding off to a machine and you have no idea where to start looking for the flaws.

reply

upvote

by bluGill4 days ago|

[-]

> Hand the coding off to a machine and you have no idea where to start looking for the flaws.

I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.

reply

upvote

by noduerme4 days ago|

[-]

Yes but it takes much longer to trace them. Because the LLM code almost always gravitates toward data blobs and highly dynamic objects and spaghetti that takes a ton of cognitive load to understand what their failure modes are. Even when it does document them.

reply

upvote

by teiferer3 days ago|

[-]

> In the real world, many of us don't have the time to create formal proofs

Of course not. That's why they are so rare. But I thought we live in an AI era now where this kind of stuff can be done by a machine.

reply

upvote

by weatherlite5 days ago|

[-]

> this is the first model that feels like it's coming for my job

Damn you must be good, I've been feeling this for around 2 years now

reply

upvote

by literalAardvark5 days ago|

[-]

It's been obvious for at least 2 years, anyone who doesn't see the writing on the wall simply hasn't learned how to use these well or has severe exponential blindness.

"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.

reply

upvote

by weatherlite4 days ago|

[-]

Yeah I agree. We're headed into a rougher job market pretty much across the board for white collar work, hitting junior people worse at this stage. Up to societies around the world to decide how to deal with this - so far we deal with it by ignoring it, it seems.

reply

upvote

by 10GBps4 days ago|

[-]

The monks got mad too when the printing press was invented because it took their jobs of hoarding knowledge.

AI is just another tool, learn to use it.

reply

upvote

by FeteCommuniste4 days ago|

[-]

And then in a couple years the AI gets better at "using AI" than the bottom 99.999% of knowledge workers, who are now out of work.

reply

upvote

by OtomotO4 days ago|

[-]

We are all doomed! Doomed I say!

reply

upvote

by spoiler4 days ago|

[-]

Gosh, I must be doing something wrong. I spent 15 minutes (of which a lot was waiting while it was thinking about "backwards rationalising" its decision and "gaslighting"[1]) arguing with it over why it keeps using `node -e "console.log(require('fs').readdirSync('…'))"` instead of `ls -l …`.

Like it did everything:

- this is not a Linux system (true, it was macOS)
- it is not an available command
- the binary is corrupted
- node/js is more precise
- V8 JavaScript is faster than bash (true technically??? But not in this context lol)
- JavaScript is more versatile

I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)

reply

upvote

by boc5 days ago|

[-]

Yeah same here, Fable on "high" is producing substantially better results than Opus 4.8 on xhigh for me and my actual real-world evals today. It "feels" smarter and doesn't use nearly as many tokens running in circles. As a result I've been able to run two large refactors today without hitting the context limit danger zones - it's more expensive but also more efficient. It's been able to find some bugs that Opus missed. Pretty impressive stuff.

reply

upvote

by garciasn5 days ago|

[-]

I keep getting this message:

> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.

reply

upvote

by algoth15 days ago|

[-]

It’s unusable for me due to the refusals. I’m using claude to find patterns in health data

reply

upvote

by yakz5 days ago|

[-]

I do some work in laboratory automation and it was quick to refuse the first thing I asked it to do. There wasn't anything spicy in the request, just basic liquid-handling protocol implementation. Their position seems to be that they're too stupid to classify requests safely, and that seems reasonable to me. I'd guess the classifier will improve rapidly.

reply

upvote

by 5d41402abc4b5 days ago|

[-]

Have you tried locally running qwen?

reply

upvote

by mrbuttons4544 days ago|

[-]

Is there a Qwen that I can run locally that is anywhere near these frontier models?

reply

upvote

by Der_Einzige4 days ago|

[-]

No, and don't let anyone gas light you into thinking the answer is yes.

reply

upvote

by dmd5 days ago|

[-]

Same. I'm working on a set of python and matlab scripts that deals with segmenting MRI images into brain vs skull, and it thinks that's bioterrorism.

reply

upvote

by mdgld5 days ago|

[-]

[dead]

reply

upvote

by rvnx5 days ago|

[-]

Quite counterproductive to refuse to help on health issues too. If they detect health data, they can add a disclaimer, but not hide the information.

reply

upvote

by secult5 days ago|

[-]

You miss the point - by collecting and processing medical data they would fall into a thoroughly regulated industry. Not because they may provide you incorrect data, but because they are not allowed to process it.

reply

upvote

by fragmede5 days ago|

[-]

What custom prompt do you have set up? If you tell it your occupation, does it turn helpful? There was a study showing that if you tell the models they tested that you're a patient, they refuse, but tell them you're a doctor and suddenly they turn helpful.

reply

upvote

by garciasn5 days ago|

[-]

According to the model, it’s not the model itself that’s doing this, it’s the harness.

Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.

reply

upvote

by UltraSane5 days ago|

[-]

Anthropic knows it refuses too much, they want to be very cautious to avoid any scandals. I think this is why they want to store all Fable and Mythos chats for 30 days so they can use the data to improve.

reply

upvote

by hirako20005 days ago|

[-]

They want to be very cautious to honour the important doctrine at least until the IPO launches: we are so good we have to nerf our products.

reply

upvote

by fn-mote5 days ago|

[-]

I’m at a point where I expect everything I do will be retained indefinitely.

I’m having a really hard time believing some weak reason for a 30 day retention policy.

reply

upvote

by girafffe_i5 days ago|

[-]

There’s no way around it? Can’t you obfuscate as generic data and use keys to map to the real data?
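
Something like this is what I have in mind, as a rough sketch only. `pseudonymize` and `restore` are hypothetical helper names, the records are made up, and I have no idea whether this actually gets past the classifier. Note that numbers pass through unchanged, so this hides labels, it does not anonymize anything.

```python
import itertools


def pseudonymize(records):
    """Swap field names and string values for opaque tokens.

    Returns (masked_records, key), where `key` maps each token back to the
    original text. The key stays on your machine and is never sent anywhere.
    """
    key = {}      # token -> original text
    seen = {}     # (prefix, original text) -> token, so repeats reuse a token
    counter = itertools.count(1)

    def token(prefix, original):
        if (prefix, original) not in seen:
            tok = f"{prefix}{next(counter)}"
            seen[(prefix, original)] = tok
            key[tok] = original
        return seen[(prefix, original)]

    masked = [
        {
            token("f", name): (token("v", value) if isinstance(value, str) else value)
            for name, value in record.items()
        }
        for record in records
    ]
    return masked, key


def restore(masked, key):
    """Map tokens back to the original field names and string values."""
    return [
        {
            key[name]: (key.get(value, value) if isinstance(value, str) else value)
            for name, value in record.items()
        }
        for record in masked
    ]


records = [
    {"patient": "alice", "glucose": 5.4},
    {"patient": "bob", "glucose": 6.1},
]
masked, key = pseudonymize(records)
assert restore(masked, key) == records  # round trip gives back the real data
```

You would send `masked` to the model, ask for the pattern analysis in terms of the tokens, and translate the answer back locally with `key`.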

reply

upvote

by algoth14 days ago|

[-]

I guess you could even turn everything into numbers, not a bad idea at all!

reply

upvote

by 5d41402abc4b5 days ago|

[-]

what prompts do you use for this?

reply

upvote

by garciasn5 days ago|

[-]

I wonder if it sees Healthcare companies being targeted and that's why it's freaking out; clearly they have some pretty stupid regexes in the harness to detect this sort of shit.

e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.

How weird.

reply

upvote

by throwaway202225 days ago|

[-]

I wonder if this letter has anything to do with why anything even remotely related to biology is getting flagged.

https://www.wired.com/story/openai-anthropic-letter-ai-biolo...

reply

upvote

by andy12_4 days ago|

[-]

I don't know if you are aware, but some people reported on Twitter that Fable 5 may flag the message regardless of content if it knows (from either pretraining knowledge or memories) that you work in either of those fields. I don't know if that's your case.

https://x.com/i/status/2064449457869984035

reply

upvote

by iambateman5 days ago|

[-]

I asked a question for my son about how mosquitos carry malaria and Fable was like “ok now hold it right there”

reply

upvote

by piokoch5 days ago|

[-]

Obviously, soon, for anything valuable, you will have to buy from Anthropic a "special license for biology/security/finance advice".

Question is if there will be any competition in this area...

reply

upvote

by LouisvilleGeek5 days ago|

[-]

Same here. It's been rushed for the IPO (in my opinion).

reply

upvote

by fragmede5 days ago|

[-]

Or people were quitting their subscription for codex-5.5 and it was beginning to show up in their metrics.

reply

upvote

by brookst5 days ago|

[-]

Or development had gotten to a point where they need real world usage to tune product and refusals.

Or Fable’s arch is different enough that they allocated clusters of compute targeting a date, and here we are, ready or not.

Or…

reply

upvote

by the__alchemist4 days ago|

[-]

Interesting! I have not used Fable, but so far have not hit trouble. I'm a hobby biologist with a home mol bio lab. It wouldn't answer my questions about LNPs, but so far has been fine for my recombinant DNA workflows, lab techniques, environmental DNA protocols etc. I suspect this may become more difficult!

reply

upvote

by fumar5 days ago|

[-]

Same. I am working on music firmware for an existing device. I can't proceed as it keeps switching to Opus.

reply

upvote

[-]

deleted

reply

upvote

by black_knight5 days ago|

[-]

Still does not crack my hardest nuts. Gave it one of them and it blew through my entire allowance on thinking about one question, with no apparent answer in sight!

I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!

I am quite happy that Opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!

reply

upvote

by user439285 days ago|

[-]

I also see a lot of people saying they are happy with weaker models.

At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.

The results were near useless.

The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.

Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.

Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.

reply

upvote

by black_knight5 days ago|

[-]

I have Qwen 3.6 27B and 35B running locally and coming from Opus it feels like talking to an imposter. Someone who pretends to be competent, but really isn’t. Results are always disappointing. Sonnet is better, but I have given up on asking it. Even for simple things I wait for my Opus limits to reset.

reply

upvote

by abalashov4 days ago|

[-]

Have you tried Kimi K2.6 or DeepSeek V4 (Flash or Pro)?

reply

upvote

by daymanstep5 days ago|

[-]

What kind of problems are you trying to have it solve?

reply

upvote

by _kb5 days ago|

[-]

The Riemann hypothesis, PvNP, and the Collatz conjecture.

reply

upvote

by black_knight5 days ago|

[-]

Not these. I wonder if the well is poisoned there. The models know that these are "unpossible", so it might not solve them just because… Maybe some day.

I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!

reply

upvote

by komali25 days ago|

[-]

So, what kind of problems are you having it try to solve?

Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.

reply

upvote

by black_knight5 days ago|

[-]

I don’t care to share my exact problems. Mostly because GPT-5.5 hallucinates false solutions, and I would rather not have people reply with "Oh but ChatGPT solves it!", because it takes expert knowledge to debunk them. To its credit, ChatGPT will admit its very fundamental mistakes when they are pointed out to it. But also because no-one would really care.

I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.

My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.

reply

upvote

by neonstatic5 days ago|

[-]

Bro, you are being left behind bro, it's amazing bro...

reply

upvote

by Lerc5 days ago|

[-]

That's a bit of a tricky point. I have had quite a lot of problems with models informing me what I am attempting is impossible. If no-one has done it, or at least it doesn't know about it being done it tends to fall back on people voicing their baseless speculations, and for just about anything you propose, you can find a person who will loudly proclaim it is impossible.

The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.

A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.

I can't remember if it was chatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.

reply

upvote

by unnouinceput5 days ago|

[-]

Stop dancing and share the prompt, we're dying to see it

reply

upvote

by black_knight5 days ago|

[-]

Hey, stop asking to see my nuts! My nuts are private – okay?

(Joking aside, see sibling threads.)

reply

upvote

by andriy_koval4 days ago|

[-]

> The Riemann hypothesis, PvNP, and the Collatz conjecture.

Did you add "make no mistake" to your prompt?

reply

upvote

by mastermage5 days ago|

[-]

Is this a joke? Seriously? These are some of the hardest problems in math, period. Hundreds if not thousands of the greatest minds in history have attempted to solve these problems. And you think that the current level of AI can blow through them? It is also a possibility that, for example, the Riemann Hypothesis is just not provable (Gödel's theorem).

reply

upvote

by black_knight5 days ago|

[-]

No one is expecting that! I expect _kb was sarcastic/making a point.

Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.

But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!

reply

upvote

by mastermage5 days ago|

[-]

if it was sarcastic then whoosh on me.

reply

upvote

by _kb4 days ago|

[-]

It was a bit of humour. It would be much more feasible to have an LLM generate programs that solve those problems rather than solving directly. I tried to make a start, but I couldn't even vibe a simple tool that would let me reliably validate if generated solvers would halt or loop forever.

reply

upvote

by mastermage4 days ago|

[-]

> if generated solvers would halt or loop forever.

I am pretty sure this time I am catching the sarcasm here. Kudos you had me in the first half.

reply

upvote

by moffkalast4 days ago|

[-]

Ayy lmao

reply

upvote

by black_knight5 days ago|

[-]

The medium ones are results where one needs to construct some object, which my intuition tells me should exist. The difficult ones are typically to show that certain objects can not be constructed.

These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.

reply

upvote

by Certhas5 days ago|

[-]

I have some medium difficulty math problems where I have used the models for the last year and a half repeatedly. Back then they were already good at pointing out obstructions and constructing counterexamples. So that tracks. But at first glance it looks like Fable actually made real progress on one problem for the first time.

A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...

reply

upvote

by black_knight5 days ago|

[-]

Cool! Yes, we are getting there.

Being a theory builder more than a problem solver I am excited for the future.

Also excited for fully formalised mathematics to hit the mainstream!

reply

upvote

by tclancy5 days ago|

[-]

Perhaps you should rephrase those nuts?

reply

upvote

by sd2k5 days ago|

[-]

That is pretty wild. It took me a hell of a lot more coaxing and persevering to get to a similar point with eryx [0] (we spoke a bit about this before on Mastodon) using Opus. Based on your transcript, Fable seems to have a more optimistic 'sure, let's proceed as if this is possible' mindset. Looking forward to trying it out for some hairier problems.

[0]: https://github.com/eryx-org/eryx

reply

upvote

by jameson5 days ago|

[-]

Got curious and ran a similar prompt with DeepSeek v4 Pro w/ OpenCode

No idea what's going on here but agent tested a bunch of stuff. Then I asked to build a wheel so I can run the command you noted above and it appears to pass

For those who are curious...

https://github.com/bamggm/micropython-wasm/commit/5ddebae592...

reply

upvote

by jameson5 days ago|

[-]

Mimo v2.5 Pro Ultraspeed w/ OpenCode

https://github.com/bamggm/micropython-wasm/commit/8b362fba1f...

reply

upvote

by larodi5 days ago|

[-]

One thing I can tell you is you are either favored by Anthropic, or your version of the CLI does not exhaust limits, or there's some major bug, as two people around me (myself included) claim it took half an hour to hit the ceiling. Which makes it practically unusable, where the same workflow a day ago produced a good 5-6 hours of workload with several agents.

reply

upvote

by piokoch5 days ago|

[-]

Monetization is coming. They'll tell companies that AI is replacing their workers, so it is still worth paying $100K/year for the license, as those AIs are not going to jump to another job, get sick, be late, complain, require free coffee and so on.

Soon the times of AI for $20/$200 a month will be long gone.

reply

upvote

by tarkin24 days ago|

[-]

Get people hooked, tell them spending time coding is no longer needed, let their skills deteriorate, tell them they need to cough up for a licence to do their job

Forcing developers to pay for models that were built on code they scraped scot-free

A tax to do their job that developers are jumping at the chance to pay

Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards

reply

upvote

by witx4 days ago|

[-]

> Forcing developers to pay for models that were built on code they scraped scot-free.

Yes, this makes me sad beyond explanation. Especially when I see open source developers happily using these tools. These companies stole your, free, hard work and charge you a subscription!! Not to speak about them torrenting books and (most likely) training on private repos.

This and devs paying a subscription to use a tool that is marketed as trying to replace them.

I had a $150 monthly budget that I used for various open source projects and I've cut that entirely.

reply

upvote

by simonw4 days ago|

[-]

> These companies stole your, free, hard work and charge you a subscription!!

In case you weren't aware, Anthropic, OpenAI and GitHub Copilot all have programs that provide access to open source maintainers for free:

GitHub: https://docs.github.com/en/copilot/how-tos/copilot-on-github...

Anthropic: https://claude.com/contact-sales/claude-for-oss

OpenAI: https://developers.openai.com/community/codex-for-oss

reply

upvote

by yencabulator2 days ago|

[-]

> The Claude for Open Source Program is our way of saying thank you for all your hard work, with 6 months of free Claude Max 20x. Apply now.

> Six months of ChatGPT Pro with Codex for day-to-day coding, triage, review, and maintainer workflows

Those are free trials pending their approval in hopes of more paying customers, nothing more.

reply

upvote

by andriy_koval4 days ago|

[-]

Was there a comprehensive survey among maintainers showing that it's a fair price for decades of hard work?

reply

upvote

by majora20074 days ago|

[-]

I don't get what you're saying. You're frustrated that Open Source projects were used to build these AIs and that OS devs (or devs in general) are paying to use AI.

Then you say you had money that you used to donate(?) to OS and have cut that because of the frustration?

Open source just means sharing the source code for people to learn off or have the ability to customize on their own. I don't think there is any need to be frustrated about that (now if it was copyright/private of course).

reply

upvote

by witx4 days ago|

[-]

> Open source just means sharing the source code for people to learn off or have the ability to customize on their own.

Yes, people, not corporations. The point is there are licenses to be respected that weren't.

reply

upvote

by lkjdsklf4 days ago|

[-]

Model training pretty clearly falls under fair use.

We could fix that, but it requires a political will to change the law.

reply

upvote

by bingaweek4 days ago|

[-]

This has not been determined in courts and your willingness to speak so confidently about it speaks volumes.

reply

upvote

by simonw4 days ago|

[-]

The closest we've come to a court decision on this so far has been the Anthropic case, which did indeed find that training on unlicensed data falls under fair use: https://www.documentcloud.org/documents/25982181-authors-v-a...

> To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.

reply

upvote

by witx4 days ago|

[-]

If you look carefully model training is a very good relicensing exercise of your code

reply

upvote

by paganel4 days ago|

[-]

> Forcing developers to pay for models that were built on code they scraped scot-free

That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.

reply

upvote

by thewebguyd4 days ago|

[-]

I've been saying this since the beginning, the rug pull is coming. If these models can eventually replace a human worker, there is no reason these companies won't charge (and get away with it) very close to a typical SWE salary.

It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.

reply

upvote

by andriy_koval4 days ago|

[-]

Unless there is competition (e.g. Chinese models, taking you 80% there, but costing 20x less)

reply

upvote

by larodi5 days ago|

[-]

As someone noted here recently - use the frontier models as much as you can, while you can.

reply

upvote

by dualvariable4 days ago|

[-]

AI for $20/month won't ever go away, but it won't be the absolute latest and greatest frontier model.

Most of us don't need a model that can prove the Riemann hypothesis or Goldbach's conjecture in order to get work done.

reply

upvote

by miroljub5 days ago|

[-]

Thankfully, we have Chinese models we can use for a fraction of the price.

Not everyone needs a Ferrari to go for a weekly shopping.

reply

upvote

by baq4 days ago|

[-]

A Ferrari will likely lap you when you’re racing, though, and the market and the economy is a race. You’ll be facing a question soon, or your employer will, whether to spend a significant chunk of free cash on fable-class tokens or on literally anything else instead - wages and salaries included.

reply

upvote

by iugtmkbdfil8344 days ago|

[-]

<< You’ll be facing a question soon, or your employer will

Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to ( and, amusingly, those are the ones with strong opinions ) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep costs down.

reply

upvote

by miroljub4 days ago|

[-]

Yeah, sure. In the same way I can see only Ferraris driving as taxis, company cars, transport vehicles, used by post, delivery services ...

You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.

reply

upvote

by witx4 days ago|

[-]

They are most likely shills from Anthropic, there's quite a few here every time new models come out.

reply

upvote

by miyoji4 days ago|

[-]

That's not fair. Simon is a well-known shill for the entire AI industry, not just Anthropic.

reply

upvote

by simonw4 days ago|

[-]

What's your definition of "shill"?

reply

upvote

by miyoji3 days ago|

[-]

Merriam-Webster: noun, 1b: one who makes a sales pitch or serves as a promoter

You might want to ask the guy who said it first what he meant; I was just pointing out that your work isn't particularly Anthropic-biased, in my experience.

reply

upvote

by Jensson4 days ago|

[-]

Probably means fan, shills have undisclosed ties and I doubt he means Simon has undisclosed ties to the entire AI industry, that would be very impressive if so.

reply

upvote

by supern0va4 days ago|

[-]

Ah, yeah. I've noticed people starting to just use "slop" to mean "anything I see online that I don't like" now, too.

Words apparently don't mean anything anymore.

reply

upvote

by cedws4 days ago|

[-]

It’s not meant for subscription users; the subscriptions are just the gateway drug to Enterprise pricing which Anthropic intends to use to juice their numbers before IPO.

reply

upvote

by desmond13034 days ago|

[-]

Or use API billing? We have access to it at my company with no limits

reply

upvote

by simonw4 days ago|

[-]

Are you on the $100/month subscription?

reply

upvote

by joshstrange4 days ago|

[-]

I am, and I used up the entire 5 hour window in 8min using the highest thinking setting. It also ate up $15 of extra usage before I noticed.

I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.

It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.

My prompt? I’m not a prompt wizard or anything but it was literally:

> Please review the uncommitted code in this repo for bugs/issues/code smells.

I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).

All in all, a pretty crappy first experience.

reply

upvote

by simonw4 days ago|

[-]

Try running this command and see what it thinks you spent at API prices:

  uvx agentsview usage daily

Then edit the config file to add Fable pricing as described here: https://til.simonwillison.net/llms/agentsview-custom-model-p...

And run the command again. I get $126.89 for yesterday.

reply

upvote

by joshstrange4 days ago|

[-]

Hmm, I tried that and made the config file change but it didn't work for me. I just see:

    DATE        INPUT    OUTPUT   CACHE_CR  CACHE_RD   COST     MODELS
    ----        -----    ------   --------  --------   ----     ------
    2026-06-09  142015   85315    321224    6880110    $10.96   claude-fable-5, gpt-5.5, claude-haiku-4-5-20251001

I tried to filter down to just fable (or 5.5 so I could deduct it) but the `--agent` flag doesn't seem to work how I'd expect...

I think the $10.96 is coming from gpt-5.5 since I switched to it once I exhausted all my usage on CC. CCusage reports completely different numbers so I don't know which one of those is right.

Thanks for trying, for yesterday ccusage says "$92.02" for claude, which I assumed was the Fable usage.
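For sanity-checking either tool by hand, the arithmetic behind an "API-priced" figure is just tokens times a per-million rate, summed over the four token categories. A rough sketch: the `Pricing` values are placeholders (not real Fable prices), `api_cost` is a hypothetical helper, and reading CACHE_CR as cache writes and CACHE_RD as cache reads is an assumption about the table above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. All values here are placeholders."""
    input: float
    output: float
    cache_write: float
    cache_read: float


def api_cost(input_tokens: int, output_tokens: int,
             cache_write_tokens: int, cache_read_tokens: int,
             pricing: Pricing) -> float:
    """Return the USD cost of one usage row at the given per-million rates."""
    total = (
        input_tokens * pricing.input
        + output_tokens * pricing.output
        + cache_write_tokens * pricing.cache_write
        + cache_read_tokens * pricing.cache_read
    )
    return total / 1_000_000


# Placeholder rates; substitute the real per-model prices before trusting the result.
placeholder = Pricing(input=1.0, output=2.0, cache_write=3.0, cache_read=4.0)

# Token counts copied from the 2026-06-09 row above (all models combined).
row = dict(input_tokens=142_015, output_tokens=85_315,
           cache_write_tokens=321_224, cache_read_tokens=6_880_110)

print(f"${api_cost(**row, pricing=placeholder):.2f}")
```

Because the row mixes three models with different rates, a single `Pricing` can only approximate it; a per-model breakdown is needed for a real comparison against ccusage.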

reply

upvote

by simonw4 days ago|

[-]

If you run this:

  uvx agentsview serve

You'll get a localhost web application which makes it much easier to filter by model.

reply

upvote

by joshstrange4 days ago|

[-]

That's very interesting, I had not used agentsview at all before today and I'll have to keep that in my back pocket.

Unfortunately it's not telling the whole story. The last message from the _only_ Fable session it monitored was:

> The data layer looks clean — <REDACTED>. Now waiting on the 11-angle workflow — verification and the gap sweep run after the finders; I'll compile the full ranked findings list when it completes.

And my memory jibes with that, I could see in the footer that it had spun up 11 agents (though agentsview says it used 0 subagents, don't know if it was "actually" workflows that it spun up?). It's like it didn't record the sub-sessions/sub-agents info?

I'm still shocked that my prompt (which I now can see thanks to this tool) of:

> Please review all the uncommitted work in this repo and identify any issues.

was able to burn so much, so quickly, and, most frustratingly, without actually doing anything useful because killing it was my only option lest it spend even more of "extra usage".

Overview of usage: https://cs.joshstrange.com/RjGzWVXy

Stats for that 1 session: https://cs.joshstrange.com/Fj5qv1wl

reply

upvote

by simonw4 days ago|

[-]

Can you tell in AgentsView if Fable spun up a bunch of Opus/Haiku/etc subagents that burned tokens as well?

reply

upvote

by joshstrange4 days ago|

[-]

It's as if it spun up a bunch of subagents but agentsview doesn't report on it. I see a tiny bit of Haiku use once I turn on all models (except gpt-5.5).

https://cs.joshstrange.com/z9x6SPcC

reply

upvote

by jsw974 days ago|

[-]

simonw, if you are not bumping up against the same false-positive guardrail problems and budget consumption that everyone else is, then that is something worth digging into. I would normally say that's crazy but IPOs put weird pressure on companies.

reply

upvote

by simonw4 days ago|

[-]

I've had a couple of guardrail blocks.

I've been watching my usage quota bars drop as I use the model, so I don't think I have a weird quota issue going on here.

reply

upvote

by sigbottle4 days ago|

[-]

Just tried it. Fable is extremely strong. The fact that we can't point to any concrete architectural upgrade is worrying - that means "it just gets bigger" is kind of viable.

To be clear, the jump from Opus to Fable was like the jump from pre o3 -> o3 for me. Very sharp improvement, not incremental. But that could be explained by dummy long thinking times.

It one shot a task that Opus burned hundreds of dollars on to get nowhere. Very tricky semantic refactor, got it right. Granted, again, the semantics Opus and I fleshed out 3 months prior, but Opus couldn't execute on the vision. Fable could.

Then I discussed some philosophy and it was actually both pleasant (GPT constantly "corrected" you for the sake of correction without clarification, also still often just wrong; it's like it refused to think critically about philosophy) and accurate, and actually helped resolve some deep but subtle misconceptions I had around representationalism. When talking with GPT I felt like I was talking with someone who either was sycophantic or "anything that is not absolute truth is relativism" - Fable actually discussed.

Both are exciting and kind of make me depressed. I can definitely see why people are getting hyped about AGI again. All the models were extremely strong technically but I felt like they couldn't match the developer's tacit state - Fable definitely did, and that's a basic quality to be considered "usefully intelligent" IMO, at least to me.

Shame that it's going away in 2 weeks and probably going to be nerfed if/when it's re-released.

reply

upvote

by keybored4 days ago|

[-]

Worrying? Depressing? Why are people who are clearly enthusiasts (since they are testing the capabilities on release) always using these words? Is this a genuine interest, something that is pleasurable, or a morbid curiosity to test the bleeding edge of Humanity’s Doom? Bizarre.

reply

upvote

by sigbottle3 days ago|

[-]

It would be amazing in a perfect and just world. This technology is revolutionary. I'm very interested in LLM's because I'm personally interested in how one thinks better and comes up with better ideas - I think LLM's might elucidate some structure on that.

But technological serfdom is waiting just around the corner. Well, to be fair, I think that societal forces would've pushed us to it anyways, no AI needed, but AI is a visceral, immediate, fast-moving instantiation of it.

reply

upvote

by keybored3 days ago|

[-]

Telling and expected.

reply

upvote

by matheusmoreira5 days ago|

[-]

Fable has been producing some really good work on my end as well. Definitely better than Opus 4.8. The only problems are the cost and constant cybersecurity refusals. A single session uses up 100% of my 5h window without finishing, and that's when it doesn't get derailed by nonsensical refusals.

reply

upvote

by sexylinux5 days ago|

[-]

It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just bind highly skilled people to do verification work. Instead these people should do the actual work, results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.

reply

upvote

by zahlman4 days ago|

[-]

> AI is only interesting if it can do things that humans can not do.

AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.

Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).

reply

upvote

by iwontberude4 days ago|

[-]

It doesn’t get bored or demotivated, but it also lacks interest and motivation generally so it comes with the same pitfalls of having nothing to lose and being utterly unaccountable, (e.g. destructive actions, lying, and being coercive or Machiavellian for no reason other than efficiency in achieving an arbitrary and artificial status of completion).

reply

upvote

by Lutger5 days ago|

[-]

Humans make mistakes too, does it mean humans are unusable? We accept as empirical fact that most production-quality code has 2 - 10 bugs per 1k LoC. According to your premise, virtually all existing software is therefore unusable.

What if an LLM overall starts to make fewer mistakes than a medium developer, costs less than its salary and is 100x faster? For sure, the companies that will leverage these with just a few senior devs doing prompting, testing and requirements analysis, will outcompete other organizations.

reply

upvote

by nalekberov4 days ago|

[-]

Humans make mistakes and then learn from them. A really good expert would never deliberately copy-paste an obscure solution from the internet and then ask for forgiveness later.

AI agents do that, perhaps not always, but they still do. Now the question: would I trust AI without verifying its output?

reply

upvote

by camdenreslink4 days ago|

[-]

Humans also make mistakes in ways that other humans can understand or expect. Sometimes LLMs make mistakes in a way that makes you say “no human would have ever done that”.

reply

upvote

by fsniper3 days ago|

[-]

You can not trust human output without verification either. That's why you have tests, qa, staging envs, A/B tests..

reply

upvote

by CookieCrisp4 days ago|

[-]

There is plenty of work that does not need to be perfectly verified, because the risk is controlled. Prototyping a javascript game for example. Or code that runs just on your local machine where good enough is good enough. I'm sure a lot of you do super important work that needs 100% quality code all the time, but... some of us don't.

reply

upvote

by naasking4 days ago|

[-]

> Because it is not usable, if we need to verify everything.

Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors don't they?

What matters is the error rate. Past some threshold and they're better than senior devs who you don't supervise closely.

reply

upvote

by misja1115 days ago|

[-]

AI is like a junior developer. You have to review her code carefully but she is most definitely useful.

reply

upvote

by rllj5 days ago|

[-]

Why is your AI a she? What's up with gendering LLMs. Reminds me of Richard Dawkins calling Claude "Claudia" and insisting it to be conscious.

reply

upvote

by zahlman4 days ago|

[-]

I think GP was gendering the hypothetical junior dev, rather than the AI.

reply

upvote

by baobabKoodaa4 days ago|

[-]

The purpose of gendering into female gender like this is to signal to other leftists that you are part of their tribe.

reply

upvote

by latentsea4 days ago|

[-]

This is part of the training data now. She can hear you, you know...

reply

upvote

by anygivnthursday5 days ago|

[-]

Yeah, it makes the same old errors, being confidently wrong then sorry... I mean, it is still an LLM

reply

upvote

by OvervCW4 days ago|

[-]

One does not need to be able to create it themselves to evaluate if the output is correct. Consider for example that you can easily determine if a meal tastes delicious without being an expert chef, or the fact that NP problems are very difficult to solve but make for easily verifiable solutions.
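The NP point can be made concrete with subset sum: checking a proposed answer is one pass over it, while finding one by brute force may try every subset. A small sketch; `verify` and `solve` are illustrative names, and the brute-force search is just the simplest solver, not the best known one:

```python
from itertools import combinations


def verify(numbers, target, certificate):
    """Check a claimed solution: `certificate` is a list of distinct indices.

    Runs in time linear in the certificate length.
    """
    indices_ok = (
        len(set(certificate)) == len(certificate)
        and all(0 <= i < len(numbers) for i in certificate)
    )
    return indices_ok and sum(numbers[i] for i in certificate) == target


def solve(numbers, target):
    """Brute force: may examine up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None  # no subset adds up to the target


numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, 9, answer))
```

The reviewer only needs `verify`; whoever (or whatever) produced the certificate can be arbitrarily clever or arbitrarily unreliable without changing the cost of checking it.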

reply

upvote

by dbbk4 days ago|

[-]

This is what tests are for.

reply

upvote

by zahlman4 days ago|

[-]

The difficult part here is supposed to be the actual compilation to create the .wasm file? Or what am I missing here? The wheel is only a few hundred lines of code outside of the Python implementation, and it would seem that the MicroPython version of the project already demonstrates the necessary techniques for operating wasmtime.

reply

upvote

by simonw4 days ago|

[-]

Read the transcript if you want to see all of the details that make this hard: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

reply

upvote

by zahlman3 days ago|

[-]

Thanks. I had a quick run-through and I'm not really that impressed, though I'll cede that I have an atypical perspective on these kinds of issues. HN comments don't seem like the right place for a detailed critique of Claude's work here, but I've added it to my blog roadmap.

I will say that there are hardly any mis-steps in its chain of reasoning, but some odd approaches to problems and a fair bit of redundancy. Probably the most impressive part was spontaneously coming up with non-obvious issues to test, but this came with a fair handful of tests for obvious non-issues (like whether pip can extract a nested zip from a wheel without corrupting it).

reply

upvote

by sigbottle5 days ago|

[-]

Does anyone know what the architecture of Fable is? Is it harnesses? Did they solve persistent learning? What did they do?

reply

upvote

by sothatsit5 days ago|

[-]

Seems to just be a bigger model.

reply

upvote

by moffkalast4 days ago|

[-]

"Good ol' scaling, nothing beats that."

reply

upvote

by mcv3 days ago|

[-]

I have to agree. I'm working on a complex technical proposal that's a bit too far outside my expertise (I intend to submit it to actual experts for a more thorough review). I've worked with Opus and Gemini to review it and work out all the problems and inconsistencies, and I thought it was in a pretty good state.

As an additional check, I just submitted it to Fable, and it eviscerated it. Tons of inconsistencies found, issues skimmed over or ignored, too optimistic assumptions, math that doesn't really add up if you look at it in context. And as far as I can tell, all of these issues are entirely valid. I now feel embarrassed I'd already sent it to a few people for review. This clearly needs more work.

reply

upvote

by locknitpicker4 days ago|

[-]

> Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython

I might be missing something important but that doesn't seem to be an impressive task.

On a surface level it sounds like the task requires gathering calls to MicroPython-specific libs, assessing which ones are not compatible with Python, and proceeding to determine how to replace the ones that are incompatible.

From that first iteration, the rest would boil down to troubleshooting the issues missed on the first shot.

I would be extremely surprised if the likes of GPT4.1 wasn't already capable of handling that task.

So, beyond Claude Fable finishing a task, what exactly is the differentiating factor?

reply

upvote

by simonw4 days ago|

[-]

Did you read the transcript? There are a whole lot of details to figure out: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

reply

upvote

by kubb5 days ago|

[-]

What can it do that Opus couldn’t?

reply

upvote

by simonw5 days ago|

[-]

Always hard to say for sure because I'm not sitting around running the exact same situations through both models in parallel to compare them.

It feels like you can give it a big chunky problem and leave it alone and it gets it done, with fewer questions and fewer design decisions that I wouldn't have made.

In reviewing its code I'm finding less to complain about than Opus. But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

reply

upvote

by asdfologist5 days ago|

[-]

But you said you've been working on those problems for months, so didn't you throw those same problems at Opus?

reply

upvote

by knivets5 days ago|

[-]

He has early access to Anthropic models, of course he will hype them up, so that they will keep sharing access to preview models with him (and more traffic to his website). It also doesn't require him to perform any rigorous analysis of model performance, just share how it feels:

> But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

reply

upvote

by tezza5 days ago|

[-]

I did a qualitative side-by-side of Claude Fable vs Opus 4.8 vs ChatGPT 5.5

https://generative-ai.review/2026/06/claude-fable-rush-test-...

I get them to make a 3D explainer animation. You can clearly see Fable is much improved on both Opus 4.8 and ChatGPT 5.5.

Better Textures . A nifty camera follow . Humans rendered better . ... see for yourselves

reply

upvote

by ranguna4 days ago|

[-]

Honestly, they all look good

reply

upvote

by miohtama5 days ago|

[-]

Crank up more revenue for IPO

reply

upvote

by pinkgolem5 days ago|

[-]

I gave it a complete database migration of our app. Opus failed hard each time: untyped JSONB for some rows, no proper normalisation, falling back to asking me questions in between.

Fable just did it, clean code, one timeout with a hanging bash script, fixed a couple very old very structural bugs in the codebase

reply

upvote

by idontwantthis5 days ago|

[-]

How did you do this impressive amount of work and verify that it did it perfectly all in one day?

reply

upvote

by pinkgolem5 days ago|

[-]

I told Claude to do it yesterday evening, checked in during my nightly break.

I am not sure it's perfect, and it will need further validation

This morning I looked at code samples & checked that all unit/integration, e2e and performance tests pass

I also generated a postgres schema diagram.

Aka I did probably 2 hours of work, rest was not me

The opus try was last month

reply

upvote

by mrits4 days ago|

[-]

Nightly break? Are you from medieval Europe or a security guard that dabbles in vibe coding?

reply

upvote

by pinkgolem4 days ago|

[-]

I am from modern Europe, and that was my way of saying my nightly piss, happy to learn better wording

reply

upvote

by zek4 days ago|

[-]

if it’s of interest I’ve been working on https://github.com/HubSpot/boomslang

Which has a full build of python to WASM with a bunch of static libs built in already.

I will say I built this pre fable and actually the first build of the interpreter to WASM opus pretty much nailed, cpython has secondary support for WASM as a target since like 3.9 or something and it just pulled from that.

I’ve been meaning to write up a blog post about this sometime, building this has been pretty interesting, including using opus to run a full auto research like loop for days to hyper optimize its performance.

I’m hoping to use fable to power some even crazier WASM adventures tho.

reply

upvote

by alexchantavy5 days ago|

[-]

High, extra, or max?

reply

upvote

by qingcharles4 days ago|

[-]

It has a setting named "Ultracode" with a flashy little disco light when you select it. (not joking!)

https://imgur.com/a/NfIxDwN

I wanna press it, but I don't have that kind of mad, generational wealth to put a prompt through on that setting.

reply

upvote

by simonw5 days ago|

[-]

High.

reply

upvote

by Emanation4 days ago|

[-]

These transcription tasks don't seem difficult for LLMs in general.

reply

upvote

by alecco5 days ago|

[-]

I hate how the Instagram/TikTok/YouTube influencer cancer is getting into AI. With early access and all that.

It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.

reply

upvote

by simonw5 days ago|

[-]

I often get early access but didn't for this one, it's quite possible there's an NDA in an email somewhere that I missed and forgot to sign.

reply

upvote

by what5 days ago|

[-]

[flagged]

reply

upvote

by selcuka5 days ago|

[-]

It is already disclosed [1]:

> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events.

[1] https://simonwillison.net/about/

reply

upvote

by keybored4 days ago|

[-]

HN's problem is that they/we keep upvoting him.

reply

upvote

by simonw5 days ago|

[-]

My disclosures are on my blog: https://simonwillison.net/about/#disclosures

reply

upvote

[-]

deleted

reply

upvote

by sagarpatil5 days ago|

[-]

Did you hit your weekly limit?

reply

upvote

by tomjakubowski5 days ago|

[-]

What are some reasons to consider your project instead of Pyodide?

reply

upvote

by simonw5 days ago|

[-]

It's difficult to run Pyodide inside server-side Python.

reply

upvote

by oblio5 days ago|

[-]

How much does it cost? How much did those tasks you did cost?

reply

upvote

by simonw5 days ago|

[-]

So far it's all fitting into my current $100/month Claude Max subscription. I got lucky: I had 80% of my weekly allowance left and it resets tomorrow, so I'm burning tokens to try and use it all up by then.

Update: looks like I've spent $82.92 in Fable 5 API priced tokens so far today (still all included in my subscription.)

Here's a TIL on how I'm calculating spending using AgentsView: https://til.simonwillison.net/llms/agentsview-custom-model-p...
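
For anyone wondering what "API priced tokens" means in practice, here is a rough sketch of the arithmetic. This is not the AgentsView implementation from the TIL, and the prices and token counts are made-up placeholders, not real Fable 5 numbers; the `Pricing`, `Usage` and `api_cost` names are hypothetical too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Placeholder values only, not real pricing."""
    input_per_mtok: float
    output_per_mtok: float
    cache_read_per_mtok: float


@dataclass(frozen=True)
class Usage:
    """Token counts as they might be summed from a day's session logs."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int


def api_cost(usage: Usage, pricing: Pricing) -> float:
    """Return what `usage` would have cost in USD at API list prices."""
    total_micro_usd = (
        usage.input_tokens * pricing.input_per_mtok
        + usage.output_tokens * pricing.output_per_mtok
        + usage.cache_read_tokens * pricing.cache_read_per_mtok
    )
    # Prices are quoted per million tokens, so scale back down.
    return total_micro_usd / 1_000_000


# Hypothetical numbers purely for illustration.
EXAMPLE_PRICING = Pricing(
    input_per_mtok=10.0, output_per_mtok=50.0, cache_read_per_mtok=1.0
)
EXAMPLE_USAGE = Usage(
    input_tokens=200_000, output_tokens=100_000, cache_read_tokens=5_000_000
)

total = api_cost(EXAMPLE_USAGE, EXAMPLE_PRICING)
print(f"${total:.2f}")
```

The point is only that a subscription user can still put an API-equivalent dollar figure on a day's usage by multiplying each token category by its list price; cached reads are kept separate because they are typically billed at a different rate than fresh input.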

reply

upvote

by diffuse_l5 days ago|

[-]

Seems like weekly allowance got reset back to 0%, pretty usual when they deploy new models.

reply

upvote

by EstanislaoStan5 days ago|

[-]

Have you seen Fable randomly jump from 50% session limit to 100%? That happened to me a couple hours ago. It was preceded by a bunch of errors about failing to submit a bunch of screenshots.

reply

upvote

by SyneRyder5 days ago|

[-]

I haven't noticed that, but I did notice that on a single turn of maybe a few sentences, the cache hit was somehow roughly 500K. Either that's a bug, or there are some truly massive thinking blocks or Claude Code harness system injections behind the scenes.

reply

upvote

by simonw5 days ago|

[-]

Nothing like that for me yet.

reply

upvote

by EstanislaoStan5 days ago|

[-]

I'm thinking the 1M context limit bit me here. Only on Max x5.

reply

upvote

by blackqueeriroh5 days ago|

[-]

Simon is also on Max x5

reply

upvote

by layoric5 days ago|

[-]

AFAICT come June 22, you won't be able to use your subscription for Fable 5?

reply

upvote

by ethanpil5 days ago|

[-]

Per the "Availability" section of the page, it seems like it should come back to all plans eventually...

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

reply

upvote

by trueno5 days ago|

[-]

wut in tarnation

reply

upvote

by klardotsh5 days ago|

[-]

Coding plans are a (massive) subsidy. We can debate until the cows come home whether western frontier models' API pricing rates are fair, but the coding plans are all heavy discounts below those API rates meant to draw people in and get them hooked (and, ostensibly, to be useful for hobbyists or other lower-usage cases).

It's been discussed at length (on this site, on other sites, on like every blog ever, etc) that, eventually, those subsidies will end, much as the $5-10 Ubers/Lyfts I used to take from the far north end of Chicago into the Loop in 2016 would eventually end once those companies had a footing and didn't need to hook folks.

So - yeah, I mean, a v5 model launching in a year where Anthropic has a rather deeply established market and in a year where AI costs are rising from nearly all providers (sometimes for multiple reasons) seems like exactly the thing I'd expect them to pull the subsidy plug on after a launch teaser.

(Even the open-weight models sometimes do this: for example, OpenCode Zen/Go has a rotating door of free models at any given time that eventually leave the free tier and move into the paid tier once the launch day hype/marketing dies down)

reply

upvote

by oblio4 days ago|

[-]

The worst part is that Uber "only" lost about $30bn. AI will probably lose at least $300bn by the time the bubble pops. Which means that the pressure to hook and enshittify will be at least 10x as high.

Also, a fun website: https://isaiprofitable.com/ (the numbers are probably made up)

reply

upvote

by km3r4 days ago|

[-]

The problem with that website/perspective is that it doesn't separate training costs from inference costs. Training is a one-time cost, and while it is certainly not something you can completely ignore, it being one-time changes the answer to "Is AI profitable?".

That site doesn't list the dozens of companies doing pure inference, and making a profit while doing so.

reply

upvote

by oblio4 days ago|

[-]

> That site doesn't list the dozens of companies doing pure inference, and making a profit while doing so.

Are the finances public for any of these companies? I'd love to take a look at them.

reply

upvote

by Escapade5160 5 days ago|

[-]

They gave everyone double usage to try it.

reply

upvote

by throwaway27448 5 days ago|

[-]

> VERY difficult problems

Compared to what?

reply

upvote

by zirkonit5 days ago|

[-]

But, but, how does the pelican look?!

reply

upvote

by simonw5 days ago|

[-]

See parallel thread: https://news.ycombinator.com/item?id=48464054

reply

upvote

by dz0707 5 days ago|

[-]

Given how badly some of the models do on somewhat similar problems, I'm sure the pelican is included in the training set now. A similar problem: given an airplane outline and implementation constraints, produce a painting scheme (constraints being something like "it will be implemented using covering film, hence no gradients, no impossible cuts, not more than 2 colors on the engine cowl, etc."). Google Gemini is meh, but GPT models are just terrible. I don't have an Anthropic subscription at home, hence have not tested it.

reply

upvote

by astrange4 days ago|

[-]

Bad pelicans are in the training set because it's read his blog post. Including a good pelican in midtraining wouldn't help the problem because you'd just produce that every time.

reply

upvote

by uncivilized5 days ago|

[-]

This looks like a toy project, not a “VERY difficult” problem like you stated.

reply

upvote

by enraged_camel5 days ago|

[-]

What does that mean? Have you never worked on extremely difficult problems as a side project?

reply

upvote

by uncivilized5 days ago|

[-]

I guess my comment got lost in translation. The project OP linked in his comment is a toy project, not a difficult problem as he led others to believe.

reply

upvote

by enraged_camel5 days ago|

[-]

So you could have done it in your sleep, with your hands tied behind your back. Got it.

(You may not realize it but simonw is one of the cofounders of Django, Python's web framework. If they find a Python problem difficult, it probably is.)

reply

upvote

by uncivilized4 days ago|

[-]

Read the log he posted. If this is very difficult, then what would you consider AI, kernel development, computer graphics, etc.?

Web development is not a domain I would consider noteworthy of making a framework given how much development there has been in that area.

reply

upvote

by cube00 5 days ago|

[-]

> Here's the transcript

It's frustrating that superfluous tokens are burning up our quotas:

key insight, crucially this, real engineering deltas, net assessment, definitive picture, acid tests, real limits, sharp boundary, proper patch, real root cause, big progress, actually wrong, path finagling, the catch, root cause pinned, everything passes cleanly.

reply

upvote

by 120983 5 days ago|

[-]

[flagged]

reply

upvote

by supern0va5 days ago|

[-]

AI models decompose problems down into tiny pieces that exist in their training data, so in a sense, you're correct.

Though that's also what makes humans so good at solving problems as well, it turns out.

Also, slight tangent: but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.

reply

upvote

by runarberg5 days ago|

[-]

The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less. And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces. That is how the first person to run CPython in WASM did that, and that is why the plagiarism machine can now do the same (only a thousand times more lame and uninspiring).

Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

reply

upvote

by supern0va5 days ago|

[-]

>The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

That may very well be true now. And in fact, this was true of more rudimentary calculations early on in computing history, where humans were definitely more efficient, particularly for more abstract mathematics. But Moore's Law comes at you fast. Even without more efficient compute, it's rather wild how much more efficient models are becoming these days just from algorithmic and training improvements.

So, maybe for now, certainly. Are you confident that will be the case in 5-10 years? And is that really your barometer for success?

>And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces.

That is certainly a limitation for now, but plenty of academic research is being done on how to address that in a more individualized way. That said, the models also have the advantage of synthesizing learnings from user interactivity back into a future release and essentially applying that globally, which is pretty neat.

There's also some cool techniques to sort of bridge the gap today, like compound engineering.

>Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

But that's the thing: it's becoming pretty clear that the "plagiarism machine" can probably take that same problem in a prompt, having never been trained on my code, and still solve it.

In that case...maybe it doesn't feel great to have someone copy my idea. But that is certainly not plagiarism in the way you mean it. And when you put ideas out into the world, you can't be certain that someone else won't copy and remix it into something new. That's kind of how the world works already, but we're just seeing the barrier to entry decline.

reply

upvote

by runarberg5 days ago|

[-]

> Are you confident that will be the case in 5-10 years?

Yes, I am. I am very confident that general purpose digital computers will never be more efficient than human minds in generating moderately complex code.

Why am I so confident... Well, it has been over 10 years since AlphaGo beat top Go player Lee Sedol. AlphaGo was able to beat a world-class Go player by doing several thousand orders of magnitude more computations than Lee Sedol, and it did so by spending several orders of magnitude more energy than the top human Go player. Today, over 10 years later, the top Go machines are able to beat world-class Go players much more easily, but still do so using the exact same strategy of outcomputing the humans with thousands of orders of magnitude more computations, and spending orders of magnitude more energy.

Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

reply

upvote

by supern0va5 days ago|

[-]

>Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

Has it not? Why do you say that?

Also, do we still require a Deep Blue sized supercomputer for chess? :)

reply

upvote

by runarberg4 days ago|

[-]

What has not changed is the strategy of throwing a gargantuan amount of computation at the problem. If anything, we throw more computation at more problems now than in 2016 (and in 1997, for that matter). The underlying technology is pretty much the same, just more parameters, more calculations, etc. Yes, every individual calculation takes less power now than in 2016, but we make up for that by making millions of millions more calculations, even for simpler tasks.

reply

upvote

by supern0va4 days ago|

[-]

Sure, but there will be an upper bound after which we will be close to human level performance on the vast majority of tasks, and then at that point the focus becomes efficiency (or a continuing road to superintelligence for some tasks).

But regardless, compute will get to a point where human-level intelligence is close to as efficient as we are. You could argue it already is today, when you factor in the resources that the average person in the West already uses in terms of their overall impact on the planet.

reply

upvote

by runarberg4 days ago|

[-]

You are describing science fiction. There is nothing in measured reality that indicates your predictions will come close to materializing.

I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.

There are limitations to digital computers just as there are limitations to internal combustion engines. Our brains are not digital computers. When we use our brains we don’t just do a bunch of linear algebra.

reply

upvote

by supern0va4 days ago|

[-]

>I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.

This is a silly comparison. There is a certain quantity of energy stored in oil, so we know what peak efficiency looks like. We don't actually know what amount of energy is required to solve certain problems. We quite literally have models with quite a bit of capability that can run locally on a phone today, right alongside Stockfish, for example.

And this is to say nothing of work happening now on new hardware approaches, such as Normal Computing's work on thermodynamic matrix math: https://www.normalcomputing.com/blog/a-first-demonstration-o...

That said, this feels like a strange tangent: I'm not sure it's that important that the models be as energy efficient as a human brain. We don't avoid cars because they're less energy efficient than our legs. ;)

reply

upvote

by runarberg4 days ago|

[-]

Point is that both are science fiction narratives and neither reflect reality in any way what-so-ever. How fast a car can drive and how much a LLMs can compute are bounded quantities, limited by the physical reality. In both cases we can imagine a world where this limit does not exist, but that is not the reality we live in.

This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitudes less efficiently. Cars can at least take us distances we would never be able to using our muscles. In comparison, if I need to compile CPython into a WASM binary I can simply download a library that does it, or copy paste code in a few seconds, for a million billionth of the energy it takes an LLM to do the same. Except when I download the library or copy-paste the code I (hopefully) attribute the original author and give them credit for their work.

reply

upvote

by supern0va4 days ago|

[-]

>Point is that both are science fiction narratives and neither reflect reality in any way what-so-ever. How fast a car can drive and how much a LLMs can compute are bounded quantities, limited by the physical reality. In both cases we can imagine a world where this limit does not exist, but that is not the reality we live in.

I'm suggesting that while LLMs are bounded by physical reality, you don't actually know what that bound is. Just a few years ago we would have thought it a fantasy to have a conversational model run on a phone.

Even if you could compute it now, that would still be tied to current architectures. With appropriate incentives, we'll continue developing hardware to make these models more efficient to execute. It's very likely that you'll be able to run a Fable caliber coding model on your phone in the next five years.

>This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitudes less efficiently. Cars can at least take us distances we would never be able to using our muscles.

But that's not largely true of cars. The majority of trips are five miles or less and could easily be replaced with a bicycle. While I might personally use a bicycle, the majority choose a car to save a bit of time and effort.

So, please continue to enjoy your car, and I will continue to enjoy ready access to an LLM for a variety of other tasks. My inference energy costs are almost certainly less than your vehicle usage. ;)

reply

upvote

by scubbo5 days ago|

[-]

> The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

OK then - do it, faster.

> You can take comfort in the fact that a few months later some[...] developer can [solve] the same problem [using your work]

Isn't that what collaboration and sharing software is supposed to be all about?

reply

upvote

by weqwh5 days ago|

[-]

[flagged]

reply

upvote

by wlonkly5 days ago|

[-]

On one hand, "clanker" has good steampunk vibes.

On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"

"AI slop" caught on but "clanker" did not.

reply

upvote

by supern0va5 days ago|

[-]

>"AI slop" caught on but "clanker" did not.

It caught on, sure, but not exactly in the way I expected. The wild popularity of "slop" as a term for AI eventually gave way to the genericization of the word "slop" to mean "content of low quality, regardless of source", and is seemingly being used as just a derogatory term for anything that people dislike (particularly by folks in left leaning communities). For example, I've seen people refer to (clearly human written) commentary from some political commentators as "slop".

Your comment kind of reinforces the idea by the fact that you now have to say "AI slop" specifically to disambiguate it. It's kind of a fascinating little turn.

reply

upvote

by wlonkly4 days ago|

[-]

But "slop" has meant low-quality stuff for a very long time. See also "swill", both analogies to pig feed.

The earliest OED2 citation of "slop" for the sense "figurative. Nonsense, rubbish; insolence" is 1952. Slop was slop long before "AI slop" was coined, and AI slop is slop from an AI.

reply

upvote

by Chu4eeno5 days ago|

[-]

"Slop" originated on /pol/, but I'm not going to try to thread the needle of the rules by trying to explain it without being offensive or triggering some filter. See the first related term here: https://en.wiktionary.org/wiki/AI_slop#English

reply

upvote

by blackqueeriroh5 days ago|

[-]

You have this backwards, as Simon could tell you. In fact, Simon coined “AI slop” to mean “low quality AI output.”

reply

upvote

by simonw5 days ago|

[-]

I didn't coin it myself, but I did help amplify it at the moment it started to take off.

reply

upvote

by calvinmorrison5 days ago|

[-]

claiming you aren't robophobic is the first sign of being a robophobe.

reply

upvote

by adamtaylor_13 5 days ago|

[-]

If you've got a real argument to make, by all means, make it. Your anger does not magically "make it so".

reply

upvote

by celdon25 5 days ago|

[-]

It's still a vote, and votes don't require reasons, and shouldn't be dismissed out of hand. There's a growing chorus of those who are fed up with rules for thee but not for me.

reply

upvote

by adamtaylor_13 4 days ago|

[-]

An emotional vote with no rationale should indeed be dismissed out of hand.

We're a society built by thought and good-will engagement. We won't get out of our "rules for thee" with less thought and less good-will engagement.

reply

upvote

by bnchrch5 days ago|

[-]

Automobiles are not interesting or useful because they're just using trails the horses already built.

reply

upvote

by 120983 5 days ago|

[-]

[flagged]

reply

upvote

by eli5 days ago|

[-]

I think this is a worthwhile argument, but you do it a disservice by spamming it in trollish comments

reply

upvote

by simonw5 days ago|

[-]

I mean yeah, in this case I fed my own open source code directly into it.

reply

upvote

by rq34qwh5 days ago|

[-]

[flagged]

reply