by teiferer, 5 days ago

by zylepe, 4 days ago

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

by aspenmartin, 4 days ago

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models depends directly on how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.

by ElevenLathe, 4 days ago

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

by aspenmartin, 4 days ago

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, to take one benchmark example? You could trivially get those numbers close to 100% if you wanted to.

by andai, 4 days ago

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

by aspenmartin, 4 days ago

What does this have to do with what I’m saying? Of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know it is being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

by Eisenstein, 4 days ago

Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...

by aspenmartin, 3 days ago

Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...

by Eisenstein, 3 days ago

I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.

by aspenmartin, 3 days ago

Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWE-bench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment but I don't know alignment research well enough to say if this is a big deal or not.

by ElevenLathe, 4 days ago

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

by aspenmartin, 3 days ago

It's not a limit scenario, is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, and benchmarks are not always static either; they are often living benchmarks that update over time.

You are making a technical point. What I am pointing out is that while for _some_ benchmarks this is _technically_ possible, it's not possible for plenty of benchmarks, and those all agree with the others.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
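
That check can be sketched in a few lines. This is only an illustration, not any lab's or leaderboard's actual method: the score table, the `overfit_gap` and `flag_suspects` names, and the 0.2 threshold are all made up for the example.

```python
from statistics import mean

# Hypothetical scores in [0, 1]; none of these numbers come from a real eval.
SCORES = {
    "model_a": {
        "established": {"bench_1": 0.91, "bench_2": 0.89},
        "fresh": {"bench_new": 0.88},
    },
    "model_b": {
        "established": {"bench_1": 0.95, "bench_2": 0.93},
        "fresh": {"bench_new": 0.52},
    },
}


def overfit_gap(established, fresh):
    """Mean score on established benchmarks minus mean score on fresh ones."""
    return mean(established.values()) - mean(fresh.values())


def flag_suspects(scores, threshold=0.2):
    """Return the models whose gap exceeds the (arbitrary) threshold."""
    return sorted(
        name
        for name, s in scores.items()
        if overfit_gap(s["established"], s["fresh"]) > threshold
    )


print(flag_suspects(SCORES))
```

A large gap is only a hint, since a fresh benchmark can simply be harder, so a real analysis would compare the gap across many models rather than against a fixed cutoff.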

by teiferer, 4 days ago

> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.

by aspenmartin, 3 days ago

They don't have control over measurement. Consider also that it's easy to figure this out and it creates a scandal. Like I said, consider Llama 4, which a lot of people pointed out used a custom model in LMArena to inflate its scores; it's never been clear what the true underlying story is, but regardless, that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.

by bcrosby95, 4 days ago

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

by aspenmartin, 4 days ago

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

by joquarky, 4 days ago

Imagine unironically starting your comment with "Um" in 2026.

by jaapz, 4 days ago

As opposed to your incredibly useful contribution to this thread, thanks!

by aspenmartin, 4 days ago

You don't have to imagine!

by naikrovek, 4 days ago

ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
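
a rough sketch of that kind of harness, assuming a made-up prompt pool and stand-in model callables (nothing here calls a real API, and `prompt_for` and `compare` are names invented for the example). the prompt is picked deterministically from the date, so every model sees the same prompt on a given day and the prompt rotates over time:

```python
import datetime
import hashlib

# Hypothetical prompt pool; a real one would be larger and kept private.
PROMPTS = [
    "Write a function that reverses a linked list.",
    "Parse an ISO 8601 date without using a date library.",
    "Implement an LRU cache with O(1) get and put.",
]


def prompt_for(day, pool=PROMPTS):
    """Pick one prompt per calendar day, the same for every model."""
    digest = hashlib.sha256(day.isoformat().encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def compare(models, day, score):
    """Run every model on the day's prompt and score each output.

    models: {name: callable(prompt) -> output}
    score:  callable(prompt, output) -> number
    """
    prompt = prompt_for(day)
    return {name: score(prompt, run(prompt)) for name, run in models.items()}


if __name__ == "__main__":
    # Stand-in "models" and a toy scorer, purely to show the call shape.
    toy_models = {"echo": lambda p: p, "silent": lambda p: ""}
    print(compare(toy_models, datetime.date.today(), lambda p, out: len(out)))
```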

by aspenmartin, 4 days ago

You are literally describing a benchmark

by nahrin, 4 days ago

100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!

by p-e-w, 4 days ago

Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

by bluGill, 4 days ago

> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

by JadeNB, 4 days ago

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)

by cycomanic, 4 days ago

It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

by camdenreslink, 4 days ago

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

by bluGill, 4 days ago

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

by JadeNB, 4 days ago

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

by cycomanic, 4 days ago

Ah apologies, that's what I get for skim reading and kneejerk replying. I completely agree with you, undergrads are highly unlikely to know more about a subject than their professor (obviously there can always be exceptions).

by teiferer, 4 days ago

A grad student is evaluated on how well they follow scientific procedures, communicate their results, and whether they have a sufficiently broad knowledge foundation. All that can easily be verified by a professor in a related field since they are very experienced in all those things. They don't actually need to be experts in the specific narrow topic the student has become the world expert in.

by aspenmartin, 4 days ago

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

by Jensson, 4 days ago

> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, and they are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.

But the point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.

by aspenmartin, 4 days ago

Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
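
To make those questions concrete, here is a minimal sketch of a blind comparison, assuming you already have each model's output per task. The `blind_trial` and `run_benchmark` names and the toy judge are made up for illustration. The judge only ever sees anonymized texts in random order, and the model name is looked up after the pick.

```python
import random


def blind_trial(outputs, judge, rng):
    """Show the judge unlabeled outputs in random order; return the winner's name.

    outputs: {model_name: text}
    judge:   callable(list_of_texts) -> index of the preferred text
    """
    names = list(outputs)
    rng.shuffle(names)  # randomize which model is shown first
    texts = [outputs[name] for name in names]
    return names[judge(texts)]  # unblind only after the pick


def run_benchmark(tasks, judge, seed=0):
    """Count blind wins per model over a list of tasks."""
    rng = random.Random(seed)
    wins = {}
    for outputs in tasks:
        winner = blind_trial(outputs, judge, rng)
        wins[winner] = wins.get(winner, 0) + 1
    return wins


if __name__ == "__main__":
    # Toy judge that prefers the longest answer; a person would go here.
    def longest(texts):
        return max(range(len(texts)), key=lambda i: len(texts[i]))

    tasks = [
        {"model_a": "a detailed answer", "model_b": "short"},
        {"model_a": "another long answer", "model_b": "meh"},
    ]
    print(run_benchmark(tasks, longest))
```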

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?

by andai, 4 days ago

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)

by ishurand4, 4 days ago

The only one I see that thinks it is claude other than claude itself is the GLM series.

by throw10920, 4 days ago

I have screenshots of Deepseek V4 doing this too - in a non-Claude-Code harness.

by andai, 3 days ago

Also MiMo...

by Wowfunhappy, 4 days ago

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

by naikrovek, 4 days ago

> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.
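
as a sketch of what that looks like in practice: each definition of "better" becomes a named metric, and each person brings their own weights. the candidate names, metric names, and every number below are invented for illustration and say nothing about any real language.

```python
def weighted_score(metrics, weights):
    """Weighted average of the metrics named in `weights`."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


def best(candidates, weights):
    """Name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


# Hypothetical, normalized measurements in [0, 1].
CANDIDATES = {
    "lang_x": {"build_speed": 0.9, "safety": 0.6},
    "lang_y": {"build_speed": 0.4, "safety": 0.9},
}

# Two people with different priorities.
FAST_ITERATION = {"build_speed": 3, "safety": 1}
SAFETY_FIRST = {"build_speed": 1, "safety": 3}

print(best(CANDIDATES, FAST_ITERATION), best(CANDIDATES, SAFETY_FIRST))
```

same measurements, different weights, different winner, which is the whole point.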

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

by Wowfunhappy, 3 days ago

Don't you think this applies to LLMs too?

by tsss, 4 days ago

> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

by lukan, 4 days ago

So .. where can we read about the results?

by karunamurti, 4 days ago

ugghh, benchmarks?

by lukan, 3 days ago

Benchmarks about the superior programming language?

You mean benchmarks about the programming language that produces the fastest code?

That is not really the same.

by Certhas, 5 days ago

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

by johnisgood, 5 days ago

Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose projects matter to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

by lanstin, 4 days ago

"Check your work for mistakes after the first draft" maybe :)

by hardwaregeek, 4 days ago

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

by AlecSchueler, 4 days ago

No, relative performance between Python and Java can absolutely be measured.

by skywhopper, 4 days ago

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

by andai, 4 days ago

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

by bfrog, 4 days ago

How do you measure the performance of people? This is subjective and biased every time.

by stray, 4 days ago

I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

ymmv

by theshrike79, 4 days ago

Yes, words matter.

My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

AV professionals always say "timecode" - timestamp is a programming term.

Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".

by contextfree, 5 days ago

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

by contextfree, 3 days ago

Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

1. It's exponentially better

2. yet, somehow, hand coding still isn't dead, at least for me

by thewhitetulip, 5 days ago

How many $ do you guys spend when your session runs for 30min? What's the total budget?

by contextfree, 3 days ago

I just have a regular Claude subscription and keep within its usage limits

by thewhitetulip, 3 days ago

But isn't running Claude models for 30min expensive? Or is Claude Code not expensive?

I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though

by contextfree, 2 days ago

It's included free with subscription plans until June 22. I get about 2 hours a day of usage through Claude Code until I hit my usage limit. I just use it for 2 hours then wait for the next day.

by solumunus, 4 days ago

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

by ElFitz, 5 days ago

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

by farley13, 4 days ago

I think (related to the threads below) properly running evals on the state-of-the-art models is likely outside the budget for most individuals. It's undoubtedly the right thing.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

by ElFitz, 4 days ago

[dead]

by theshrike79, 4 days ago

IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get the feels of which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than others, but I wouldn't write backend code with it.

by locknitpicker, 4 days ago

> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
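
Those three metrics are simple to compute once runs are logged. A minimal sketch, assuming a made-up `Run` record and invented numbers (not measurements of any real model); cost and time are averaged over finished runs only, which is one choice among several:

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    finished: bool   # did the run meet the task's definition of done?
    cost_usd: float
    seconds: float


def summarize(runs):
    """Per model: completion rate, plus mean cost and time of finished runs."""
    grouped = {}
    for run in runs:
        grouped.setdefault(run.model, []).append(run)

    summary = {}
    for model, model_runs in grouped.items():
        done = [r for r in model_runs if r.finished]
        summary[model] = {
            "completion_rate": len(done) / len(model_runs),
            "mean_cost_usd": sum(r.cost_usd for r in done) / len(done) if done else None,
            "mean_seconds": sum(r.seconds for r in done) / len(done) if done else None,
        }
    return summary


# Hypothetical log of four runs.
RUNS = [
    Run("model_a", True, 0.50, 60.0),
    Run("model_a", False, 0.25, 30.0),
    Run("model_b", True, 1.00, 10.0),
    Run("model_b", True, 2.00, 30.0),
]

for model, stats in summarize(RUNS).items():
    print(model, stats)
```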

The ugly truth is that since the GPT4.1 days, new model releases have shown diminished returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.

by theshrike79, 3 days ago

"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

by locknitpicker, 3 days ago

> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.

by vonneumannstan, 4 days ago

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

by ivanovm, 4 days ago

The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins

by torginus, 4 days ago

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

by lqstuart, 4 days ago

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

by kmacdough, 5 days ago

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

by alecco, 4 days ago

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.

by simonw, 4 days ago

Anthropic didn't give me early access to this model, shouldn't that bias me against it?

by deagle50, 4 days ago

You kinda proved the point...

by simonw, 4 days ago

How?

by deagle50, 4 days ago

If you're that easily biased then why trust your assessment?

by simonw, 4 days ago

Where did I say I was biased?

by deagle50, 4 days ago

the hypothetical you presented above

by simonw, 4 days ago

It was a hypothetical. How does presenting a hypothetical equate to proving anyone's point here?

by deagle50, 4 days ago

you implied that not being given early access could bias you in the other direction. Which in my opinion would demonstrate that you are easily biased. Which would then call into question any opinion you share about the subject.

by simonw, 4 days ago

Someone accused me of being biased in favor of model providers who give me early access, after I praised Fable's performance.

I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"

I was explicitly pointing out that their failure to give me early access had not, in this case, led to me reviewing their model poorly.

I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.

by munksbeer, 4 days ago

Don't feed the trolls Simon.

by alias_neo, 4 days ago

This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison

by bigboggerlogins, 4 days ago

[dead]

by tezza, 5 days ago

[flagged]

by 1dom, 5 days ago

[flagged]

by tezza, 5 days ago

check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.

Fable just got announced and I did a rushed-out article because people are curious. I released the post mere hours afterwards, and it takes time to create the output, slice it into videos, and make a WordPress article, on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links, my previous articles have all the juicy stuff you are criticising me for not having in this one, which I put together with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours' advance warning.

reply

upvote

by 1dom 5 days ago|

[-]

I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.

reply

upvote

by tezza 5 days ago|

[-]

There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast are still possible.

reply

upvote

by ksec 3 days ago|

[-]

My good lord, Tezza. You still have a calm and composed response after that sort of insult being thrown at you. Haven't seen one this bad for quite some time on HN. I hope you have a great day.

reply

upvote

by lionkor 5 days ago|

[-]

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

reply

upvote

by user43928 5 days ago|

[-]

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.

reply

upvote

by 1dom 4 days ago|

[-]

I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.

AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.

It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.

reply

upvote

by user43928 4 days ago|

[-]

I found the website you ranted about interesting, comparing the quality of the visualization between the different models.

I don't think it was "a huge waste of time" or needed your rant.

You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.

What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.

reply

upvote

by 1dom 4 days ago|

[-]

This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.

I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.

I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.

The issue here, I feel, is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.

reply

upvote

by jgilias 4 days ago|

[-]

Oh boy. I see this so much.

reply

upvote

by bigboggerlogins 4 days ago|

[-]

[dead]

reply

upvote

by throw10920 4 days ago|

[-]

> It reads like an unhinged rant about AI

> if one cannot express themselves civilly

It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?

> they have permission to insult someone's competence and work

If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.

reply

upvote

by user43928 3 days ago|

[-]

You think it was civil when the comment started with:

> this post gets me irrationally irritated and makes me want to shake you and shout

Yes, criticism of my work would not generally be a personal insult.

However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.

reply

upvote

by throw10920 3 days ago|

[-]

> You think it was civil when the comment started with:

>> this post gets me irrationally irritated and makes me want to shake you and shout

Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?

> However, if you were to call my work 'slop'

And, as previously established, if you use AI, it's not your work.

> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level

...and those are still criticisms of your work, not yourself.

The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.

reply

upvote

by leodavi 4 days ago|

[-]

How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?

reply

upvote

by 1dom 4 days ago|

[-]

simonw's pelicans probably wouldn't get posted in response to a request for a more quantitative analysis.

You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just that a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.

reply

upvote

by bigboggerlogins 4 days ago|

[-]

[dead]

reply

upvote

by thewhitetulip 5 days ago|

[-]

It feels like hand-written software will now be "bespoke"

reply

upvote

by disgruntledphd2 4 days ago|

[-]

artisanal, hand-crafted software.

reply