The most exciting part isn't necessarily the ceiling rising, though that's happening too, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices/latency is what actually unlocks agentic workflows. We are effectively getting the same intelligence unit for half the compute every 6-9 months.
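Back-of-the-envelope on what that rate implies per year (pure arithmetic; the 6-9 month halving figure is the claim above, not a measured number):

    # Yearly cost multiplier if compute cost halves every h months.
    for h in (6, 9):
        print(h, round(0.5 ** (12 / h), 2))  # 6 -> 0.25 (4x/yr), 9 -> 0.4 (~2.5x/yr)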
reply
> We are effectively getting the same intelligence unit for half the compute every 6-9 months.

Something something ... Altman's law? Amodei's law?

Needs a name.

reply
How about More's law - because we keep getting "more" compute at a lower cost?
reply
Moore's law lives on!
reply
This is what excited me about Sonnet 4.6. I've been running Opus 4.6, and switched over to Sonnet 4.6 today to see if I could notice a difference. So far, I can't detect much if any difference, but it doesn't hit my usage quota as hard.
reply
> The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

Yeah, but RAM prices are also back to 1990s levels.

reply
You wouldn't download a RAM
reply
I knew I'd been keeping all my old RAM sticks for a reason!
reply
simonw hasn't shown up yet, so here's my "Generate an SVG of a pelican riding a bicycle"

https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...

reply
We finally have AI safety solved! Look at that helmet
reply
"Look ma, no wings!"

:D

reply
For comparison, I think the current leader in pelican drawing is Gemini 3 Deep Think:

https://bsky.app/profile/simonwillison.net/post/3meolxx5s722...

reply
My take (also Gemini 3 Deep Think): https://gemini.google.com/share/12e672dd39b7

Somehow it's much better now.

reply
I’m not familiar with Gemini, but isn’t this just a diffusion model output? The Pelican test is for the LLM to produce SVG markup.
reply
Yeah, I was so amazed by the result that I didn't even realize Gemini had used Nano Banana to produce it.
reply
Is that actually better? That pelican has arms sprouting out of its wings
reply
If they want to prove the model's performance, the bike clearly needs aero bars.
reply
Can’t beat Gemini’s, which was basically perfect.
reply
I sent Opus a satellite photo of NYC at night and it described "blue skies and cliffs/shoreline"... Mistral did it better; specific use case, but yeah. OpenAI just said "you can't submit a photo by URL". I was going to try Gemini but it kept bringing up Vertex AI. This is all via LangChain.
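For context, the call looked roughly like this (a minimal sketch; the model id and URL are placeholders, and whether a plain http URL is accepted directly depends on the provider integration):

    from langchain_anthropic import ChatAnthropic
    from langchain_core.messages import HumanMessage

    llm = ChatAnthropic(model="claude-opus-4-6")  # placeholder model id
    msg = HumanMessage(content=[
        {"type": "text", "text": "Describe this satellite photo."},
        # OpenAI-style image_url block; some providers want base64 instead.
        {"type": "image_url", "image_url": {"url": "https://example.com/nyc-night.jpg"}},
    ])
    print(llm.invoke([msg]).content)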
reply
The system card even says that Sonnet 4.6 is better than Opus 4.6 in some cases: Office tasks and financial analysis.
reply
We see the same with Google's Flash models. It's easier to make a small capable model when you have a large model to start from.
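Presumably that means some form of distillation. Here's a minimal sketch of the classic soft-target KL loss (Hinton-style distillation, not anything Google has confirmed using):

    import numpy as np

    def softmax(z, T=1.0):
        z = np.asarray(z, dtype=float) / T
        e = np.exp(z - z.max())
        return e / e.sum()

    # The small "student" is trained to match the big "teacher" model's
    # softened output distribution, on top of the usual training loss.
    def distill_loss(teacher_logits, student_logits, T=2.0):
        p = softmax(teacher_logits, T)  # teacher soft targets
        q = softmax(student_logits, T)  # student predictions
        return T * T * float(np.sum(p * (np.log(p) - np.log(q))))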
reply
Flash models are nowhere near Pro models in daily use: much higher hallucination rates, and it's easy to get into a death spiral of failed tool calls and never come out.

You should always take claims that smaller models are as capable as larger models with a grain of salt.

reply
Flash model n is generally a slightly better Pro model (n-1), in other words you get to use the previously premium model as a cheaper/faster version. That has value.
reply
They do have value, because they are much much cheaper.

But no, 3.0 Flash is not as good as 2.5 Pro. I use both of them extensively, especially for translation. 3.0 Flash will confidently mistranslate certain things, while 2.5 Pro will not.

reply
Totally fair. Translation is one of those specific domains where model size correlates directly with quality, and no amount of architectural efficiency can fully replace parameter count.
reply
Given that users preferred it to Sonnet 4.5 "only" in 70% of cases (according to their blog post), I highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.
reply
For cases where 4.5 already met the bar, I would expect 50% preference each way. This makes it kind of hard to make any sense of that number, without a bunch more details.
reply
Good point. So much functionality gets commoditized that we have to move the goalposts more or less constantly.
reply
Why is it wild that an LLM is as capable as a previously released LLM?
reply
Opus is supposed to be the expensive-but-quality one, while Sonnet is the cheaper one.

So if you don't want to pay the significant premium for Opus, it seems like you can just wait a few weeks till Sonnet catches up

reply
Strangely enough, my first test with Sonnet 4.6 via the API, for a relatively simple request, was more expensive ($0.11) than my average request to Opus 4.6 (~$0.07), because it used far more tokens than I would consider necessary for the prompt.
reply
This is an interesting trend with recent models: the smarter ones get away with far fewer thinking tokens, partially or even fully negating the speed/price advantage of the smaller models.
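Quick illustration of the effect (token counts and prices are made up for the example; check the actual pricing page):

    # (input, output) prices in $ per million tokens -- illustrative only.
    PRICES = {"sonnet": (3.00, 15.00), "opus": (5.00, 25.00)}

    def cost(model, in_tok, out_tok):
        p_in, p_out = PRICES[model]
        return (in_tok * p_in + out_tok * p_out) / 1e6

    print(cost("sonnet", 2_000, 6_000))  # verbose smaller model: ~$0.10
    print(cost("opus", 2_000, 2_000))    # terse bigger model:    ~$0.06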
reply
Okay, thanks. Hard to keep all these names apart.

I'm even surprised people pay more money for some models than others.

reply
Because Opus 4.5 was released like a month ago and was state of the art, and now a significantly faster and cheaper version is already comparable.
reply
"Faster" is also a good point. I'm using different models via GitHub copilot and find the better, more accurate models way to slow.
reply
Opus 4.5 was November, but your point stands.
reply
Fair. Feels like a month!
reply
It means the price has decreased 3x in a few months.
reply
Because Opus 4.5 inference is/was more expensive.
reply