Meta made a bunch of mistakes, and it looks like Zuckerberg spent a lot of money on talent and took big swings to turn things around (that happened about a year ago).

I think it’s unrealistic to expect them to climb from that pit back to the top in one year, but I wouldn’t rule out them getting there with more time. That’s a possible future. They have the money and Zuckerberg’s drive at the helm. That can go a long way.

by solenoid0937 5 hours ago

It's benchmaxxed.

If they actually matched Opus 4.6 on such a short timeline, it would have been mighty impressive. (Keep in mind this is a new lab and they are prohibited from doing distills.)

by throwaw12 5 hours ago

how do you know it's benchmaxxed?

by solenoid0937 5 hours ago

Friends at Meta with access to the model + personal experience at Meta.

Meta's performance process is essentially "show good numbers or you're out." So guess what people do when they don't have good numbers? They fudge them. Happens all across the company.

by luma 4 hours ago

For one, they aren't using the latest version of many of the benchmarks, e.g., ARC-AGI 2 and not 3.

by prodigycorp 5 hours ago

Meta's benchmaxxing tendencies are well known. Llama 4 was mega benchmaxxed, and nothing suggests to me that Meta's culture has changed.

by spindump8930 3 hours ago

Re: changes, there's been enormous turnover in AI organizations, and in theory this one was developed by a "new" org. Whether that means less or more benchmaxxing is anyone's guess.

by coffeebeqn 5 hours ago

Matching Opus 4.6 would be pretty good? It’s the SOTA among models that are actually available.

by reissbaker 4 hours ago

Muse Spark doesn't even match GLM-5.1 on most benchmarks. And GLM is open source!

by impulser_ 5 hours ago

It's not even on par with Sonnet. It's on par with open source models, yet it's not even open source and sits behind a private preview API.

Might as well not release anything.

by CuriouslyC 2 hours ago

Mostly, Anthropic has just been focused on coding/terminal work for longer, and their pro tier model is coding focused, unlike the GPT and Gemini pro tier models, which have been optimized for science.

Their whole "training the LLM to be a person" technique probably contributes to its pleasant conversational behavior, makes its refusals less annoying (GPT 5.2+ got obnoxiously aligned), and also adds a bit to its greater autonomy.

Overall they don't have any real moat, but they are more focused than their competition (and their marketing team is slaying).

by zozbot234 1 hour ago

Autonomy for agentic workflows has nothing to do with "replying more like a person"; you have to refine the model for it quite specifically. All the large players are trying to do that, so it's not really specific to Anthropic. It may be true, however, that their stronger focus on a "Constitutional AI"/RLAIF approach makes it a bit easier to align the model to desirable outcomes when acting agentically.

by wotsdat 5 hours ago

[dead]

by username223 5 hours ago

Facebook is working with the talent that can’t find a job at some other company. It doesn’t surprise me they ship mediocrity.

by zozbot234 5 hours ago

> has some secret sauce

Yup, it's called test-time compute. Mythos is described as plenty slower than Opus, enough to seriously annoy users trying to use it for quick-feedback-loop agentic work. It is most properly compared with GPT Pro, Gemini DeepThink or this latest model's "Contemplating" mode. Otherwise you're just not comparing like for like.

by throwaw12 5 hours ago

> it's called test-time compute.

Why can't others easily replicate it?

by coder68 5 hours ago

I have not delved into the theory yet, but it seems that the smaller open-source models already do this to an extent. They have fewer parameters but spend much more time/tokens reasoning, as a way to close the performance gap. If you look at "tokens per problem" on https://swe-rebench.com/ that seems to be the case, at least.
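
To make the "spend more tokens to close the gap" idea concrete, here is a toy sketch of one simple form of test-time compute: majority voting over repeated samples. This is a swapped-in illustration, not how Mythos or any model mentioned in this thread actually works. It assumes independent samples, a problem with only two possible answers, and a made-up per-sample accuracy; the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Chance that a strict majority of k independent samples is correct.

    p is the (assumed) probability that a single sample is correct.
    k must be odd so that a two-answer vote can never tie.
    """
    if k % 2 == 0:
        raise ValueError("use an odd k so there are no ties")
    needed = k // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(needed, k + 1)
    )


if __name__ == "__main__":
    # Token cost grows roughly linearly with k, since each sample is a full answer.
    for k in (1, 5, 25):
        print(f"samples={k:2d}  accuracy={majority_vote_accuracy(0.6, k):.3f}")
```

Under these assumptions, a weaker sampler gets more accurate as k grows, but only by paying for k full answers per problem, which is the kind of trade-off the "tokens per problem" column hints at.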