Stockfish is a machine learning system, so it seems quite plausible you're getting slapped with the silent performance degradation (https://news.ycombinator.com/item?id=48467896).

by redox99 5 days ago

Them silently nerfing the model without telling you, and still fully charging for it, is a new low and should probably be illegal.

by NoahZuniga 5 days ago

Well, they're not fully charging you. You get Opus 4.8 pricing when it falls back to Opus 4.8. Also, you can disable it (and it seems to be off by default in the API).

by LiamPowell 4 days ago

They don't fall back to Opus if their classifier thinks you might be working on anything that might be a competitor's product. It silently injects instructions into the prompt to sabotage your work. Read the policy above; it's insane to me that they're publicly admitting to this.

by xiphias2 5 days ago

Not for machine learning, just for security bug finding and biology

by taurath 5 days ago

Doesn't this "silent degradation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.

by lionkor 5 days ago

Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.

I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.

by janalsncm 5 days ago

I don’t think so? They can claim it was an act of God for all I care, but at the end of the day the model failed the task.

by anematode 5 days ago

Yup, I suspect that's what's going on

by dakolli 5 days ago

I suspect it just sucks, these models aren't useful. Stop lying to yourself.

by komali2 5 days ago

No, since it's a silent failure, it's not plausible. We have to assume all results we get are the actual model performance, because that is the actual model performance as we understand it.

Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.

by janalsncm 5 days ago

It’s possible this is happening at a technical level, but I have a hard time believing this is in the spirit of what Anthropic intends to throttle. It isn’t chip design or building out a competitor to Claude.

Stockfish does use neural nets but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.

by wgd 5 days ago

Yeah I agree this is probably outside of the intended scope of the silent sabotage mechanism, but there are plenty of reports of the "loud" safety classifier misfiring on innocuous requests and I'm not going to assume the silent failure mode is _less_ prone to false positives.

by anematode 4 days ago

Edit: Another developer seems to have found a legitimate speedup with Fable in an optimization loop. It's a nice idea, actually, and I'm duly impressed.