by goyozi 13 hours ago

by Eridrus 3 hours ago

Nobody releases numbers that show them to be worse than competitors lol.

This even applies to OpenAI & Anthropic who don't even eval on the same datasets a lot of the time.

by NiloCK 11 hours ago

I find it forgivable if it's within a minor version bump. (NB that x.5 is now a de facto major-version bump for LLMs for whatever reason).

Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.

by Aurornis 11 hours ago

I think the argument is that they're trying to suggest that they’re close to N months from SOTA.

Realistically I assume they hope readers don’t notice the fine details.

The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.

by epolanski 11 hours ago

> Realistically I assume they hope readers don’t notice the fine details.

The pool of people reading such articles while ignoring such details can't be big.

by Aurornis 11 hours ago

I disagree. Most people skim articles rather than read them deeply.

On Hacker News I wonder whether most people even open the article at all most of the time.

by hadlock 7 hours ago

Slashdot coined RTFA in the 90s; what you're suggesting isn't a new concept by any measure.

Edit: which itself is a modification of RTFM from Usenet.

by htrp 12 hours ago

I think it's part of the expectation setting (with a side of "we did our distillation/eval harness on a specific model").

If they say it's comparable to 4.7, it anchors that in your head as the model to evaluate against.

by beydogan 11 hours ago

Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.

by hypercube33 10 hours ago

4.6 was an awful experience the month I used it right after launch: it didn't ask anything, just made assumptions and went on its merry way. 4.5 and 4.7 don't do that for me, but 4.7 eats my quota for breakfast, so I've been avoiding it because I like to have it for more than an hour a day.

by goyozi 10 hours ago

I feel like I had both the best and the worst experience on 4.6 within about a month. Initially when it came out, it seemed to ask good questions and genuinely do well on complex tasks. From about mid-March it was absolutely abysmal: it seemed to assume the stupidest answer/angle for everything and make weird mistakes. 4.7 seems decent so far but usage hurts. At some point my company switched me to a standard seat and I used up 80% of my session usage in 1 prompt. I have since got my premium seat back, but I think pro/standard plan + Opus 4.7 is unusable for daily driving.

by verdverm 9 hours ago

That experience is also likely tied to the Claude harness around the model, which is not as well tuned right after a model release. They iterate on this, and different models need different words (unfortunately...).

by hmokiguess 12 hours ago

This puzzles me too; I want to know.

by maelito 12 hours ago

Marketing.
