Interesting, a little different from this other site I saw on HN this week:

https://marginlab.ai/trackers/claude-code

reply
I would love to see a stable test over time with a held-out set of easy/medium/hard challenges. Like many others, I've noticed a large drop in recent performance with Claude Opus (and Sonnet), and more sites like these would hold the labs accountable for sneaky backend changes that nerf performance.
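
A minimal sketch of what I mean, in Python; run_model, the prompts, and the checks are all placeholders for your own harness, not anyone's actual setup:

    import json
    import time
    from collections import defaultdict

    # Held-out set: never published, never rotated, tagged by difficulty.
    # The prompts/checks here are dummies; real ones would be your own.
    CHALLENGES = [
        {"id": "c1", "tier": "easy",   "prompt": "...", "check": lambda out: "42" in out},
        {"id": "c2", "tier": "medium", "prompt": "...", "check": lambda out: out.strip().endswith("OK")},
    ]

    def run_model(prompt: str) -> str:
        # Placeholder: plug in your actual API / agent harness call.
        raise NotImplementedError

    def snapshot() -> dict:
        passes, totals = defaultdict(int), defaultdict(int)
        for c in CHALLENGES:
            totals[c["tier"]] += 1
            try:
                if c["check"](run_model(c["prompt"])):
                    passes[c["tier"]] += 1
            except Exception:
                pass  # errors count as failures
        return {"ts": time.time(),
                "pass_rate": {t: passes[t] / totals[t] for t in totals}}

    if __name__ == "__main__":
        # Append one snapshot per scheduled run; plot/diff the series
        # over time to catch silent regressions.
        with open("model_snapshots.jsonl", "a") as f:
            f.write(json.dumps(snapshot()) + "\n")

The key part is that the challenge set stays frozen and private, so a pass-rate drop between snapshots points at the model/backend, not at the eval.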
reply
Working on something similar to evaluate model performance over time using tasks based on your own code. Obviously this is still susceptible to the same hacking mechanics documented here, but at a local level it's easier to detect and fix, and it should give a stronger signal of your subjective harness/agent/context performance than these large generic benchmarks do. Roughly like the sketch below.
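
Rough shape of the idea (a sketch under assumptions: ask_model stands in for your harness call, and grading reuses your project's own pytest suite):

    import ast
    import pathlib
    import subprocess

    def make_task(path: str, func_name: str) -> tuple[str, str] | None:
        """Return (prompt, original_source) for one function, or None if not found."""
        src = pathlib.Path(path).read_text()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                original = ast.get_source_segment(src, node)
                prompt = (
                    f"Re-implement `{func_name}` from {path}.\n"
                    f"Signature: def {func_name}({ast.unparse(node.args)})\n"
                    f"Docstring: {ast.get_docstring(node) or '(none)'}"
                )
                return prompt, original
        return None

    def ask_model(prompt: str) -> str:
        # Placeholder: your model/agent call goes here.
        raise NotImplementedError

    def grade(path: str, func_name: str, candidate: str) -> bool:
        """Swap the candidate implementation in and run the existing test suite."""
        task = make_task(path, func_name)
        if task is None:
            return False
        _, original = task
        src = pathlib.Path(path).read_text()
        pathlib.Path(path).write_text(src.replace(original, candidate))
        try:
            return subprocess.run(["pytest", "-q"]).returncode == 0
        finally:
            pathlib.Path(path).write_text(src)  # restore the original file

Because the tasks come from your repo and the grader is your test suite, a regression shows up as a drop on code you actually care about.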

Also, I keep hearing complaints that Opus has been nerfed, so IMO it's nice to have objective data to back that up. I feel like half of the nerfing complaints are people getting past the honeymoon phase...

reply