by gchamonlive 6 hours ago

by Retr0id 5 hours ago

You mean the Claude output? The same Claude that has "regressed to the point it cannot be trusted"?

by gchamonlive 4 hours ago

What, are you saying the OP fabricated/hallucinated the evidence?

by Retr0id 4 hours ago

I'm just saying it's epistemically unrigorous to the point of being equivalent to anecdata.

by gchamonlive 4 hours ago

How should one conduct such a rigorously reproducible experiment when LLMs are by nature non-deterministic and you no longer have access to the model from months ago that you are comparing against?

by Retr0id 4 hours ago

Something like this: https://marginlab.ai/trackers/claude-code/ (see methodology section)
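To make the general idea concrete: since single runs are noisy, one common approach is to run the same fixed task set many times and compare pass rates statistically rather than eyeballing individual outputs. The sketch below is not the linked tracker's method (I'm not claiming to reproduce it); it is a plain two-proportion z-test, and the function name and the pass counts in the usage line are made up for illustration.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for H0: both periods share one pass rate.

    pass_a / n_a: passes and total runs in the baseline period (hypothetical).
    pass_b / n_b: passes and total runs in the recent period (hypothetical).
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis of no change.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every run passed or every run failed in both periods: no evidence of change.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only: 290/500 passes in the baseline vs 250/500 recently.
z, p = two_proportion_z_test(290, 500, 250, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive z here means the recent pass rate is lower than the baseline. The point is that a claim of regression comes with a sample size and an error estimate, which anecdotes don't have.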

by gchamonlive 3 hours ago

Kudos for the methodology. The only question I can come up with is whether the benchmarks are representative of daily use.

Anecdotal or not, we see enough reports popping up to at least raise some suspicion of service degradation that isn't shown in the charts. One hypothesis is that the degradation experienced by users, assuming there is merit in the anecdotes, isn't picked up by the kind of tracking strategy used.

by Retr0id 3 hours ago

It's not my methodology, to be clear, but they have picked up actual regressions that happened in the past, e.g. https://news.ycombinator.com/item?id=46815013

by 4 hours ago|

deleted