We shouldn't just measure the power of the raw LLM; harnesses matter more and more.

It's like taking the engine out of each car, putting it on a test bed, running it, and then deciding whether the car is good or bad based on the graphs the test bed produced.

You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

by sanderjd 5 hours ago

Aren't there benchmarks that measure at the harness level as well?

by theshrike79 42 minutes ago

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.

by gbalduzzi 13 hours ago

Following the original comment's reasoning: if every model requires a different prompting technique to maximize its output, how can a benchmark that sends the same prompt to all models be accurate? We could write a different prompt for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve.
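
To make the dilemma concrete, here is a minimal sketch. Everything in it is assumed for illustration: the model names, the prompt styles, and the scores are invented toy values, not measurements of any real model. It only shows how the ranking can flip depending on whether every model gets the same prompt or its own best prompt.

```python
# Toy scores, invented purely for illustration: TOY_SCORES[model][prompt_style].
# "model_a", "model_b", "terse" and "step_by_step" are hypothetical names.
TOY_SCORES = {
    "model_a": {"terse": 0.70, "step_by_step": 0.60},
    "model_b": {"terse": 0.55, "step_by_step": 0.80},
}


def rank_with_shared_prompt(scores, style):
    """Rank models when every model receives the same prompt style."""
    return sorted(scores, key=lambda model: scores[model][style], reverse=True)


def rank_with_tuned_prompts(scores):
    """Rank models when each model is scored on its own best prompt style."""
    return sorted(scores, key=lambda model: max(scores[model].values()), reverse=True)


shared_ranking = rank_with_shared_prompt(TOY_SCORES, "terse")
tuned_ranking = rank_with_tuned_prompts(TOY_SCORES)

print("same prompt for all:", shared_ranking)
print("best prompt per model:", tuned_ranking)
```

With these made-up numbers the two rankings disagree, and neither is obviously the fair one: the shared prompt favors whichever model happens to like that style, while the tuned version depends on how much effort went into tuning each model's prompt.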

by Wowfunhappy 6 hours ago

I'm not GP, but yes, I think it's impossible.

Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?

Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.

We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.

(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)

by Forgeties79 9 hours ago

The reason we can’t capture it empirically is that nobody truly knows exactly what we are supposed to be using these tools for or how they are going to operate. We are still fitting square pegs into round holes with them. We are told to treat them like bespoke tools for coding, shopping, tech support, etc., but they are not actually purpose-built for any of these things.

When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs.
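
As a rough sketch of that contrast, the snippet below runs the same input 10,000 times through a deterministic function and through a sampling-based stand-in, then reports how often each agrees with its own most common answer. The stand-in is not a real LLM: `sampled_tool` and its 97% "correct" rate are assumptions picked only to illustrate the idea.

```python
import random
from collections import Counter


def calculator(expression: str) -> str:
    """Deterministic: the same input always yields the same output."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


def sampled_tool(expression: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a sampling-based tool (not a real LLM).

    The 0.97 rate is an arbitrary illustrative value, not a measurement.
    """
    correct = calculator(expression)
    return correct if rng.random() < 0.97 else str(int(correct) + 1)


def agreement_rate(outputs):
    """Fraction of runs that match the most common output."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


RUNS = 10_000
rng = random.Random(0)  # fixed seed so the sketch is repeatable

calc_outputs = [calculator("8+8") for _ in range(RUNS)]
tool_outputs = [sampled_tool("8+8", rng) for _ in range(RUNS)]

print("calculator agreement:", agreement_rate(calc_outputs))
print("sampled tool agreement:", agreement_rate(tool_outputs))
```

The calculator's agreement rate is exactly 1.0 by construction. For the sampled stand-in it falls below 1.0, and that gap is the part a single benchmark run does not show.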