Aren't there benchmarks that measure at the harness level as well?
How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.