upvote
Interesting idea. The metric I'd intuitively want to see is low variance between harnesses for a smarter model. But if a large sample of models statistically outperformed with a certain harness, that's indeed a valuable signal for a developer.
reply