1000% this. This was us internally testing whether our harness worked; the motivation was never to test them in depth 1v1. We were just really shocked at the results, and there's a lot more work to do here.
Can you run Claude Opus through the same Pydantic harness and add the cost to the benchmark result table? An isolated price is meaningless.