Hacker News
Submitted by jwpapi 8 hours ago
SatvikBeri 6 hours ago
We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus:
https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...
Models have improved dramatically even with the same harness.