I've found the current cream of the crop to be quite good at resource management. I've sicced Opus on some very gnarly lambda context bugs, and it has substantially improved the stability of the product I'm working on right now. It couldn't quite do it entirely by itself, but with the right nudges here and there, it has absolutely accelerated the debugging work. It is particularly good at analyzing crashes and doing the detective work of figuring out which preconditions must exist for certain crashes to occur.
I think my problem is that I’m not sure I understand whether your evals are testing language abilities or reasoning abilities.

It seems to present results as if they’re testing language abilities, but the problems seem to be reasoning problems.
