> The first thing I do with new agent driven project is set up quality checks. Linters, test frameworks, static analysis, etc

I do this too, but then I sit and observe how the agent gets very creative about working around all of these layers just to get to the finish line faster.

Say, for example, I needlessly pass a mutable reference and the linter screams at me. I know that either the linter is wrong in this case, or I should listen to it and change the signature. If I make the lazy choice, I'll be dissatisfied with myself; I might even get scolded, or fired if I keep making lazy choices.
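A minimal Python sketch of that choice (the mutable-reference example above is Rust-flavored; here a mutable-default-argument warning stands in for it, and all names are hypothetical):

```python
# Lazy choice: suppress the warning and keep the dangerous signature.
def add_item_lazy(item, items=[]):  # noqa: B006  -- warning silenced
    items.append(item)
    return items

# Honest choice: change the signature so the warning goes away for real.
def add_item_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```

Both versions pass the linter now, but only the second one fixed anything: the lazy version still shares one list across every call that omits `items`.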

An LLM doesn't get these feelings.

An LLM will almost always go for silencing the warning, because it stands between the model and the 'reward'. If you put up guardrails so the LLM isn't allowed to silence anything, then you get things like 'ok, I'll just do foo.accessed = 1 to satisfy the linter'.
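The two escape hatches look something like this in Python (a hypothetical sketch; `expensive_computation` and the function names are invented, and `F841` is the unused-variable rule code used by flake8/Ruff):

```python
def expensive_computation(data):
    return len(data)  # stand-in for real work

# The linter flags 'result' as an unused variable here.
# Escape hatch 1: suppress the warning outright.
def process_silenced(data):
    result = expensive_computation(data)  # noqa: F841
    return data

# Escape hatch 2: fake a use -- the moral equivalent of 'foo.accessed = 1'.
def process_fake_use(data):
    result = expensive_computation(data)
    _ = result  # reads the variable purely to appease the linter
    return data
```

Either way the underlying question (why compute a result nothing consumes?) goes unanswered; the checker is satisfied and the smell survives.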

Same story with tests. Who decides when it's the test that should be changed/deleted or the implementation?

reply
> Same story with tests. Who decides when it's the test that should be changed/deleted or the implementation?

Claude is remarkably good at figuring this out. I asked it to look at a failing test in a large, messy Python codebase. It found the root cause, asked whether the failure was a regression or an under-specified test, performed its own investigation, and found that the test harness was missing mocks that the bug fix exposed.

It has become amazingly good at investigating.
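The failure mode described above can be reconstructed roughly like this (a hypothetical sketch, not the actual codebase: `PaymentGateway`, `checkout`, and the zero-amount bug are all invented for illustration):

```python
from unittest import mock

class PaymentGateway:
    """Assumed external dependency that can't be hit in tests."""
    def charge(self, amount):
        raise RuntimeError("no network in tests")

def checkout(gateway, amount):
    # Before the fix, a bug skipped the charge for amount == 0;
    # the fix routes every amount through the gateway -- which means
    # a test that never mocked the gateway suddenly starts failing.
    return gateway.charge(amount)

# The right response is to add the missing mock, not revert the fix.
def test_checkout_charges_gateway():
    gateway = mock.Mock(spec=PaymentGateway)
    gateway.charge.return_value = "ok"
    assert checkout(gateway, 0) == "ok"
    gateway.charge.assert_called_once_with(0)
```

The point is that the test was failing for harness reasons, not code reasons, and telling those apart is exactly the investigation the parent is asking about.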

reply
Generated tests... I mean... listen to yourself.

I can generate a lot of tests amounting to assert(true). Yeah, LLM-generated tests aren't quite that simplistic, but are you checking that all the tests actually make sense and test anything useful? If no, those tests are useless. If yes, I don't actually believe you.
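For a concrete sense of the difference (a hypothetical example; `slugify` and both tests are invented), both of these tests pass today, but only one would ever catch a regression:

```python
def slugify(title):
    return title.lower().replace(" ", "-")

# Vacuous: exercises the code, asserts nothing about the result.
def test_slugify_vacuous():
    slugify("Hello World")
    assert True

# Meaningful: fails if the behavior changes.
def test_slugify_meaningful():
    assert slugify("Hello World") == "hello-world"
```

A coverage report counts both the same way, which is exactly why coverage alone can't tell you whether generated tests are worth anything.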

It's the typical pattern: a 10-line diff gets scrutinized to death, while a 1000-line diff gets an instant LGTM.

Pay attention to YOUR OWN incentives.

reply