But very recently (I don't know if it was April or May), the coding harnesses paired with decent SOTA models like Opus 4.8/GPT 5.5 started showing so much more consistency, completeness, and sometimes downright clever behavior that they became way more useful.
Just one out of a hundred-plus examples: I gave Claude Code (Opus 4.8 High) a complex task that involved Consul and Vault, but I had neglected to give it sandbox permission to download from hashicorp.com. So it created an entire test harness that simulated the behavior of both Vault and Consul, created all its test cases, verified that they passed, and when I came back 40 minutes later said that it was all done.
Its test harness simulated the behavior of Vault/Consul so accurately that on the first try, with no refactoring whatsoever, all of the protobuf/AESGCM/API behavior (which has varied significantly between versions) worked.
This was something that would have taken me, someone super familiar with the code, tools, and APIs, a minimum of 3 solid days of work, and that would likely have involved hundreds of attempts and refactors as I unwound all the weird encryption and packaging layers. It zero-shotted a full solution without having an API to test against.
If these agents have an actual test harness, it's honestly hard to imagine what they can't do; at this point they're limited only by imagination and budget.
Speaking personally, something changed between January and, let's say, May: instead of seeing these things as mostly interesting technology demonstrations in which the flaws outweighed the benefits, I now genuinely think they are the future of programming. I'm dubious that I'll write much software manually in the future, beyond what I do for personal pleasure.