Personally what I've found that has made coding agents more and more useful over the last year is that they have gotten a higher and higher floor, not that they have gotten a higher and higher ceiling. They were already plenty smart a year ago, it was just that they failed so often and so spectacularly that it made them a liability. Now they have become much more reliable, which is the key thing that has transitioned them into being actually useful. For the most part I don't use them to work on really intellectually difficult tasks. I mostly use them to work on very boring and labor intensive tasks. Most commercial software development work is just boring drudgery like this. Certainly the bulk of what I need them for is. I need them to just not crap their pants all the time while they're at it.
So I'm kinda wary seeing the poor reliability of Kimi.
DNP = Did not participate
In this regard, kimi got more and better medals than Claude.