But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the latter is how real work will happen.
It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.
Sometimes it feels like it's not dissimilar to spending 4 hours to automate a 10 minute task that I thought I'll need forever but ended up just using it once in the past 5 months. But sometimes I unlock something that saves a huge amount of time, and can be reused in many steps of other projects.