The catch is that token spend and quality aren't correlated the way you'd expect. Low-spend months, when I'm directing carefully and reviewing every diff, tend to produce better code than high-spend months, when I'm letting agents run longer chains. The expensive runs generate more code, not necessarily better code.
Jensen's $250k figure only makes sense if you're running dozens of parallel agents continuously. Most engineers are doing something more like augmented pairing. The unit economics are actually pretty good at $100-200/month per person. Beyond that you're hitting diminishing returns unless you've built actual agent infrastructure to parallelize and verify the work.
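To make that gap concrete, here's a rough back-of-envelope. The $150/month per-agent cost is an illustrative midpoint of the $100-200 range above, not a measured figure:

```python
# Back-of-envelope: what a $250k/year per-engineer budget implies.
# All per-agent costs are illustrative assumptions, not measured data.

ANNUAL_BUDGET = 250_000                    # the $250k/year figure
MONTHLY_BUDGET = ANNUAL_BUDGET / 12        # ~$20.8k/month

AGENT_MONTH_COST = 150                     # assumed: midpoint of $100-200/month

# Number of continuously running agents that budget implies per engineer:
implied_agents = MONTHLY_BUDGET / AGENT_MONTH_COST

print(f"Monthly budget: ${MONTHLY_BUDGET:,.0f}")
print(f"Implied parallel agents at ${AGENT_MONTH_COST}/mo each: {implied_agents:.0f}")
```

At those assumptions the budget only pencils out if each engineer keeps on the order of a hundred-plus agents running around the clock, which is exactly the parallelization-and-verification infrastructure most teams don't have.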