I think competition will get fierce. Many people are attracted to the price stability of GHCP: it was clear what a request could do. The problem is that the pricing didn't match results with cost. With Claude Code, it's not clear what a 5-hour usage window can do.
There's no reason the harness couldn't provide a quote for the next request, aside from the fact that it takes effort and that being upfront with the user creates expectations.
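For what it's worth, a quote is mostly arithmetic once the harness is willing to guess an output range. A rough sketch in Python, assuming made-up per-million-token prices and a hypothetical `quote_request` helper (neither comes from any real harness or price list):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per million tokens; not any vendor's real rates.
    input_per_mtok: float
    output_per_mtok: float


def quote_request(prompt_tokens: int,
                  min_output_tokens: int,
                  max_output_tokens: int,
                  pricing: Pricing) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for the next request.

    The output range is a guess the harness would have to make; the
    prompt size is known before the request is sent.
    """
    if min_output_tokens > max_output_tokens:
        raise ValueError("min_output_tokens must not exceed max_output_tokens")

    def cost(output_tokens: int) -> float:
        total = (prompt_tokens * pricing.input_per_mtok
                 + output_tokens * pricing.output_per_mtok)
        return total / 1_000_000

    return cost(min_output_tokens), cost(max_output_tokens)


if __name__ == "__main__":
    pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)  # assumed values
    low, high = quote_request(20_000, 500, 4_000, pricing)
    print(f"Estimated cost: ${low:.4f} to ${high:.4f}")
```

The hard part isn't the math, it's the output-range guess, which is exactly where the expectations problem comes from.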
I don't mind a PAYG model for a simple chat interface. But when it comes to actually producing things, you burn through TONS of tokens creating the wrong output.
That's already the case if you can self-host an LLM. You don't even need a mythical H200: gamer-grade GeForce cards can get you a long way there (if this page is to be believed: https://www.runpod.io/gpu-compare/rtx-5090-vs-h200).
...after RAM prices return to normal, of course, and after another 2 or 3 generations of GPU development bring a 96GB HBM card to the streets. That also assumes SotA or cloud-only LLMs don't experience lifestyle inflation, but I assume they must: the business model of OpenAI, Anthropic, etc. depends on people paying for access, so it's in their interest to make running these models locally as difficult as possible.
Give it 5 years and reassess.