But what about real price in real agentic use? For example, Opus 4.5 was more expensive per token than Sonnet 4.5, but it used a lot less tokens so final price per completed task was very close between the two, with Opus sometimes ending up cheaper
How can you determine whether it's as good as Opus 4.5 within minutes of release? The quantitative metrics don't seem to mean much anymore. Noticing qualitative differences seems like it would take dozens of conversations and perhaps days to weeks of use before you can reliably determine the model's quality.
Just look at the testimonials at the bottom of introduction page, there are at least a dozen companies such as Replit, Cursor, and Github that have early access. Perhaps the GP is an employee of one of these companies.