Gemini 2.5 flash (27 score): $172 (1.0x)
Gemini 2.5 pro (35 score): $649 (3.8x)
Gemini 3.0 Flash (46 score): $278 (1.6x)
Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)
This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
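For anyone who wants to check the multipliers above, here is a quick sketch. The dollar figures are copied from the list; the dictionary and function names are only illustrative.

```python
# Dollar figures copied from the list above.
costs = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}

def multiplier(model: str, baseline: str) -> float:
    """Return how many times more expensive `model` is than `baseline`."""
    return costs[model] / costs[baseline]

# Every model relative to 2.5 Flash, then the two extra comparisons.
for name in costs:
    print(f"{name}: {multiplier(name, 'Gemini 2.5 Flash'):.1f}x vs 2.5 Flash")

print(f"3.5 Flash vs 2.5 Pro: {multiplier('Gemini 3.5 Flash', 'Gemini 2.5 Pro'):.1f}x")
print(f"3.5 Flash vs 3.0 Flash: {multiplier('Gemini 3.5 Flash', 'Gemini 3.0 Flash'):.1f}x")
```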
From what I hear, most enterprise AI deployments are seat-based subscriptions with annual commitments.
Amusingly, Enterprise credits are more expensive than just paying a zero-commitment on-demand API fee. Personal accounts are still the best value.
People really can’t wait to be the next Zynga
Or maybe they think that because their benchmarks are good they can ramp up the prices. To me, it seems like they don’t yet have the market share to justify a move like that.
My guess: it's the price at which they make more money than if they rent the TPUs to other companies.
The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?
Just because you are vertically integrated doesn't mean you get to discount one business unit's products to another. Doing so ignores the opportunity cost you pay and is just bad accounting.
You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.
Flash seems to be targeting the near-frontier category.
I think frontier models will be invaluable for scientific research, defense, financial analysis and such. But the average person probably would be reasonably well-served with a local model.
If you're in sales, customer service, product management and such - the leading open models at the 30B mark are already good enough.
Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.
https://www.together.ai/pricing
https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)
Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.
But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.
...my opinions here are of course, conjecture built on top of conjecture....
I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.
The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.
That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.
At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.
(and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)
Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.
The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.
And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.
The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference. The capability has to be there, but I think we're seeing that capability alone isn't a strong moat: there's enough competent competition to ensure there are always at least a few options, even at the very frontier of capability.
You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long-horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.
DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.
People report good results from DeepSeek V4 Flash at 2 bits (the DwarfStar 4 folks are doing it, and I've tried it on my Strix Halo, but it's too slow to be usable, so I haven't bothered to figure out if it's actually smart enough to use for anything).
Anyway, it's obvious that models have to degrade in terms of knowledge at any quantization, even though it may not show up clearly on benchmarks until lower bit widths. If you halve the size of the data available, the model necessarily loses information about the world.
This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.
But even when it happens I doubt it would be as cheap as it is right now. Enjoy it while it lasts!
Please go run some numbers. The hardware needed to run DeepSeek V4 Flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.
Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1M tokens. You make less than $1k per month, and the hardware required to run that costs $10k a month to rent, with hardly any capacity for concurrent sessions.
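To put a number on that, here is a rough sketch of the monthly revenue from a single stream at the quoted rate. It assumes a 30-day month and 100% utilization, both of which are my simplifications rather than anything from the comment; the function name is illustrative.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assumption: 30-day month

def monthly_revenue(tokens_per_second: float, usd_per_million: float) -> float:
    """Revenue from one stream generating tokens nonstop for a month."""
    tokens = tokens_per_second * SECONDS_PER_MONTH
    return tokens / 1_000_000 * usd_per_million

revenue = monthly_revenue(150, 0.30)  # 150 tps at 30 cents per 1M tokens
rental = 10_000                       # monthly hardware rental quoted above

print(f"revenue ${revenue:,.2f}/month vs rental ${rental:,}/month")
```

Under those assumptions a single stream brings in a little over $100 a month, so the whole question is how many concurrent sessions the same hardware can serve.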
- DeepSeek serves DeepSeek V4 Pro at 27 tps: https://openrouter.ai/deepseek/deepseek-v4-pro
- At 27 tps per user, a B300 GPU will give you around 800 tokens per second (serving 30 users): https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...
- That's 800 * 60 * 60 generated tokens per hour, at a cost of $0.87 per 1M tokens, or $2.50 per hour.
- For input and output tokens, the math is a bit more complicated because we have to make assumptions about their ratio. Using the published values from OpenCode, we get another $2.50 for cached tokens (which are almost free for DeepSeek) and another $3.40 for input tokens (which are a lot cheaper to compute than output tokens), which gives us a total of $8.50 per hour per B300 GPU.
- B300 GPUs can be rented for as low as $3.40 per hour, which is less than $8.50, so hosting DeepSeek V4 Pro is profitable.
You could also host it at fewer tps per user to raise the efficiency and therefore the profit even higher.
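The arithmetic in the list above can be sketched as follows. All inputs are the figures quoted in the list; only the output-token revenue is recomputed, while the cached and input revenue are taken as given. With these rounded components the total comes out near $8.40 rather than $8.50, which is presumably rounding and does not change the conclusion.

```python
def hourly_revenue(tokens_per_second: float, usd_per_million: float) -> float:
    """Revenue per hour for a GPU producing tokens at a steady rate."""
    tokens_per_hour = tokens_per_second * 60 * 60
    return tokens_per_hour / 1_000_000 * usd_per_million

output_revenue = hourly_revenue(800, 0.87)  # 800 tps at $0.87 per 1M tokens
cached_revenue = 2.50    # quoted above, not recomputed
input_revenue = 3.40     # quoted above, not recomputed
rental_per_hour = 3.40   # lowest B300 rental price quoted above

total = output_revenue + cached_revenue + input_revenue
margin = total - rental_per_hour

print(f"output: ${output_revenue:.2f}/h, total: ${total:.2f}/h, margin: ${margin:.2f}/h")
```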
Smh, it's all downhill from the first unadulterated neuron.
I think it is priced high because it's basically their smartest model as well as their fastest, so why shouldn't they?
You can still use earlier generations of Flash at a lower cost if you want "fast and cheap and just OK," which often makes sense. (Just checked)
I would predict they will lower this price when 3.5 High appears, but perhaps not all the way.
Just like in software, some of the most beautiful solutions come from constraints. Think of the optimisations that game developers implemented because of the frame budget.
Or if you prefer smaller ones, Qwen3.6-35B-A3B, https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF
https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...
3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.
For comparison, Opus models are $5/$25
Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.
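To see what those price pairs mean per request, here is a small sketch. It assumes the figures are USD per 1M input tokens / USD per 1M output tokens (the usual convention, but not stated above), and the 20,000-in / 2,000-out workload is a made-up example.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_in_per_m: float, usd_out_per_m: float) -> float:
    """Cost in USD of one request under per-million-token pricing."""
    return (input_tokens * usd_in_per_m + output_tokens * usd_out_per_m) / 1_000_000

# (input price, output price), as quoted in the thread.
PRICES = {
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Opus": (5.00, 25.00),
}

# Hypothetical workload: 20k input tokens, 2k output tokens.
workload = (20_000, 2_000)

for model, (price_in, price_out) in PRICES.items():
    print(f"{model}: ${request_cost(*workload, price_in, price_out):.3f} per request")
```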
Outside of coding, Claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.
They sure are not good at thorough analysis or debugging, etc.
I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.
That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.
Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.
Question is are you going to persuade anyone with this argument?
Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.
A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.
https://x.com/Steve_Yegge/status/2046260541912707471
A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.
https://x.com/demishassabis/status/2043867486320222333
This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:
I think false (or hasn't filtered to everyone lol)
Empty Slot (new Pro as Mythos competitor?)
Old Pro -> now Flash
Old Flash -> now Flash Lite
Old Flash Lite -> now Gemma (and not served by Google)
I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...
This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.
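A multi-provider setup like that can be sketched as a simple fallback chain. This is only an illustration: the providers here are stand-in functions, not real client calls, and every name in the block is hypothetical. In practice each callable would wrap whatever client you use for a given provider.

```python
from typing import Callable, Sequence

# A provider is any function that takes a prompt and returns a completion.
Provider = Callable[[str], str]

class AllProvidersFailed(Exception):
    """Raised when every provider in the chain raised an error."""

def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider name, completion) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real service would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Stand-in providers for demonstration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("rate limited")

def steady_backup(prompt: str) -> str:
    return f"echo: {prompt}"

used, text = complete_with_fallback(
    "hello", [("primary", flaky_primary), ("backup", steady_backup)]
)
print(used, text)
```

Keeping the provider list as plain data means the order can be rearranged whenever a vendor changes its lineup or pricing.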
But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.
And if we think this way, it's possible that prices are actually falling?
Inference alone is certainly profitable. I'm running models at home, for free, that are comparable in performance to paid models from a year or so ago. Even for much larger models, the costs around inference serving are clearly manageable.
Training is where the costs are, but I'm increasingly convinced those costs too could be dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.
This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).
It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).
Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?
The value of the firm's operating assets is the present value of its expected free cash flows, where free cash flow to the firm = EBIT(1-t) - Reinvestment.
You (Anthropic) want that sky-high valuation? Accept reinvestment is part of the equation.
If they decide to stop reinvesting, then they are as good as dead.
Moreover, they clearly are not re-investing cash flows from operations. Why do you think they are continually raising money? Lmao.
Ed Zitron and Gary Marcus are... confused.
Amazon was unprofitable because they poured their revenue into growth. On paper, they were in the red, but everyone - especially investors - saw what was going to happen, given their trajectory.
Is it the case that any of these AI companies are actually making a ton of money and growing accordingly? AFAICT, we've just got [a] big players like Google that can subsidize AI in the hopes of waiting everyone else out and [b] private companies raising capital in the hopes that when the market returns to rationality, they may be solvent.
> HSBC Global Investment Research projects that OpenAI still won’t be profitable by 2030, even though its consumer base will grow by that point to comprise some 44% of the world’s adult population (up from 10% in 2025). Beyond that, it will need at least another $207 billion of compute to keep up with its growth plans.
This article is from six months ago. Was HSBC wrong? Did something dramatically change in the last six months? Is OpenAI not, in fact, profitable? Or are they in fact doing well but making a huge investment (as was the case with Amazon 25ish years ago)?
I genuinely do not know, but my impression is that they're burning investment capital trying to compete with others' investment capital and Google's bottomless pockets.
[0] https://fortune.com/2025/11/26/is-openai-profitable-forecast...
Whoever buys the stock at a richly priced 1tn at ipo is a bozo lmao. I know I know, index funds will be forced to hold it bypassing the 1 year rule. Disaster already.
The trend lines are going in the opposite direction.
That's not to say they will be or are wrong, it's just that they aren't exactly unbiased, or humble, sources.
The small models are useful for small things like summarizing text or search but not much else.
Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.
Google has just recently upgraded their TPUs.
It's pretty funny that everyone says this business is unsustainable, but I have yet to see anyone go bankrupt, even the pure hardware providers who are renting out A100s and B200s.
I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.
Maybe I'll look at Opus again, but it was slower, much more expensive and, worst of all, wasn't listening to your instructions.
and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)
Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.