We've been really impressed with the performance of ~30B-parameter-class models and how close they are to the frontier of ~6-12 months ago, which raises the question: are the frontier labs really serving 10T-parameter models? It seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart at one-shot coding and reasoning, btw. Near the frontier. But it doesn't do so well in long-horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

Data at https://gertlabs.com/rankings

by MisterPea11 minutes ago|

I exclusively use Gemini models and this has been my experience.

I mitigate it by creating dense planning docs for everything and executing iteratively.

Lots of time wasted on procedure, unfortunately.

by easygenes1 hours ago|

We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

Mythos is an exception that's larger.

by daemonologist1 hours ago|

If this is accurate it raises the question: why is this model so expensive? DeepSeek v4 Flash is 284B total/13B active, FP4/FP8 mixed, and only costs $0.14/$0.28 - even less from OpenRouter. Of course Gemini 3.5 Flash is most likely a better product, and therefore it can command a higher price from an economics perspective, but does this imply Google is taking roughly a 90% profit margin on inference? If so they're either very compute-limited or confident in the model and wanting to recoup training/fixed costs (or both).
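For what it's worth, the margin arithmetic can be sketched like this. It's a back-of-the-envelope sketch, not a cost model: it assumes the DeepSeek v4 Flash prices quoted above are a rough stand-in for cost to serve, which is a big assumption, and the function names are made up for illustration. Gross margin is 1 - cost/price, so a 90% margin means charging about 10x cost.

```python
def implied_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price: 1 - cost / price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return 1.0 - cost / price


def price_for_margin(cost: float, margin: float) -> float:
    """Price needed to earn a given gross margin on a given cost."""
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must be in [0, 1)")
    return cost / (1.0 - margin)


# ASSUMPTION: the DeepSeek v4 Flash prices quoted above ($0.14 in / $0.28 out)
# approximate cost to serve. Real serving cost is not public.
ASSUMED_COST_INPUT = 0.14
ASSUMED_COST_OUTPUT = 0.28

# Prices that would correspond to a 90% margin under that assumption,
# in the same units as the quoted prices (roughly 10x the assumed cost).
input_price_at_90 = price_for_margin(ASSUMED_COST_INPUT, 0.90)
output_price_at_90 = price_for_margin(ASSUMED_COST_OUTPUT, 0.90)

print(f"90% margin implies about ${input_price_at_90:.2f} in / ${output_price_at_90:.2f} out")
```

Plug any actual list price into `implied_margin(price, cost)` to see what margin it would imply under the same assumption.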

by xmonkee1 hours ago|

Well, we use Flash models extensively (both 2.5 and 3.1), and I cannot overstate this: Google cannot fucking serve them without 503s 70% of the time on most days.

I think it’s pure economics. Flash models are OP for the price, which leads to too much demand, and Google cannot serve it. This one is likely priced high to reduce load, and hey, if it still makes money, just keep the margin.

by WarmWash1 hours ago|

Rumor is that GCP was happily selling compute to competitors. After all, under the hood, Google is closer to a federation than a corporation. The state of GCP doesn't care about the state of Gemini.

by happyopossum1 hours ago|

> Rumor is

It’s not a rumor. There are many public announcements of multi-billion-dollar compute deals with other AI companies.

by zacksiri2 hours ago|

Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing and benchmarks, I think it's around the 100-120B mark.

The Pro variant would be around 600B-800B.

My testing compares its performance and output to other models in the same size range, so it's not as scientific as yours.

by Maven9112 hours ago|

Tell me more about what your day looks like. What do you think of the LLMOps book from Abi, in case you have read it? Any other resources you can recommend?

by anthonypasq961 hours ago|

Given this, is it safe to assume that inference pricing is barely related to cost to serve at this point, and that there is considerable margin?

by nilstenura22 minutes ago|

[flagged]