GPT‑5.4 Mini and Nano

(openai.com)

128 points

by meetpateltech 2 hours ago

by pscanf 1 hour ago

I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.

They're incredibly slow (via official API or OpenRouter), but most of all they seem not to understand the instructions that I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models don't have a problem with the exact same prompt.

Does anybody else have a similar experience?

by thanhhaimai 6 minutes ago

Opinions are my own.

For agentic work, both Gemini 3.1 and Opus 4.6 passed the bar for me. I do prefer Opus because my SIs are tuned for that, and I don't want to rewrite them.

But ChatGPT models don't pass the bar. It seems to be trained to be conversational and role-playing. It "acts" like an agent, but it fails to keep the context to really complete the task. It's a bit tiring to always have to double check its work / results.

by tom1337 1 hour ago

Yea absolutely. I am using GPT 5.2 / 5.2 Codex with OpenCode and it just doesn't get what I am doing or loses context. Claude, on the other hand (via GitHub Copilot), has no problem and also discovers the repository on its own in new sessions, while I basically need to spoonfeed GPT. I also agree on the speed. Earlier today I tasked GPT 5.2 Codex with a small refactor of a task in our codebase with reasoning set to high, and it took 20 minutes to move around 20 files.

by furyofantares 56 minutes ago

I don't know any reason to use 5.2, when 5.3 is quite a bit faster.

by spiderfarmer 54 minutes ago

If using OpenAI models, use the Codex desktop app, it runs circles around OpenCode.

by jauntywundrkind 40 minutes ago

I've had such the opposite experience, but mainly doing agentic coding & little chat.

Codex is an ice man. Every other model will have a thinking output that is meaningful and significant, that is walking through its assumptions. Codex outputs only a very basic idea of what it's thinking about, and doesn't verbalize the problem or its constraints at all.

Codex also is by far the most sycophantic model. I am a capable coder, have my charms, but every single direction change I suggest, codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.

Opus I think does a better job of working with me to figure out what to build, and understanding the problem more. But I find it still has a propensity for making somewhat weird suggestions. I can watch it talk itself into some weird ideas. Which at least I can stop and alter! But I find it's less reliable at kicking out good technical work.

Codex is plenty fast in ChatGPT+. Speed is not the issue. I'm also used to GLM speeds. Having parallel work open, keeping an eye on multiple terminals is just a fact of life now; work needs to optimize itself (organizationally) for parallel workflows if it wants agentic productivity from us.

I have enormous respect for Codex, and think it (by a significant measure) has the best ability to code. In some ways I think maybe some of the reason it's so good is because it's not trying to convey complex dimensional exploration into an understandable human thought sequence. But I resent how you just have to let it work before you have a chance to talk with it and intervene. Even when discussing, it is extremely terse, and I find I have to ask it again and again and again to expand.

The one caveat I'll add: I've been dabbling elsewhere, but mainly I use OpenCode, and its prompt is pretty extensive and may be part of why Codex feels like an ice man to me. https://github.com/anomalyco/opencode/blob/dev/packages/open...

by pscanf 33 minutes ago

> I've had such the opposite experience

Yeah, I've actually heard many other people swear by the GPTs / Codex. I wonder what factors make one "click" with a model and not with another.

> Codex is an ice man.

That might be because OpenAI hides the actual reasoning traces, showing just a summary (if I understood correctly).

by nikanj 1 hour ago

Same, and I can't put my finger on the "why" either. Plus I keep hitting guard rails for the strangest reasons, like telling codex "Add code signing to this build pipeline, use the pipeline at ~/myotherproject as reference" and codex tells me "You should not copy other people's code signing keys, I can't help you with this"

by renewiltord 1 hour ago

Are you requesting reasoning via param? That was a mistake I was making. However, with the highest reasoning level I would frequently encounter a cyber security violation when using an agent that self-modifies.

I prefer Claude models as well or open models for this reason except that Codex subscription gets pretty hefty token space.

by pscanf 40 minutes ago

Yes, I think? But I was talking more specifically about using the models via API in agents I develop, not for agentic coding. Though, thinking about it, I also don't click with the GPT models when I use them for coding (using Codex). They just seem "off" compared to Claude.

by renewiltord 1 minute ago

I am also talking about agents I'm developing. They just happen to be self-modifying but they're not for agentic coding. You have to explicitly send the reasoning effort parameter. If you set effort to None (default for gpt-5.4) you get very low intelligence.
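A minimal sketch of sending the effort explicitly rather than relying on the model default. It only builds the keyword arguments and makes no network call. The `reasoning={"effort": ...}` shape assumes the OpenAI Responses API, `build_request` is a hypothetical helper, and the set of accepted effort values is an assumption to verify against the current API reference.

```python
from typing import Any

# Effort levels mentioned in this thread; which ones a given model accepts
# is an assumption to check against the provider's documentation.
ALLOWED_EFFORTS = {"none", "minimal", "low", "medium", "high", "xhigh"}


def build_request(model: str, prompt: str, effort: str = "medium") -> dict[str, Any]:
    """Build request keyword arguments with an explicit reasoning effort (hypothetical helper)."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": model,
        "input": prompt,
        # Set explicitly so the request never falls back to the model's default effort.
        "reasoning": {"effort": effort},
    }


request = build_request("gpt-5.4", "Summarize the open tickets.", effort="high")
print(request["reasoning"])
```

With the official Python SDK, a dictionary like this would typically be passed as `client.responses.create(**request)`.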

by birdsongs 52 minutes ago

> cyber security violation

Would you mind expanding on this? Do you mean in the resulting code? Or a security problem on your local machine?

I naively use models via our Copilot subscription for small coding tasks, but haven't gone too deep. So this kind of threat model is new to me.

by renewiltord 46 minutes ago

No, I mean the literal API response. They think I'm using it to hack. See related GitHub issue: https://github.com/anomalyco/opencode/issues/15776

I don't use OpenCode but looks like it also triggered similar use. My message was similar but different.

by Tiberium 1 hour ago

I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on the release day, but right now:

- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).

- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.

- GPT-5.4 Nano is at about 200 t/s.

To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.

This is raw tokens/s for all models, it doesn't exclude reasoning tokens, but I ran models with none/minimal effort where supported.
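For reference, a raw tokens/s figure like the ones above can be computed as in the sketch below. This is not necessarily how the numbers in this comment were measured: `tokens_per_second` is a hypothetical helper, the stream stands in for a streaming API response, and the clock is injectable so the demo runs deterministically offline.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return raw tokens/s (reasoning tokens are not excluded)."""
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed


# Deterministic demo: 380 fake tokens "received" over 2 fake seconds.
fake_ticks = iter([0.0, 2.0])
rate = tokens_per_second(["tok"] * 380, clock=lambda: next(fake_ticks))
print(f"{rate:.0f} tokens/s")
```

Note that timing from the start of consumption includes time to first token, so short responses will look slower than the steady-state decode speed.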

And quick price comparisons:

- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5

- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25

- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5

by coder543 1 hour ago

I wish someone would also thoroughly measure prompt processing speeds across the major providers too. Output speeds are useful too, but more commonly measured.

by JLO64 30 minutes ago

In my use case for small models I typically only generate a max of 100 tokens per API call, with the prompt processing taking up the majority of the wait time from the user perspective. I found OAI's models to be quite poor at this and made the switch to Anthropic's API just for this.

I've found Haiku to be pretty fast at PP (prompt processing), but would be willing to investigate using another provider if they offer faster speeds.
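A rough way to see why prompt processing dominates in this kind of workload is a two-term latency model: prompt tokens divided by prefill speed, plus output tokens divided by decode speed. The sketch below uses made-up numbers (8,000 prompt tokens, 4,000 t/s prefill, 200 t/s decode), not measurements from any provider, and `estimated_latency` is a hypothetical helper.

```python
def estimated_latency(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> tuple[float, float]:
    """Return (total seconds, fraction of that time spent on prompt processing)."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("speeds must be positive")
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    total_s = prefill_s + decode_s
    return total_s, prefill_s / total_s


# Hypothetical workload: long prompt, at most 100 generated tokens.
total, prefill_share = estimated_latency(
    prompt_tokens=8_000, output_tokens=100, prefill_tps=4_000, decode_tps=200
)
print(f"total {total:.2f}s, prompt processing {prefill_share:.0%} of the wait")
```

Under these assumed numbers, doubling the decode speed saves far less time than doubling the prefill speed, which is why output tokens/s alone says little about short-output use cases.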

by BoumTAC 2 hours ago

To me, mini releases matter much more and better reflect the real progress than SOTA models.

The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them.

Meanwhile, when a smaller / less powerful model releases a new version, the jump in quality is often massive, to the point where we can now use them 100% of the time in many cases.

And since they're also getting dramatically cheaper, it's becoming increasingly compelling to actually run these models in real-life applications.

by brikym 1 hour ago

If you're doing something common then maybe there are no differences with SOTA. But I've noticed a few. GPT 5.4 isn't as good at UI work in Svelte. Gemini tends to go off and implement stuff even if I prompt it to discuss, but it's pretty good at UI code. Claude tends to find out less about my code base than GPT, and it abuses the `any` type in TypeScript.

by patates 57 minutes ago

A big part of these differences may be the system prompts and/or the harness.

by sebastiennight 8 minutes ago

> 100% of the time in many cases

So, every single time, the new model works most of the time?

by pzo 2 hours ago

They are cheaper than SOTA, but they're not getting dramatically cheaper. It's actually the opposite: GPT 5.4 mini is ~3x more expensive than GPT 5.0 mini.

Similarly gemini 3.1 flash lite got more expensive than gemini 2.5 flash lite.

by BoumTAC 1 hour ago

But they are getting dramatically better.

What's the point of a crazy cheap model if it's shit?

I code most of the time with haiku 4.5 because it's so good. It's cheaper for me than buying a 23€ subscription from Anthropic.

by philipkglass 1 hour ago

The crazy cheap models may be adequate for a task, and low cost matters with volume. I need to label millions of images to determine if they're sexually suggestive (this includes but is not limited to nudity). The Gemini 2.0 Flash Lite model is inexpensive and performs well. Gemini 2.5 Flash Lite is also good, but not noticeably better, and it costs more. When 2.0 gets retired this June my costs are going up.

by HugoDias 2 hours ago

According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper?

GPT 5 mini: Input $0.25 / Output $2.00

GPT 5 nano: Input: $0.05 / Output $0.40

GPT 5.4 mini: Input $0.75 / Output $4.50

GPT 5.4 nano: Input $0.20 / Output $1.25
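The size of the increase can be worked out directly from the four price lines above. The sketch below assumes the figures are USD per 1M tokens (the usual convention, not stated in the comment); the ratios do not depend on the unit. `price_multiplier` is a hypothetical helper.

```python
# Prices as quoted in the comment above: (input, output),
# assumed to be USD per 1M tokens.
PRICES = {
    "GPT 5 mini": (0.25, 2.00),
    "GPT 5 nano": (0.05, 0.40),
    "GPT 5.4 mini": (0.75, 4.50),
    "GPT 5.4 nano": (0.20, 1.25),
}


def price_multiplier(old: str, new: str) -> tuple[float, float]:
    """Return (input multiplier, output multiplier) going from old to new."""
    old_in, old_out = PRICES[old]
    new_in, new_out = PRICES[new]
    return new_in / old_in, new_out / old_out


for old, new in [("GPT 5 mini", "GPT 5.4 mini"), ("GPT 5 nano", "GPT 5.4 nano")]:
    in_x, out_x = price_multiplier(old, new)
    print(f"{old} -> {new}: input {in_x:.2f}x, output {out_x:.2f}x")
```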

by simianwords 2 hours ago

Models are getting costlier in absolute terms but cheaper relative to their performance. Perhaps they don't see a point in supporting really low-performance models?

by HugoDias 2 hours ago

I would be curious to know whether, from the enterprise / API consumption perspective, these low-performance models aren't the most used ones. At least that matches our current scenario when it comes to tokens in / tokens out. I'd totally buy the price increase if these are becoming more efficient though, consuming fewer tokens.

by karmasimida 1 hour ago

Those are bigger models. The serving isn’t going to be cheaper.

Why expect cheaper then? The performance is also better

by mikkelam 36 minutes ago

Why are we treating LLM evaluation like a vibe check rather than an engineering problem?

Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
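As a concrete picture of what a version-controlled eval can look like at its smallest, here is a sketch: a fixed list of cases, a scoring rule, and a pass rate that can be compared across models or commits. All names (`EvalCase`, `run_eval`, `stub_model`) are hypothetical, and the stub stands in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose normalized output equals the expected answer."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        model(case.prompt).strip().lower() == case.expected.strip().lower()
        for case in cases
    )
    return passed / len(cases)


# In practice the cases would live in a file checked into the repository;
# these two are placeholders.
CASES = [
    EvalCase("How many times does 'r' appear in 'strawberry'?", "3"),
    EvalCase("What is 2 + 2?", "4"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; answers one case right and one wrong."""
    return "3" if "strawberry" in prompt else "5"


print(f"pass rate: {run_eval(stub_model, CASES):.0%}")
```

Exact-match scoring is the simplest possible rule; tasks with free-form answers would need a different scorer, which is where much of the real engineering effort goes.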

by tanaros 34 minutes ago

Whenever somebody makes a benchmark, people complain that the benchmark results are meaningless because they’re gamed. I don’t know why those same people don’t understand that grading on vibes is strictly worse.

by tintor 30 minutes ago

Depends on benchmark.

If questions are fixed they are trivial to game.

by pizza 21 minutes ago

There’s a Dark Forest problem for evals. As soon as they’re made public they start running out of time to be useful. It’s also not clear how to predict how the model will perform on a task based on an eval. Or even whether, given two skills that the model can individually do well on in the evals, it still does well on their composition. It might at this point be better to be scientific in unscientific approaches, than to attribute more power to relatively weakly predictive evals than they actually have

by xandrius 11 minutes ago

Is "Dark Forest problem" an actual name? I just heard of the hypothesis and it has nothing to do with how you used it in this context.

by sebastiennight 6 minutes ago

I believe the correct term is "Goodhart's Law": https://en.wikipedia.org/wiki/Goodhart%27s_law

by technocrat8080 19 minutes ago

5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for real-time computer use agents (rip Operator/Agent). Curious if anyone's been using these in production.

by cbg0 2 hours ago

Based on the SWE-Bench it seems like 5.4 mini high is ~= GPT 5.4 low in terms of accuracy and price but the latency for mini is considerably higher at 254 seconds vs 171 seconds for GPT5.4. Probably a good option to run at lower effort levels to keep costs down for simpler tasks. Long context performance is also not great.

by ryao 2 hours ago

I will be impressed when they release the weights for these and older models as open source. Until then, this is not that interesting.

by fastpdfai 43 minutes ago

One thing I really want to find out is which model to use, and how, to process TONS of PDFs very fast and very accurately, for predicting invoice dates, accrual accounting and other accounting-related purposes. So: a decently smart model that is really good at PDF and image reading while still being very, very fast.

by JLO64 22 minutes ago

I have a use case somewhat similar to this where I need to convert the content of PDFs in a non standard format to a specific YAML format. I currently use Haiku for this and am pleased with the accuracy/speed (I haven't tried scanned PDFs yet tho) however I've been thinking about fine tuning a small Qwen model for just this task. I can't yet justify the effort to investigate it but I imagine it could work out.

by derefr 23 minutes ago

OpenAI don't talk about the "size" or "weights" of these models any more. Anyone have any insight into how resource-intensive these Mini/Nano-variant models actually are at this point?

I assume that OpenAI continue to use words like "mini" and "nano" in the names of these model variants, to imply that they reserve the smallest possible resource-units of their inference clusters... but, given OpenAI's scale, that may well be "one B200" at this point, rather than anything consumers (or even most companies) could afford.

I ask because I'm curious whether the economics of these models' use-cases and call frequency work out (both from the customer perspective, and from OpenAI's perspective) in favor of OpenAI actually hosting inference on these models themselves, vs. it being better if customers (esp. enterprise customers) could instead license these models to run on-prem as black-box software appliances.

But of course, that question is only interesting / only has a non-trivial answer, if these models are small enough that it's actually possible to run them on hardware that costs less to acquire than a year's querying quota for the hosted version.

by technocrat8080 17 minutes ago

Have they ever talked about their size or weights?

by derefr 12 minutes ago

They never put the parameter counts in their model names like other AI companies did, but back in the GPT3 era (i.e. before they had PR people sitting intermediating all their comms channels), OpenAI engineers would disclose this kind of data in their whitepapers / system cards.

IIRC, GPT-3 itself was admitted to be a 175B model, and its reduced variants were disclosed to have parameter-counts like 1.3B, 6.7B, 13B, etc.

by tintor 27 minutes ago

Several customer testimonials for GPT-5.4 Mini have em dashes in them.

Did GPT write them?

by beklein 1 hour ago

As a big Codex user, with many smaller requests, this one is the highlight: "In Codex, GPT‑5.4 mini is available across the Codex app, CLI, IDE extension and web. It uses only 30% of the GPT‑5.4 quota, letting developers quickly handle simpler coding tasks in Codex for about one-third the cost." + Subagents support will be huge.

by hyperbovine 1 hour ago

Having to invoke `/model` according to my perceived complexity of the request is a bit of a deal breaker though.

by serf 1 hour ago

You use profiles for that [0], or in the case of a more capable tool (like opencode) they're more confusingly referred to as 'agents' [1], which may or may not coordinate subagents.

So, in opencode you'd make a "PR Meister" and "King of Git Commits" agent that was forced to use 5.4mini or whatever, and whenever it fell down to using that agent it'd do so through the preferred model.

For example, I use the spark models to orchestrate a bunch of sub-agents that may or may not use larger models, so I get sub-agents and concurrency spun up very fast in places where domain depth matters less.

[0]: https://developers.openai.com/codex/config-advanced#profiles

[1]: https://opencode.ai/docs/agents/

by 6thbit 1 hour ago

Looking at the long-context benchmark results for these, it sounds like they are also best suited to mini-sized context windows.

Is there any harness with an easy way to pick a model for a subagent based on the required context size the subagent may need?

by bananamogul 1 hour ago

They could call them something like “sonnet” and “haiku” maybe.

by kseniamorph 51 minutes ago

wow, not bad result on the computer use benchmark for the mini model. for example, Claude Sonnet 4.6 shows 72.5%, almost on par with GPT-5.4 mini (72.1%). but sonnet costs 4x more on input and 3x more on output

by dack 1 hour ago

i want 5.4 nano to decide whether my prompt needs 5.4 xhigh and route to it automatically
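A sketch of what such routing could look like: a cheap classifier grades the prompt, and the result picks the model and effort. Everything here is hypothetical, including the model identifiers (taken from names used in this thread), the `route` helper, and the keyword classifier, which is only an offline stand-in for asking the small model to grade difficulty.

```python
from typing import Callable

# Model identifiers and effort labels are assumptions based on names used in this thread.
SMALL_MODEL = "gpt-5.4-nano"
LARGE_MODEL = "gpt-5.4"


def route(prompt: str, needs_deep_reasoning: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model, effort) chosen by a cheap classifier (hypothetical helper)."""
    if needs_deep_reasoning(prompt):
        return LARGE_MODEL, "xhigh"
    return SMALL_MODEL, "low"


def keyword_classifier(prompt: str) -> bool:
    """Offline stand-in for asking the small model whether the prompt is hard."""
    hard_markers = ("refactor", "prove", "debug", "migrate")
    return any(marker in prompt.lower() for marker in hard_markers)


print(route("Refactor the billing module to use async I/O", keyword_classifier))
print(route("Translate 'hello' to French", keyword_classifier))
```

The hard part is the classifier itself: a misjudged prompt either wastes tokens on the large model or gets a weak answer from the small one.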

by mrtesthah 38 minutes ago

As per OpenAI themselves, xhigh is only necessary if the agent gets stuck on a long-running task. Otherwise its thinking traces use so many tokens of context that it's less effective than high for the great majority of tasks. This has also been my experience.

by exitb 49 minutes ago

Like any work estimation, it will likely disappoint.

by powera 2 hours ago

I've been waiting for this update.

For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will do even more and closer to 100% accuracy.

The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?

by HugoDias 2 hours ago

For us, it was also pretty good, but the performance decreased recently, which forced us to migrate to haiku-4.5. More expensive but much more reliable (when Anthropic is up, of course).

by throwaway911282 2 hours ago

They don't change the model weights (no frontier lab does). If you have evals, and all prompts and tool calls are the same, I'm curious how you can say performance decreased.

by powera 31 minutes ago

So far on my (simple) benchmarks, GPT-5.4-mini is looking very good. GPT-5.4-mini is about 30% faster than GPT-5-mini. GPT-5.4-mini gets 80% on the "how many Rs in Strawberry" test, and nearly perfect scores on everything else I threw at it.

GPT-5.4-nano is less impressive. I would stick to gpt-5.4-mini where precise data is a requirement. But it is fast, and probably cheaper and better quality than an 8-20B parameter local model would be.

( https://encyclopedia.foundation/benchmarks/dashboard/ for details - the data is moderately blurry - some outlier (15s) calls are included, a few benchmark questions are ambiguous, and some prices shown are very rough estimates ).

by yomismoaqui 1 hour ago

Not comparing with equivalent models from Anthropic or Google, interesting...

by Tiberium 1 hour ago

They did actually compare them in the tweet, see https://x.com/OpenAI/status/2033953592424731072

Direct image: https://pbs.twimg.com/media/HDoN4PhasAAinj_?format=png&name=...

by simianwords 2 hours ago

why isn't nano available in codex? could be used for ingesting huge amount of logs and other such things

by patates 52 minutes ago

IMHO the best way is to let a SOTA model have a look at a bunch of random samples and write you tools to analyze those.

I think no model, SOTA or not, has either the context or the attention to do anything meaningful with a huge amount of logs.
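One way to pull a uniform random sample out of logs too large to load at once is reservoir sampling, sketched below. This is a standard technique (Algorithm R) chosen for illustration; the comment does not say how the samples should be drawn, and `sample_lines` is a hypothetical helper.

```python
import random
from typing import Iterable, Optional


def sample_lines(lines: Iterable[str], k: int, seed: Optional[int] = None) -> list[str]:
    """Reservoir-sample up to k lines from a stream of unknown length (Algorithm R)."""
    if k <= 0:
        raise ValueError("k must be positive")
    rng = random.Random(seed)
    reservoir: list[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Demo on synthetic log lines; a real run would iterate over an open file instead.
fake_log = (f"line {n}: request handled" for n in range(100_000))
sample = sample_lines(fake_log, k=5, seed=42)
print(len(sample))
```

Because it makes a single pass and keeps only `k` lines in memory, the same function works on a file handle or a pipe, and the resulting sample is small enough to paste into a model's context.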

by machinecontrol 2 hours ago

What's the practical advantage of using a mini or nano model versus the standard GPT model?

by aavci 2 hours ago

Cheaper. Every month or so I visit the models used and check whether they can be replaced by the cheapest and smallest model possible for the same task. Some people do fine tuning to achieve this too.

by varispeed 1 hour ago

I stopped paying attention to GPT-5.x releases, they seem to have been severely dumbed down.

by casey2 1 hour ago

I googled all the testimonial names and they are all LinkedIn mouthpieces.

by miltonlost 1 hour ago

Does it still help drive people to psychosis and murder and suicide? Where's the benchmark for that?

by reconnecting 1 hour ago

All three ChatGPT models (Instant, Thinking, and Pro) have a new knowledge cutoff of August 2025.

Seriously?

by dpoloncsak 1 hour ago

Do you find the results vary based on whether it uses RAG to hit the internet vs the data being in the weights itself? I'm not sure I've really noticed a difference, but I don't often prompt about current events or anything.

by reconnecting 58 minutes ago

I noticed that many recent technologies are not familiar to LLMs because of the knowledge cutoff, and thus might not appear in recommendations even if they better match the request.

by zild3d 1 hour ago

What's surprising about that? Most of the minor version updates from all the labs are post-training updates that don't change the knowledge cutoff.

by reconnecting 1 hour ago

Thanks for letting me know, I will be waiting for the major update.

by F7F7F7 59 minutes ago

It's been like this since GPT 3.5. This is not a limitation and is generally considered a natural outcome of the process.

So there's no major update in the sense that you might be thinking. Most of the time there's not even an announcement when/if training cut offs are updated. It's just another byline.

A 6 month lag seems to be the standard across the frontier models.

by reconnecting 53 minutes ago

I've actually started worrying that the amount of false data produced with LLMs on the public internet might provoke a situation where the knowledge cutoff becomes permanently (and silently) frozen. Like we can't trust data after 2025 because it will poison training data at scale, and models will only cover major events without capturing the finer details.

by gwern 20 minutes ago

I agree. That's why you should write as much as you can now, if you want to get it into the LLMs (https://gwern.net/blog/2024/writing-online). You never know when the window will slam shut and LLM training goes 'hermetic' as they focus on 'civilization in a datacenter' where only extremely vetted whitelisted data gets included in the 'seed' and everything is reconstructed from scratch for the training value & safety.

by system2 1 hour ago

I am feeling the version fatigue. I cannot deal with their incremental bs versions.

by beernet 30 minutes ago

Crazy how OAI is way behind now and the only one to blame is Sam, his ego and lust for influence. Their downwards trajectory of paying accounts since "the move" (DoW deal) is an open secret. If you had placed a new CEO at OAI six months ago and told him to destroy the company, it would have been hard for that CEO to do a better job at that than Sam did. Should have left when he was let go but decided to go full Greg and MAGA instead. Here we are. Go Dario

by beernet 6 minutes ago

Just to elaborate, as I am getting downvoted by tech bros:

OpenAI restructures after Anthropic captures 70% of new enterprise deals. Claude Code hits $2.5B while Codex lags at $1B ahead of dual IPOs.

Src: https://www.implicator.ai/openai-cuts-its-side-quests-the-en...
