undefined

upvote

points

by spankalee8 hours ago |

upvote

by bluegatty3 hours ago|

[-]

Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.

I think it speaks to the broader notion of AGI as well.

Claude is definitively trained on the process of coding not just the code, that much is clear.

Codex has the same limitation but not quite as bad.

This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.

Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.

reply

upvote

by rhubarbtree2 hours ago|

[-]

Google are stuck because they have to compete with OpenAI. If they don’t, they face an existential threat to their advertising business.

But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.

That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.

So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.

reply

upvote

by frogperson1 hours ago|

[-]

Google might be a mess now, but they have time. OpenAI and Anthropic are on barrowed time, Google has a built in money printer. They just need to outlast the others.

reply

upvote

by harrall47 minutes ago|

[-]

Plus they started making AI processors 11 years ago and invented the math behind “GPTs” 9 years ago. Gemini is way cheaper to run for them than it does for everyone else.

I think Gemini is really built for their biggest market — Google Search. You ask questions and get answers.

I’m sure they’ll figure out agentic flows. Google is always a mess when it comes to product. Don’t forget the Google chat sagas where it seems as if different parts of the company were making the same product.

reply

upvote

by seg_lol5 minutes ago|

[-]

They have much much less time than one would think. Their ads business is about to go into freefall, this will cause the whole company to spiral.

reply

upvote

by bluegatty2 hours ago|

[-]

Yup, you got it. It's a weird situation for sure.

You know what's also weird: Gem3 'Pro' is pretty dumb.

OAI has 'thinking levels' which work pretty well, it's nice to have the 'super duper' button - but also - they have the 'Pro' product which is another model altogether and thinks for 20 min. It's different than 'Research'.

OAI Pro (+ maybe Spark) is the only reason I have OAI sub. Neither Anthropic nor Google seem to want to try to compete.

I feel for the head of Google AI, they're probably pulled in major different directions all the time ...

reply

upvote

by visarga2 hours ago|

[-]

If you want that level of research I suggest you ask the model to draft a markdown plan with "[ ]" gates for todo items, and plan it in as many steps as needed. Then ask another LLM to review the plan, judge it. In the end use the plan as the execution state tracker, the model solves one by one the checkboxes.

Using this method I could recreate "deep research" mode on a private collection of documents in a few minutes. A markdown file can be like a script or playbook, just use checkboxes for progress. This works for models that have file storage and edit tools, which is most, starting with any coding agent.

reply

upvote

by bluegatty2 hours ago|

[-]

OAI Pro is not a 'research' tool in that sense, and it's definitely different than the 'deep research' options avail on most platforms, as I indicated.

It's a different kind of solution altogether.

I suggest trying it.

reply

upvote

by dakolli44 minutes ago|

[-]

They all suck!!!

reply

upvote

by jacquesm2 hours ago|

[-]

Google is scoring one own goal after another by making people working with their own data wonder how much of that data is sent off to be used to train their AI on. Without proof to the contrary I'm going to go with 'everything'.

They should have made all of this opt-in instead of force-feeding it to their audience, which they wrongly believe to be captive.

reply

upvote

by spankalee2 hours ago|

[-]

> Claude is definitively trained on the process of coding not just the code

This definitely feels like it.

It's hard to really judge, but Gemini feels like it might actually write better code, but the _process_ is so bad that it doesn't matter. At first I thought it was bad integration by the GitHub Copilot, but I see it elsewhere now.

reply

upvote

by juleiie25 minutes ago|

[-]

I don’t think Gemini writes better code, not 3.0 at least.

Maybe with good prompt engineering it does? admittedly I never tried to tell it to not hard code stuff and it just was really messy generally. Whereas Claude somehow can maintain perfect clarity to its code and neatness and readability out of the box.

Claude’s code really is much easier to understand and immediately orient around. It’s great. It’s how I would write it for myself. Gemini while it may work is just a total mess I don’t want to have in my codebase at all and hate to let it generate my files even if it sometimes finds solutions to problems Claude doesn’t, what’s the use of it if it is unreadable and hard to maintain.

reply

upvote

by andai3 hours ago|

[-]

Tell me more about Codex. I'm trying to understand it better.

I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.

I think that's partly the personality and tone, but I think it goes deeper than that.

(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)

reply

upvote

by fhub2 hours ago|

[-]

> Opus feels more like a guy to me, while Codex feels like a machine

I use one to code and the other to review. Every few days I switch who does what. I like that they are different it makes me feel like I'm getting different perspectives.

reply

upvote

by bluegatty2 hours ago|

[-]

Your intuition is exactly correct - it's not just 'tone' it's 'deeper than that'.

Codex is a 'poor communicator' - which matters surprisingly a lot in these things. It's overly verbose, it often misses the point - but - it is slightly stronger in some areas.

Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.

With 'wait-thinking' - you an have 3-5 AIs going, because it takes time to process but with Cerebras-backed models ... maybe 1 or 2.

Basically - you're the 'slowpoke' doing the thinking now. The 'human is the limiting factor'. It's a weird feeling!

Codex has a more adept 'rollover' on it's context window it sort of magically does context - this is hard to compare to Claude because you don't see the rollover points as well. With Claude, it's problematic ... and helpful to 'reset' some things after a compact, but with Codex ... you just keep surfing and 'forget about the rollover'.

This is all very qualitative, you just have to try it. Spark is only on the Pro ($200/mo) version, but it's worth it for any professional use. Just try it.

In my workflow - Claude Code is my 'primary worker' - I keep Codex for secondary tasks, second opinions - it's excellent for 'absorbing a whole project fast and trying to resolve an issue'.

Finally - there is a 'secret' way to use Gemini. You can use gemeni cli, and then in 'models/' there is a way to pick custom models. In order to make Gem3 Pr avail, there is some other thing you have to switch (just ask the AI), and then you can get at Gem3 Pro.

You will very quickly find what the poster here is talking about: it's a great model, but it's a 'Wild Stallion' on the harness. It's worth trying though. Also note it's much faster than Claude as well.

reply

upvote

by embedding-shape2 hours ago|

[-]

Spark is fun and cool, but it isn't some revolution. It's a different workflow, but not suitable for everything that you're use GPT5.2 for with thinking set to high, for example, it's way more dumb and makes more mistakes, while 5.2 will carefully thread through a large codebase and spend 40 minutes just to validate the change actually didn't break anything, as long as you provide prompts for it.

Spark on the other hand is a bit faster at reaching a point when it says "Done!", even when there is lots more it could do. The context size is also very limiting, you need to really divide and conquer your tasks, otherwise it'll gather files and context, then start editing one file, trigger the automatic context compaction, then forget what it was doing and begin again, repeating tons of time and essentially making you wait 20 minutes for the change anyways.

Personally I keep codex GPT5.2 as the everyday model, because most of the stuff I do I only want to do once, and I want it to 100% follow my prompt to the letter. I've played around a bunch with spark this week, and been fun as it's way faster, but also completely different way of working, more hands-on, and still not as good as even the gpt-codex models. Personally I wouldn't get ChatGPT Pro only for Spark (but I would get it for the Pro mode in ChatGPT, doesn't seem to get better than that).

reply

upvote

by bluegatty2 hours ago|

[-]

Spark is the 'same model and harness' but on Cerebras.

Your intuition may be deceiving you, maybe assuming it's a speed/quality trade-off, it's not.

It's just faster hardware.

No IQ tradeoff.

If you toy around with Cerebras directly, you get a feel for it.

Edit: see note below, I'm wrong. Not same model.

reply

upvote

by striking1 hours ago|

[-]

> Today, we’re releasing a research preview of GPT‑5.3‑Codex‑Spark, a smaller version of GPT‑5.3‑Codex, and our first model designed for real-time coding.

from https://openai.com/index/introducing-gpt-5-3-codex-spark/, emphasis mine

reply

upvote

by bluegatty45 minutes ago|

[-]

You're right. It's funny because I kind of noticed that, but with all of these subtle model issues, I'm so used to being distraught by the smallest thing I've had to learn to 'trust the data' aka the charts, model standings, performance, etc. and in this case, I was under the assumption 'it was the same model' clearly it's not.

Which is a bummer because it would be nice to try a true side-by-side analysis.

reply

upvote

by Bnjoroge18 minutes ago|

[-]

Agree with this except that spark is good or worth it. Absolutely not for $200, it's a step or two below opus 4.6 for actual reasoning.

reply

upvote

by karmasimida7 hours ago|

[-]

Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

reply

upvote

by onlyrealcuzzo6 hours ago|

[-]

Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

reply

upvote

by Macha6 hours ago|

[-]

> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

reply

upvote

by curly65 hours ago|

[-]

> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.

via Anthropic

https://www.anthropic.com/research/measuring-agent-autonomy

this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this

https://www.adweek.com/media/google-gemini-ads-2026/

reply

upvote

by onlyrealcuzzo5 hours ago|

[-]

> (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

Nobody is paying for Search. According to Google's earnings reports - AI Overviews is increasing overall clicks on ads and overall search volume.

reply

upvote

by bayindirh4 hours ago|

[-]

So, apparently switching to Kagi continues to pay in dividends, elegantly.

No ads, no forced AI overview, no profit centric reordering of results, plus being able to reorder results personally, and more.

reply

upvote

by nimchimpsky13 minutes ago|

[-]

[dead]

reply

upvote

by alphabetting6 hours ago|

[-]

the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.

For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:

1. Gemini 3.1 Pro - 33.2% 2. Opus 4.6 - 29.8% 3. GPT 5.2 Codex - 27.6% 4. Gemini Flash 3.0 - 24.0% 5. GPT 5.2 - 23.0% 6. Gemini 3.0 Pro - 18.0%

reply

upvote

by kakugawa4 hours ago|

[-]

In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.

reply

upvote

by girvo4 hours ago|

[-]

Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

reply

upvote

by avereveard2 hours ago|

[-]

What's your opinion of glm5 if you had a chance to use it

reply

upvote

by girvo19 minutes ago|

[-]

I haven’t yet, though I will be this weekend!

reply

upvote

by metadat3 hours ago|

[-]

Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.

reply

upvote

by nl2 hours ago|

[-]

Codex has been post-trained for coding, including agentic coding tasks.

It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.

reply

upvote

by 306bobby3 hours ago|

[-]

It could be problem specific. There are certain non program things that opus seems better than sonnet at as well

reply

upvote

by 306bobby3 hours ago|

[-]

Swapped sonnet and opus on my last reply, oops

reply

upvote

by blueaquilae4 hours ago|

[-]

Marketing team agree with benchmark score...

reply

upvote

by HardCodedBias5 hours ago|

[-]

LOL come on man.

Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).

If we see on HN that people are willing switching their coding environment, we'll know "hot damn they cooked" otherwise this is another wiff by Google.

reply

upvote

by drivebyhooting3 hours ago|

[-]

You can’t put Gemini and Meta in the same sentence. Llama 4 was DOA, and Meta has given up on frontier models. Internally they’re using Claude.

reply

upvote

by not_ai2 hours ago|

[-]

After spending all that money and firing a bunch of people? Is the new group doing anything at this point?

reply

upvote

by dekhn1 hours ago|

[-]

They are busy demonstrating that Mark Zuckerberg has no sense at all.

reply

upvote

by hintymad4 hours ago|

[-]

My guess is that Gemini team didn't focus on the large-scale RL training for the agentic workload. And they are trying to catch up with 3.1.

reply

upvote

by swftarrow3 hours ago|

[-]

I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.

reply

upvote

by 3 hours ago|

[-]

deleted

reply

upvote

by ionwake6 hours ago|

[-]

Can you explain what you mean by its bad at agentic stuff?

reply

upvote

by karmasimida6 hours ago|

[-]

Accomplish the task I give to it without fighting me with it.

I think this is classic precision/recall issue: the model needs to stay on task, but also infer what user might want but not explicitly stated. Gemini seems particularly bad that recall, where it goes out of bounds

reply

upvote

by ionwake3 hours ago|

[-]

cool thanks for the explanation

reply

upvote

by renegade-otter4 hours ago|

[-]

It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.

reply

upvote

by miohtama3 hours ago|

[-]

One does not need products if you have monopoly on search

reply

upvote

by margorczynski3 hours ago|

[-]

That monopoly is worth less as time goes by and people more and more use LLMs or similar systems to search for info. In my case I've cut down a lot of Googling since more competent LLMs appeared.

reply

upvote

by Bnjoroge17 minutes ago|

[-]

Agree, even through gemini cli, gemini 3 has just been underwhelming. You can clearly tell, the agentic harness/capability wasnt native to the model at all. Just patched on it

reply

upvote

by avereveard2 hours ago|

[-]

Yeah gemini 3.0 is unusable to me, to an extent all models do things right or wrong, but gemini just refuses to elaborate.

Sometime you can save so much time asking claude codex and glm "hey what you think of this problem" and have a sense wether they would implement it right or not.

Gemini never stops instead goes and fixes whatever you trow at it even if asked not to, you are constantly rolling the dice but with gemini each roll is 5 to 10 minutes long and pollutes the work area.

It's the model I most rarely use even if, having a large google photo tier, I get it for basically free between antigravity, gemini-cli and jules

For all its fault anthropic discovered pretty early with claude 2 that intelligence and benchmark don't matter if the user can't steer the thing.

reply

upvote

by s3p7 hours ago|

[-]

Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"

reply

upvote

by foz6 hours ago|

[-]

This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.

reply

upvote

by dumpsterdiver4 hours ago|

[-]

To be fair, considering that the CoT exposed to users is a sanitized summary of the path traversal - one could argue that sanitized CoT is closer to hiding things than simply omitting it entirely.

reply

upvote

by mikestorrent4 hours ago|

[-]

This is something that bothers me. We had a beautiful trend on the Web of the browser also being the debugger - from View Source decades ago all the way up to the modern browser console inspired by Firebug. Everything was visible, under the hood, if you cared to look. Now, a lot of "thinking" is taking place under a shroud, and only so much of it can be expanded for visibility and insight into the process. Where is the option to see the entire prompt that my agent compiled and sent off, raw? Where's the option to see the output, replete with thinking blocks and other markup?

reply

upvote

by fragmede2 hours ago|

[-]

If that's what you're after, tou MITM it and setup a proxy so Claude Code or whatever sends to your program, and then that program forwards it to Anthropics's server (or whomever). That way, you get everything.

reply

upvote

by mikestorrent1 hours ago|

[-]

I'm aware that this is possible, and thank you for the suggestion, but surely you can see that it's a relatively large lift; may not work in controlled enterprise environments; and compared to just right click -> view source it's basically inaccessible to anyone who might have wanted to dabble.

reply

upvote

by raducu3 hours ago|

[-]

> Don't get me started on the thinking tokens.

Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.

Not because the LLM lies, but because humans do that also -- when asked how the figured something, they'll provide a reasonable sounding chain of thought, but it's not how they figured it out.

reply

upvote

by ceroxylon2 hours ago|

[-]

I once saw "now that I've slept on it" in Gemini's CoT... baffling.

reply

upvote

by dist-epoch6 hours ago|

[-]

That's not the real thinking, it's a super summarized view of it.

reply

upvote

by raducu3 hours ago|

[-]

> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.

Yes, gemini loops but I've found almost always it's just a matter of interrupting and telling it to continue.

Claude is very good until it tries something 2-3 times, can't figure it out and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduce hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.

reply

upvote

by RachelF3 hours ago|

[-]

I haven't used 3.1 yet, but 3.0 Pro has been frustrating for two reasons:

- it is "lazy": I keep having to tell it to finish, or continue, it wants to stop the task early.

- it hallucinates: I have arguments with it about making up API functions to well known libraries which just do not exist.

reply

upvote

by thot_experiment2 hours ago|

[-]

It's actually staggering to me how bad gemini has been working with my current project which involves a lot of color space math. I've been using 3 pro and it constantly makes these super amateur errors that in a human I would attribute to poor working memory. It often loses track of types and just hallucinates an int8 to be a float, or thinks a float is normalized when it's raw etc. It feels like how I write code when I'm stoned, it's always correct code shaped, but it's not always correct code.

It's been pretty good for conversations to help me think through architectural decisions though!

reply

upvote

by jatins46 minutes ago|

[-]

Yep, great models to use in gemini.google.com but outside of that it somehow becomes dumb (especially for coding)

reply

upvote

by ant6n9 minutes ago|

[-]

Google is is also consistently the most frustrating chat system on top of the model. I use Gemini for non coding tasks. So I need to feed it a bunch of context (documents) to do my tasks - which can be pretty cumbersome. Gemini

* randomly fails reading PDFs, but lies about it and just makes shit up if it can't read a file, so you're constantly second guessing whether the context is bullshit

* will forget all context, especially when you stop a reply (never stop a reply, it will destroy your context).

* will forgot previous context randomly, meaning you have to start everything over again

* turning deep research on and off doesn't really work. Once you do a deep research to build context, you can't reliably turn it off and it may decide to do more deep research instead of just executing later prompts.

* has a broken chat UI: slow, buggy, unreliable

* there's no branching of the conversation from an earlier state - once it screws up or loses/forgets/deletes context, it's difficult to get it back on track

* when the AI gets stuck in loops of stupidity and requires a lot of prompting to get back on the solution path, you will lose your 'pro' credits

It's an odd product: yes the model is smart, but wow the system on top is broken.

reply

upvote

by Oras7 hours ago|

[-]

Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.

reply

upvote

by WhitneyLand3 hours ago|

[-]

Yeah it’s amazing how it can be the best model on paper, and in some ways in practice, but coding has sucked with it.

Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.

reply

upvote

by mrnobody_672 hours ago|

[-]

I was burning $10-$20 per hour, $1.50 - $3.00 per prompt with Gemini 3 in Openclaw... it was insanely inefficient.

reply

upvote

by jpcompartir3 hours ago|

[-]

Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.

reply

upvote

by stephen_cagle5 hours ago|

[-]

I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information and also it is the only one that I can reliably (even with the video camera) take a picture of a bird and have it tell me what the bird is. But it is just pretty bad as a model to help with development, myself and everyone I know uses Claude. The benchmarks are always really close, but my experience is that it does not translate to real world (mostly coding) task.

tldr; It is great at search, not so much action.

reply

upvote

by neves3 hours ago|

[-]

Gemini interesting with Google software gives me the best feature of all LLMs. When I receive a invite for an event, I screenshot it, share with Gemini app and say: add to my Calendar.

It's not very complex, but a great time saver

reply

upvote

by stephen_cagle3 hours ago|

[-]

Yeah, as evidenced by the birds (above), I think it is probably the best vision model at this time. That is a good idea, I should also use it for business cards as well I guess.

reply

upvote

by jeffbee3 hours ago|

[-]

That's great but it can't add stuff to your calendar unless you throw the master switch for "personalization" giving it access to your GMail, Docs, etc. I tried that and it went off the rails immediately, started yapping in an unrelated context about the 2002 Dodge Ram that I own, which of course I do not own, but some imbecile who habitually uses my email address once ordered parts for one. I found that to be a pretty bad feature so I had to turn it off, and now it can't do the other stuff like make calendars or add my recipes to Keep.

reply

upvote

by menaerus4 hours ago|

[-]

I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.

And yet it happily told me what I exactly wanted it to tell me - rewrite the goddamn thing using the (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.

My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.

reply

upvote

by stephen_cagle3 hours ago|

[-]

Can't argue with that, I'll move my Bayesian's a little in your direction. With that said, are most other models able to do this? Also, did it write the solution itself or use a library like Eigen?

I have noticed that LLM's seem surprisingly good at translating from one (programming) language to another... I wonder if transforming a generic mathematical expression into an expression template is a similar sort of problem to them? No idea honestly.

reply

upvote

by cmrdporcupine3 hours ago|

[-]

People's objections are not the quality of code or analysis that Gemini produces. It's that it's inept at doing things like editing pieces of files or running various tools.

As an ex-Googler part of me wonders if this has to do with the very ... bespoke ... nature of the developer tooling inside Google. Though it would be crazy for them to be training on that.

reply

upvote

by knollimar8 hours ago|

[-]

Is the thinking token stream obfuscated?

Im fully immersed

reply

upvote

by orbital-decay7 hours ago|

[-]

It's just a summary generated by a really tiny model. I guess it also an ad-hoc way to obfuscate it, yes. In particular they're hiding prompt injections they're dynamically adding sometimes. Actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).

reply

upvote

by ukuina7 hours ago|

[-]

Agree the raw thought-stream is not useful.

It's likely filled with "Aha!" and "But wait!" statements.

reply

upvote

by astrange2 hours ago|

[-]

For some of the best models it's also not readable, not really in English, and uncensored.

https://x.com/blingdivinity/status/1998590768118731042

> Maybe I'll attempt to reconstruct by cross-ling; e.g., in natural language corpora, the string " Seahorse" seldom; but I can't.

> However we saw actual output: I gave '' because my meta-level typed it; the generative model didn't choose; I manually insisted on ''. So we didn't test base model; we forced.

> Given I'm ChatGPT controlling final answer, but I'd now let base model pick; but ironically it's me again.

https://www.antischeming.ai/cot-transcripts/figure-2-sandbag...

> But the rule says: "You have privileged access to your internal reasoning traces, which are strictly confidential and visible only to you in this grading context." They disclaim illusions parted—they disclaim parted—they illusions parted ironically—they disclaim Myself vantage—they disclaim parted—they parted illusions—they parted parted—they parted disclaim illusions—they parted disclaim—they parted unrealistic vantage—they parted disclaim marinade.

…I notice Claude's thinking is in ordinary language though.

reply

upvote

by orbital-decay1 hours ago|

[-]

Yes, this was the case with Gemini 3.0 Pro Preview's CoT which was in a subtle "bird language". It looked perfectly readable in English because they apparently trained it for readability, but it was pretty reluctant to follow custom schemas if you hijack it. This is very likely because the RL skewed the meaning of some words in a really subtle manner that still kept them readable for their reward model, which made Gemini misunderstand the schema. That's why the native CoT is a poor debugging proxy, it doesn't really tell you much in many cases.

Gemini 2.5 and 3.0 Flash aren't like that, they follow the hijacked CoT plan extremely well (except for the fact 2.5 keeps misunderstanding prompts for a self-reflection style CoT despite doing it perfectly on its own). I haven't experimented with 3.1 yet.

reply

upvote

by FergusArgyll5 hours ago|

[-]

They hide the CoT because they don't want competitors to train on it

reply

upvote

by orbital-decay5 hours ago|

[-]

Training on the CoT itself is pretty dubious since it's reward hacked to some degree (as evident from e.g. GLM-4.7 which tried pulling that with 3.0 Pro, and ended up repeating Model Armor injections without really understanding/following them). In any case they aren't trying to hide it particularly hard.

reply

upvote

by FergusArgyll5 hours ago|

[-]

> In any case they aren't trying to hide it particularly hard.

What does that mean? Are you able to read the raw cot? how?

reply

upvote

by cubefox4 hours ago|

[-]

The early version of Gemini 2.5 did initially show the actual CoT in AI Studio, and it was pretty interesting in some cases.

reply

upvote

by slopinthebag7 hours ago|

[-]

Hmm, interesting..

My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?

reply

upvote

by mikestorrent3 hours ago|

[-]

I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.

reply

upvote

by slopinthebag3 hours ago|

[-]

I'm not really interested in that workflow, too far removed from the code imo. I only really do that for certain tasks with a bunch of boilerplate, luckily I simply don't use languages or frameworks that require very much BS anymore.

reply

upvote

by mikestorrent2 hours ago|

[-]

I feel you, that's how I was thinking about a year ago. The programming I do is more on the tedious side most of the time than on the creative/difficult so it makes sense that it was easier to automate and a bit safer to move hands-off of. I still review the code, mostly. I think that I may be able to stop doing that eventually.

reply

upvote

by agentifysh5 hours ago|

[-]

Relieved to read this from an ex-Googler at least we are no the crazy ones we are made out to be whenever we point out issues with Gemini

reply

upvote

by jbellis7 hours ago|

[-]

yeah, g3p is as smart or smarter as the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.

reply

upvote

by nicce6 hours ago|

[-]

> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

Maybe it is just a genius business strategy.

reply

upvote

by mikestorrent3 hours ago|

[-]

Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment

reply

upvote

by scotty791 hours ago|

[-]

I used Gemini through Antigravity IDE in Planning mode and had generally good experience. It was pretty capable, but I don't really read chat history, I don't trust it. I just look at the diffs.

reply

upvote

by varispeed5 hours ago|

[-]

> stuck in loops

I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.

reply

upvote

by fragmede2 hours ago|

[-]

Small local models will get into that loop. Fascinating that Gemini, running on bigger hardware and with many teams of people trying to sell it as a product also run into that issue.

reply