by karmasimida 5 hours ago

by onlyrealcuzzo 5 hours ago

Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

by Macha 4 hours ago

> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

by curly6 4 hours ago

> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.

via Anthropic

https://www.anthropic.com/research/measuring-agent-autonomy

this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this

https://www.adweek.com/media/google-gemini-ads-2026/

by onlyrealcuzzo 4 hours ago

> (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

Nobody is paying for Search. According to Google's earnings reports, AI Overviews is increasing overall clicks on ads and overall search volume.

by bayindirh 3 hours ago

So, apparently, switching to Kagi continues to pay dividends, elegantly.

No ads, no forced AI overview, no profit-centric reordering of results, plus being able to reorder results personally, and more.

by alphabetting 5 hours ago

The agentic benchmarks for 3.1 indicate Gemini has caught up. The gains from 3.0 to 3.1 are big.

For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:

1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%

by kakugawa 2 hours ago

In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.

by girvo 2 hours ago

Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

by avereveard 56 minutes ago

What's your opinion of glm5, if you've had a chance to use it?

by metadat 1 hour ago

Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.

by nl 1 hour ago

Codex has been post-trained for coding, including agentic coding tasks.

It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.

by 306bobby 1 hour ago

It could be problem-specific. There are certain non-programming things that opus seems better than sonnet at as well.

by 306bobby 1 hour ago

Swapped sonnet and opus in my last reply, oops.

by blueaquilae 2 hours ago

The marketing team agrees with the benchmark score...

by HardCodedBias 3 hours ago

LOL come on man.

Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).

If we see on HN that people are willing to switch their coding environment, we'll know "hot damn, they cooked". Otherwise this is another whiff by Google.

by drivebyhooting 1 hour ago

You can’t put Gemini and Meta in the same sentence. Llama 4 was DOA, and Meta has given up on frontier models. Internally they’re using Claude.

by not_ai 34 minutes ago

After spending all that money and firing a bunch of people? Is the new group doing anything at this point?

by hintymad 2 hours ago

My guess is that the Gemini team didn't focus on large-scale RL training for agentic workloads, and they are trying to catch up with 3.1.

by swftarrow 2 hours ago

I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.

by renegade-otter 2 hours ago

It's like anything Google: they do the cool part and then lose interest in the last 10%. Writing code is easy, building products that print money is hard.

by miohtama 2 hours ago

One does not need products if one has a monopoly on search.

by margorczynski 2 hours ago

That monopoly is worth less as time goes by and people increasingly use LLMs or similar systems to search for info. In my case, I've cut down a lot on Googling since more competent LLMs appeared.

by ionwake 5 hours ago

Can you explain what you mean by it's bad at agentic stuff?

by karmasimida 5 hours ago

Accomplishing the task I give it without fighting me on it.

I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want but has not explicitly stated. Gemini seems particularly bad at that recall, where it goes out of bounds.
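
To make the precision/recall analogy concrete, here is a rough sketch. It assumes an agent run can be reduced to a set of action labels and compared against the set of actions the user actually wanted. The action names and the task are made up for illustration and do not come from any real benchmark or Gemini log.

```python
def precision_recall(taken: set[str], wanted: set[str]) -> tuple[float, float]:
    """Precision: share of the agent's actions that the user wanted.
    Recall: share of the user's wanted actions that the agent performed."""
    hits = len(taken & wanted)
    precision = hits / len(taken) if taken else 0.0
    recall = hits / len(wanted) if wanted else 0.0
    return precision, recall


# Hypothetical task: the user asks for a bug fix and implicitly expects a test.
wanted = {"fix_bug", "add_test"}

# Stays strictly on task but misses the unstated expectation.
on_task = {"fix_bug"}

# Covers everything wanted, but also wanders out of bounds.
out_of_bounds = {"fix_bug", "add_test", "rename_module", "reformat_repo"}

print(precision_recall(on_task, wanted))        # (1.0, 0.5)
print(precision_recall(out_of_bounds, wanted))  # (0.5, 1.0)
```

Under this framing, staying on task shows up as high precision, and picking up unstated wants shows up as high recall; an agent that goes out of bounds trades the first for the second.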

by ionwake 1 hour ago

Cool, thanks for the explanation.