IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what benchmarks said in real-world use at high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know of any other models where you can throw lots of docs at them and get proper context following and data extraction from wherever it's at to where you'd need it.

by scarmig 3 hours ago

> especially for biology where it doesn't refuse to answer harmless questions

Usually, when you decrease false positive rates, you increase false negative rates.

Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.

by Davidzheng 6 hours ago

I gather that Opus 4.6's strengths are in long-context agentic workflows? At least over Gemini 3 Pro preview, Opus 4.6 seems to have a lot of advantages.

by verdverm 6 hours ago

It's a giant game of leapfrog. Shift or stretch time out a bit and they all look equivalent.

by nkzd 5 hours ago

Google's models and CLI harness feel behind in agentic coding compared to OpenAI and Anthropic.

by simianwords 7 hours ago

The comparison should be with GPT 5.2 pro, which has been used successfully to solve open math problems.