by mrngld 7 hours ago

by undecidabot 5 hours ago

It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.

[1] https://z.ai/blog/glm-5.2

by mrngld 14 minutes ago

If that ends up being true, GPT-5.5 at 70 (and presumably Fable a bit ahead of that) is still in a different league, which was partly my point. To listen to the online chatter, you'd think GLM 5.2 is a tectonic shift in the landscape. In reality, it's just interesting. It's probably safe to bet that once the DeepSWE benchmarks are fully updated, it won't even be on the Pareto frontier.

I'm not accusing anyone specifically, but I've noticed Chinese bots swamping certain YouTube channels that, for example, cover US defense industry news. They'll downplay any and all technical advances, play up China's dominance, US cowardice, etc. All very transparent. I suspect some of the online conversation about open Chinese models is driven by that. How often do you see people talking about Mistral or Trinity? Never. Because they don't play that game.

by lukewarm707 5 hours ago

with open models you can get a subscription with privacy, at the same cost as codex.

openai, google and anthropic subscriptions are not available with privacy.

looking at the link there, it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't run any open models in codex.

so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.

by vadansky 5 hours ago

> with open models you can get a subscription with privacy

Unless you're running it locally, aren't you just trusting some other entity?

by conception 1 hour ago

While true, there are laws requiring that you actually do what you say you are doing, especially in certain regulated environments. If you are in the same country as the entity you are trusting, you usually have recourse in some form or another if they do not live up to that trust.

by yieldcrv 5 minutes ago

right, and on prem being an option is a godsend, however you manage to do it

it's not a recommendation, it's an option. if you don't have capital then it doesn't apply to you, so move on. before, it wasn't an option even for people with capital.

come back in a few years when it's more accessible

additionally I like that there are providers with faster special purpose processors for faster tokens/sec, all at different pricing strategies

so just pick something that matches your personal risk tolerance

by lukewarm707 4 hours ago

correct, you are trusting another entity.

however the legal terms are different. openai reads your data. they store it for 30 days, but of course once it hits the disk it can be kept indefinitely, as in a civil case like nyt v openai.

the same for google and anthropic. so, it's not always nice if someone is paid to read your data for safety. people upload sensitive matters, personal videos and so on.

i wouldn't prioritise it myself but you can also know that the data will all come out in discovery if you are in a legal issue. maybe that's not important, but people thought it did matter to give some protections to patient records, legal advice and therapy. you upload that to gpt and it goes into discovery.

by ttul 5 hours ago

DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.

https://deepswe.datacurve.ai/

Fable 5 is cool and all, but we have not yet seen GPT-5.6.

by cmrdporcupine 7 hours ago

I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit standpoint.

by pjerem 6 hours ago

I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.

With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because I don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.

by zooming 7 hours ago

Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.

by spelk 3 hours ago

I've found MiMo-2.5 is fun for front-end design since you can use its multimodal capabilities to drop in whatever it produced and have it correct the result.

by re-thc 5 hours ago

> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

GPT can find fault in everything and anything including its own work.

by gbingles 4 hours ago

AI review generally will find fault in anything. Any non-trivial code has multiple solutions with different tradeoffs. Any code can be over-engineered for theoretical edge cases and future use cases you don't need. No matter which solution you pick you can always at a minimum say that some alternative just looks and reads better.

Code is somewhat artistic. If you don't have well defined standards and priorities, the AI review cycle can spiral infinitely figuratively debating what makes art good, and your code will be no better for it.

by cmrdporcupine 3 hours ago

This is correct, but I'd say there's something beyond that which is specific to Codex + GPT models. They've done some sort of training that makes it far more diligent about seeking out data races, unhandled errors / negative cases, and missing test coverage than the other models I've played with. It also seems more prone to testing its hypotheses.

This makes it slower to work with for prototyping, and it will, if not properly disciplined, litter your code with "legacy adapters" and "bridge code" and temporary incremental refactoring steps [arguably not terrible for work in real commercial software projects]. And it will create too many unit & integration tests, if you're not careful.

But it does, in my opinion, tend to produce more reliable software and I trust it far more than I did when I was working in Claude.

When I could afford it, I had both plans running: Claude to produce new features, and then Codex to brutally critique it, battle-test it, sharpen the edges, and produce better tests. This flow went extremely well.

Now I just work with Codex and various open models.

by cmrdporcupine 4 hours ago

That's what I love about it, and I wish I could find an open model that was as diligent.

Somehow it's just way more careful than the others, and also much better at empirical verification of its hypotheses, writing tests, etc. I am assuming a lot of RL was done on that kind of flow, and on seeking out negative cases, failure points, race conditions.