undefined

points

[-]

They have had the best math models for about a year most folks just didn't know about it. You can't find inference on APIs, but I run these at home, this is also the advantage of open models.

https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

by ozgune4 hours ago|

prev|

[-]

I reviewed how DeepSeek V4-Pro, Kimi 2.6, Opus 4.6, and Opus 4.7 across the same AI benchmarks. All results are for Max editions, except for Kimi.

Summary: Opus 4.6 forms the baseline all three are trying to beat. DeepSeek V4-Pro roughly matches it across the board, Kimi K2.6 edges it on agentic/coding benchmarks, and Opus 4.7 surpasses it on nearly everything except web search.

DeepSeek V4-Pro Max shines in competitive coding benchmarks. However, it trails both Opus models on software engineering. Kimi K2.6 is remarkably competitive as an open-weight model. Its main weakness is in pure reasoning (GPQA, HMMT) where it trails Opus.

Speculation: The DeepSeek team wanted to come out with a model that surpassed proprietary ones. However, OpenAI dropped 5.4 and 5.5 and Anthropic released Opus 4.6 and 4.7. So they chose to just release V4 and iterate on it.

Basis for speculation? (i) The original reported timeline for the model was February. (ii) Their Hugging Face model card starts with "We present a preview version of DeepSeek-V4 series". (iii) V4 isn't multimodal yet (unlike the others) and their technical report states "We are also working on incorporating multimodal capabilities to our models."

by alansaber1 hours ago|

prev|

[-]

Very interesting. I wonder how much of this is due to the context length. I am unclear on the implementation strategy, you ran this problem as a 1-shot using chat mode, or using each on an agent harness?

by segmondy56 minutes ago|

parent|

[-]

Has nothing to do with context length, they have experience training math models, they have a model that would take gold in IMO and a lean prover. Both have been out for almost a year.

by lifty6 hours ago|

prev|

[-]

Wondering how gpt 5.5 is doing in your test. Happy to hear that DeepSeek has good performance in your test, because my experience seems to correlate with yours, for the coding problems I am working on. Claude doesn't seem to be so good if you stray away from writing http handlers (the modern web app stack in its various incarnations).

by hodgehog115 hours ago|

parent|

[-]

Very cool to hear there is agreement with (probably quite challenging?) coding problems as well.

Just ran a couple of them through GPT 5.5, but this is a single attempt, so take any of this with a grain of salt. I'm on the Plus tier with memory off so each chat should have no memory of any other attempt (same goes for other models too).

It seems to be getting more of the impressive insights that Gemini got and doing so much faster, but I'm having a really hard time getting it to spit out a proper lengthy proof in a single prompt, as it loves its "summaries". For the random matrix theory problems, it also doesn't seem to adhere to the notation used in the documents I give it, which is a bit weird. My general impression at the moment is that it is probably on par with Gemini for the important stuff, and both are a bit better than DeepSeek.

I can't stress how much better these three models are than everything else though (at least in my type of math problems). Claude can't get anything nontrivial on any of the problems within ten (!!) minutes of thinking, so I have to shut it off before I run into usage limits. I have colleagues who love using Claude for tiny lemmas and things, so your mileage may vary, but it seems pretty bad at the hard stuff. Kimi and GLM are so vague as to be useless.

by lifty4 hours ago|

parent|

[-]

My work is on a p2p database with quite weird constraints and complex and emergent interactions between peers. So it's more a system design problem than coding. Chatgpt 5.x has been helping me close the loop slowly while opus did help me initially a lot but later was missing many of the important details, leading to going in circles to some degree. Still remains to be seen if this whole endeavour will be successful with the current class of models.

by nibbleyou6 hours ago|

prev|

[-]

Curious to know what kind of problems you are talking about here

by hodgehog116 hours ago|

parent|

[-]

I don't want to give away too much due to anonymity reasons, but the problems are generally in the following areas (in order from hardest to easiest):

- One problem on using quantum mechanics and C*-algebra techniques for non-Markovian stochastic processes. The interchange between the physics and probability languages often trips the models up, so pretty much everything tends to fail here.

- Three problems in random matrix theory and free probability; these require strong combinatorial skills and a good understanding of novel definitions, requiring multiple papers for context.

- One problem in saddle-point approximation; I've just recently put together a manuscript for this one with a masters student, so it isn't trivial either, but does not require as much insight.

- One problem pertaining to bounds on integral probability metrics for time-series modelling.

by MinimalAction31 minutes ago|

parent|

[-]

Regarding the first problem: are you looking at NCP maps for non-Markovian processes given you mention C*-algebra? Or is it more of a continuous weak monitoring of a stochastic system that results in dynamics with memory effects?

I'd be very curious to know how any LLMs fare. I completely understand if you don't want to continue the discussion because of anonymity reasons.

by pm2r6 hours ago|

parent|

prev|

[-]

It would be wonderful to have a deeper insight, but I understand that you can disclose your identity (I understand that you work in applied research field, right ? )

by hodgehog115 hours ago|

parent|

[-]

Yes, I do mostly applied work, but I come from a background in pure probability so I sometimes dabble in the fundamental stuff when the mood strikes.

Happy to try to answer more specific questions if anyone has any, but yes, these are among my active research projects so there's only so much I can say.