It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like doing random sampling and then running gradient descent from each of those points: one of the runs stumbles onto the right result.
I am trying to think of the best way to give the most information about how the AI models fail, without revealing details that would help them overfit to those specific tests.
I am planning to add some extra LLM calls to summarize the failure reason without revealing the test itself, along the lines of the sketch below.
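
Roughly the kind of extra call I have in mind — a minimal sketch assuming the openai Python SDK; the model name, prompt wording, and the `summarize_failure` helper are placeholders, not a final design:

```python
# Sketch: a separate LLM call that summarizes why an output failed,
# instructed not to quote or reconstruct the hidden test itself.
from openai import OpenAI

client = OpenAI()

def summarize_failure(model_output: str, expected_behavior: str) -> str:
    """Ask a summarizer model to describe the failure category only."""
    prompt = (
        "An AI model produced the output below and failed an evaluation.\n"
        "Explain the likely reason for the failure in 2-3 sentences.\n"
        "Do NOT quote or reconstruct the test input or the expected answer; "
        "describe only the failure category (e.g. wrong format, missed edge "
        "case, arithmetic slip).\n\n"
        f"Expected behavior (high level): {expected_behavior}\n"
        f"Model output: {model_output}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The idea is that the summary gives enough signal to debug the model's behavior, while the actual test case never appears in anything that gets fed back.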