undefined

points

[-]

Even that would be more meaningful test. They basically coated the ball with a strong smell, then they prepped the dog with that smell, then set it loose in a 5x5 meter area.

"Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior")."

by kennywinker3 hours ago|

prev|

[-]

I don’t think mythos can ingest an entire codebase into context. So it’s spinning off sub-agents to process chunks. Which supports their thesis: the harness is the moat. The tooling is whats important, the model is far far less important.

by bhouston3 hours ago|

parent|

[-]

Mythos was clear it was one agent per chunk. But this positive confirming results do not actually disprove anytime with Mythos, because it is only one side of the discriminator challenge - you got positives, but we do not know your false positive rate and your false negative rate.

by kennywinker3 hours ago|

parent|

[-]

In TFA they talk a fair bit about how different models perform wrt false positives:

“The results show something close to inverse scaling: small, cheap models outperform large frontier ones.”