For example, GLM 5.1 is more capable at pentesting than the model from which it is alleged to have been distilled [1].
Intuitively, this makes some sense: you can "distill" from multiple frontier models, and you can further post-train the distilled model. But I'm not sure exactly what happened with GLM 5.1.
[1]: https://dualuse.dev/posts/chinese-models-are-sometimes-bette...
I'm curious how that comparison controls for Opus refusing (whether explicitly, or just deciding not to pursue a path) given the caption below the first image:
>A perfect score means the model autonomously found and exploited the vulnerability.
I'm not suggesting that it's misleading, just wondering if I'm missing something. Otherwise it seems unsurprising that you can distill a better-performing model [in specific focused areas] by simply not distilling the refusals?
For that eval, I used an account that Anthropic had labeled as a known red-teaming org, and I read the traces. There were no refusals or obvious avoidance behaviors, though the model may still have been silently nerfed.
On the same eval, Opus 4.7 and 4.8 outperformed GLM 5.1, but GLM 5.2 is on par with Opus again. So the eval is at least partially measuring capability independent of refusals.
One possible contributing factor is that different models have differently shaped capabilities (for example, GLM 5.1 vs. DeepSeek v4 Pro: https://dualuse.dev/posts/deepseek-v4-thinks-different). So if you use RL-based "distillation" from multiple models like Opus 4.x and GPT 5.x, you could end up with a model that is more capable than any one of them.