undefined

points

by serf8 hours ago |

comments

by nwah18 hours ago|

[-]

That was the point. Look up Goodhart's Law

by AndrewKemendo7 hours ago|

prev|

[-]

I have passed multiple CI polys

A poly is only testing one thing: can you convince the polygrapher that you can lie successfully

by madihaa8 hours ago|

prev|

[-]

A polygraph measures physiological proxies pulse, sweat rather than truth. Similarly, RLHF measures proxy signals human preference, output tokens rather than intent.

Just as a sociopath can learn to control their physiological response to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states.