Given the way the test was structured, it does line up.

https://arxiv.org/abs/2503.23674

Surprisingly good. I wonder how they would have done without the 5-minute limit on conversations (an average of 8 messages per conversation, per the study).