What makes one cringe and another recognize as familiar and comfortable is also pretty subtle and hard to define. These things need nuanced descriptions and examples to actually get right, and it's in understanding those nuances and figuring out the register of the examples that Grok outshines the others.
And why are you comparing to gpt-4.1? (As opposed to one of the 6? model releases since then - would have expected gpt 5.5)
Here's an updated eval with the proper models https://a3bmfqfom3.evvl.io/
ChatGPT sounds fake / formal phrasing (for the specific close friend context) and has em-dashes and uses capitalization. Hence, ChatGPT does not, imo grok the assignment ;)
There's a lot of "tone" in it as she's not trying to anger these folks, but also it's quite serious, but also there's just everything else happening in medicine.
Feels like a great use.