undefined

points

[-]

That's not how it works though. When you prepare the conversations for distillation, it's the most trivial and obvious first step to replace "Qwen" with "Claude" and vice versa. I doubt they'd simply forget to do it.

A model may misidentify itself due to the surrounding context. When a model is about to answer "I'm ...", what follows is a sorted list of probabilities for what the next token should be. In most models it's usually a list of popular model names: say, in the list, first comes Claude, then Qwen, then ChatGPT etc. Usually the "Claude" token would be the most probable token, say 70%. But if the surrounding context is in Chinese, the embeddings for "something to do with China" may nudge the combined embedding of the output token towards the "Qwen" embedding more ("China+Claude=Qwen" in the embedding space). Say, the probability for "Qwen" now becomes 60% instead of 10%.

If we also use high temperature for more "creativity", the token sampler now may choose "Qwen". It's not the most probable token still, but it was chosen because selecting the 2nd most probable token once in a while usually allows a model to explore unexpected "creative" paths, and 60% probability is good enough compared to 70%. It's basically a hallucination.

I once made an experiment: if I ban the word "Qwen" in the inference engine entirely, and ask Qwen "which model are you?", it happily starts announcing it's Claude 100% time, simply because "Claude" is the next most probable token after "Qwen" in this context.