I thought "thinking" is literally the model generating additional text in a human language that shows its "thought process". It's added to the model's context, which helps it reason better because it now has this self-generated context.
The "their own language" idea seems to come from some recent science fiction where LLMs develop their alien language and take over the world by 2037 or something.
Research only showed that thinking might be disconnected from the final output but in my experience they are very strongly correlated in recent models
It is trivial to regularly spot obvious contradictions and inconsistencies if you read carefully. For example I've encountered traces that amounted to "I can deduce X, therefore Y, so that means Z" but then the model turns around and outputs "the answer is W because X". It's even been demonstrated that having the model output placeholder tokens or other gibberish instead of "thoughts" still improves performance. However the thinking traces can still be useful to the end user regardless.
I think that for DeepSeek problem (thinking and replying in Chinese) everything is kinda simpler: in their official chat, they're probably using some kind of system prompt which is (probably) written in Chinese, so that's why model may prefer Chinese in it's output.
They occasionally show snippets of CoT in papers they write, e.g. for o3/o4/GPT5 models [1] or Claude 3.5 Haiku [2].
[1]: https://openai.com/index/evaluating-chain-of-thought-monitor... [2]: https://transformer-circuits.pub/2025/attribution-graphs/bio...
Or hallucinated
I use the API however, not the chat interface.
It also happened a handful of times with Anthropic models.