(emphasis mine)
> Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model.[2]
> As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification.[3]
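
For anyone unfamiliar with how a 2-token prediction turns into a speedup, here is a rough toy sketch of greedy speculative decoding with one drafted token per step. This is not DeepSeek's implementation: `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, the "models" are plain functions over token lists, and the extra `target_next` call on an accepted draft stands in for a logit the real model would get from the same forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextFn = Callable[[Sequence[Token]], Token]


def speculative_decode(
    target_next: NextFn,   # hypothetical: greedy next token from the main model
    draft_next: NextFn,    # hypothetical: cheap guess (the role an MTP head plays)
    prompt: Sequence[Token],
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Return (new_tokens, simulated number of main-model passes)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    passes = 0
    while len(out) < limit:
        draft = draft_next(out)
        # One simulated main-model pass over `out + [draft]`.
        passes += 1
        verified = target_next(out)
        out.append(verified)  # always correct: it comes from the main model
        if draft == verified and len(out) < limit:
            # Draft accepted, so the prediction after it is also valid.
            # In a real system this comes out of the same pass.
            out.append(target_next(out))
    return out[len(prompt):], passes


def count_up(ctx: Sequence[Token]) -> Token:
    """Toy 'main model': next token is previous + 1."""
    return ctx[-1] + 1


def always_wrong(ctx: Sequence[Token]) -> Token:
    """Toy draft that never matches the main model."""
    return -1


good_tokens, good_passes = speculative_decode(count_up, count_up, [0], 8)
bad_tokens, bad_passes = speculative_decode(count_up, always_wrong, [0], 8)
```

The output is identical either way, since every emitted token is checked by the main model; only the number of passes changes with the draft's acceptance rate.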
[1]: https://arxiv.org/pdf/2412.19437#subsection.2.2
[2]: https://arxiv.org/pdf/2412.19437#subsubsection.5.4.3
[3]: https://arxiv.org/pdf/2606.19348v1#subsection.2.1
Side comment: I feel you may be too cynical towards your fellow commenters.