upvote
Because DDR3/4/5 dies are made to a price with half to three quarters of their IO pins shared between the dies in parallel on a rank of a channel, and for capacity often up to around 6 ranks per channel. E.g. high capacity server DDR4 memory, say on AMD SP3, may have 108 dies on each of 8 channels of a socket.

So if you can move complexity over to the controller you can spend 100:1 ratio in unit cost. So you get to make the memory dies very dumb by e.g. feeding a source synchronous sampling clock that's centered on writes and edge aligned on reads leaving the controller to have a DLL master/slave setup to center the clock at each data group of a channel and only retain a minimal integer PLL in the dies themselves.

reply
Because loading XMP profile won't suffice and they have to tune the parameters further to be able to actually run the sticks
reply
A large section of the article is dedicated to the answer for this question.
reply
Imprecision in manufacturing (adjust resistor values), different trace lengths (speed of light differences for parallel signals), etc... it's in the article.
reply