upvote
What's the downside? Don't they stop when they hit diminishing returns?
reply
Wouldn't the model start overfitting at some point? Degrading generalization for accuracy on the training set.
reply
You’d think so, but I haven’t seen it explicitly discussed in their papers, and nobody else that I know of trains on that many tokens
reply