I think it's more subtle than that. IIUC, all of the tokens were present for computing the output, and the score is based on that output. It's only in the weight update that some of the tokens get ignored. So the learning is lossy, but the inference driving the learning is not.
Rather than a book that's missing words, it's more like a person with a minor learning disability that prevents him from recalling anything perfectly.
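To make that concrete, here's a rough sketch of the mechanics as I understand them, in PyTorch. The names (loss_mask, the toy model) are mine and purely illustrative, not anything from the actual implementation: the forward pass sees every token, but the loss mask zeroes out the withheld positions, so no gradient flows from them.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy LM purely for illustration (hypothetical, not the real setup).
    vocab_size, hidden = 100, 32
    model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

    input_ids = torch.randint(0, vocab_size, (1, 8))   # every token goes through the forward pass
    targets   = torch.randint(0, vocab_size, (1, 8))
    loss_mask = torch.tensor([[1, 1, 0, 0, 1, 1, 1, 0]], dtype=torch.float)  # 0 = ignored in training

    logits = model(input_ids)                          # logits are produced for every token, masked or not
    per_tok = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), reduction="none")
    loss = (per_tok * loss_mask.view(-1)).sum() / loss_mask.sum()
    loss.backward()                                    # gradients come only from the unmasked positions

The point is that the mask only enters at the loss, after the model has already seen everything.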
However, it occurs to me that data augmentation could easily break the scheme if care isn't taken.