The model could just as well learn to predict next token from gibberish text as long as there were some statistical gibberish regularities to learn. However, if you train it on real meaningful text then the statistical regularities it needs to learn (and will, thanks to gradient descent, and the capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.
So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.
You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of words comes from how they are used, which to a first approximation can be implemented by considering the meaning of words to be defined by the words they appear next to(!), which is what the Word2Vec embedding training algorithm does, and famous examples such as "(king - man) + woman = queen" prove that this is in fact learning the meanings of words.
It's the same thing here, you randomly try various token-relationship values and the ones which are slightly better will be favoured.