The model is just trying to map from sequence to next token. You could say that it doesn't really care about the relationships between words/tokens - it is just being trained to learn the best attention/etc weights to make this mapping as accurate as possible.

The model could just as well learn to predict the next token from gibberish text, as long as the gibberish had some statistical regularities to learn. However, if you train it on real, meaningful text, then the statistical regularities it needs to learn (and will, thanks to gradient descent and a capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.
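
To make the gibberish point concrete, here is a minimal sketch. It is a bigram counter, not a transformer, and the made-up tokens and the helper names (`train_bigram`, `predict_next`) are purely illustrative assumptions. The learner has no idea what "zug" means, yet it picks up the regularity that "blat" tends to follow it:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the token most frequently seen after `prev`."""
    return counts[prev].most_common(1)[0][0]

# Meaningless tokens with a planted regularity: "zug" is always followed by "blat".
gibberish = "zug blat fim zug blat ork fim zug blat ork".split()
model = train_bigram(gibberish)

print(predict_next(model, "zug"))   # the learned continuation of "zug"
print(predict_next(model, "blat"))  # the most frequent continuation of "blat"
```

Swap the gibberish for real text and the same procedure would start picking up grammar-like regularities instead; nothing in the learner changes, only the data.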

So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.

You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of a word comes from how it is used. To a first approximation, that can be implemented by treating a word's meaning as defined by the words that appear near it(!), which is what the Word2Vec training algorithm does: it learns vectors that are good at predicting a word from its surrounding context window (or the context from the word). Famous examples such as "(king - man) + woman ≈ queen" suggest that this does in fact capture a good part of the meanings of words.

At a high level, the text samples are how the relationships are derived. If we treat text samples as sequences of tokens, then those sequences describe the joint distribution of tokens that occur together, and that distribution is what establishes the relationships between them. IIRC, this is related to the distributional hypothesis in NLP: the idea that words should have similar semantics if they occur in similar contexts.
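
A rough sketch of the distributional hypothesis, assuming a tiny made-up corpus and a context window of one word on each side (both are illustrative choices, as are the function names). Each word is represented only by counts of its neighbours, and words used in similar contexts end up with similar vectors:

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=1):
    """Represent each word by counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus: "cat" and "dog" share contexts, "car" does not.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases birds",
    "the car needs fuel",
]
vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_car = cosine(vecs["cat"], vecs["car"])
print(sim_cat_dog > sim_cat_car)
```

Nothing here knows what a cat is; "cat" and "dog" come out closer than "cat" and "car" purely because of the company they keep.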
If I handed you thousands of documents which said “Jan-Michael Vincent” all over them, would you need to understand who that is in order to notice the relationship there?
How does evolution learn the form-fitness relationship?

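It's the same thing here: various token-relationship values get tried, and the ones which are slightly better are favoured. (In actual training the changes aren't random; gradient descent computes which direction is slightly better and steps that way, but the selection pressure is the same.)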
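
For what the analogy looks like taken literally, here is a toy mutate-and-select loop. The target vector, the squared-error loss, and the `evolve` helper are all made up for illustration, and this is not how an LLM is actually trained; it only shows that "keep whatever is slightly better" is enough to find structure nobody told the learner about:

```python
import random

def loss(weights, target):
    """Squared distance from the (hidden) target; lower is fitter."""
    return sum((w - t) ** 2 for w, t in zip(weights, target))

def evolve(target, steps=2000, sigma=0.1, seed=0):
    """Randomly mutate the weights and keep a mutation only if it lowers the loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(target)
    best = loss(weights, target)
    for _ in range(steps):
        candidate = [w + rng.gauss(0.0, sigma) for w in weights]
        candidate_loss = loss(candidate, target)
        if candidate_loss < best:  # selection: only improvements survive
            weights, best = candidate, candidate_loss
    return weights, best

target = [1.0, -2.0, 0.5]  # hypothetical "right answer" the loop never sees directly
initial_loss = loss([0.0] * len(target), target)
weights, final_loss = evolve(target)
print(initial_loss, final_loss)
```

The loop only ever sees a fitness score, never the target itself, which is the point of the evolution comparison.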
