There is no such projection step. The array of x numbers is the token. For text, there is a one-to-one correspondence between the textual representation of a token, its index in the vocabulary of the model, and the array of x numbers that is fed into the linear algebra of the model, so people often equivocate between them; but for images or sound, there is no discrete vocabulary and no textual representation, only the array of x numbers.