My understanding of pixel representation is: slice a grid in an image, each square slice gets projected into a number array of x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric) and AGAIN gets passed into "position detector" which outputs a token representing that pixel/position. which gets passed into the lmm (at a significantly reduced/translated signal into token space).
First, before continuing: do I have that mostly correct?
There is no such projection step. The array of x numbers is the token. For text, there is a one-to-one correspondence between the textual representation of a token, its index in the vocabulary of the model, and the array of x numbers that is fed into the linear algebra of the model, so people often equivocate between them; but for images or sound, there is no discrete vocabulary and no textual representation, only the array of x numbers.