A token is usually more than a single character, and an image gets decomposed into tokens as well (god knows how many), which presumably map to similarly float-hungry vectors. Your counterargument could use a bit more flesh.

And we're talking about images of text, not complex imagery such as a very detailed scene.

reply
I kinda wonder if it's extracting usable context from the 2D proximity between lines? Normal text input wouldn't carry that kind of information (though it could, and it's arguably just a lookahead/lookbehind of N characters on average).
reply
>Text tokens are high-dimensional vectors,

You are conflating tokens with embeddings.

A token fits in a single machine word. Modern GPT models use a vocabulary of roughly 200k entries, so a token ID fits in 18 bits (2^18 = 262,144).
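
To put rough numbers on the distinction, here's a minimal sketch of the arithmetic. The 200k vocabulary size is the figure from above; the embedding width of 4096 and the 16-bit floats are hypothetical values picked only for illustration, not the parameters of any particular model.

```python
def bits_for_vocab(vocab_size: int) -> int:
    """Smallest number of bits that can index vocab_size distinct token IDs."""
    # IDs run from 0 to vocab_size - 1, so size the field for the largest ID.
    return max(1, (vocab_size - 1).bit_length())


def embedding_bits(d_model: int, bits_per_float: int = 16) -> int:
    """Bits needed to store one embedding vector of d_model floats."""
    return d_model * bits_per_float


VOCAB_SIZE = 200_000  # figure quoted in the comment above
D_MODEL = 4096        # assumption: hypothetical embedding width

id_bits = bits_for_vocab(VOCAB_SIZE)
vec_bits = embedding_bits(D_MODEL)

print(f"token ID:  {id_bits} bits")
print(f"embedding: {vec_bits} bits ({vec_bits / id_bits:.0f}x larger)")
```

Under those assumptions the token ID is a small integer, while the vector it gets looked up into is thousands of times bigger. The ID is the token; the vector is the embedding.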

Have a good one
