And we're talking about images of text, not images of complex imagery such as a very detailed scene or what have you.
You are conflating tokens with embeddings.
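A token is just an integer ID and fits in a single machine word. Modern GPT models use a vocabulary with roughly 200k possible values, so a token ID fits in 18 bits (2^18 = 262,144). An embedding, by contrast, is the vector of floats the model looks up for that ID, which is far larger.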
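To make the arithmetic concrete, here is a rough sketch of the size difference. The 200k vocabulary figure is from above; the embedding width (4096) and 16-bit floats are assumptions picked purely for illustration, not the published numbers for any particular model, and both helper names are made up for this example.

```python
def bits_per_token_id(vocab_size: int) -> int:
    """Minimum number of bits needed to index a vocabulary of this size."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    # IDs run from 0 to vocab_size - 1, so size the largest ID.
    return (vocab_size - 1).bit_length()


def bits_per_embedding(dim: int, bits_per_value: int) -> int:
    """Storage for one embedding vector: one value per dimension."""
    return dim * bits_per_value


token_bits = bits_per_token_id(200_000)

# Assumed for illustration only: 4096 dimensions, 16-bit floats.
embedding_bits = bits_per_embedding(4096, 16)

print(f"token ID:  {token_bits} bits")
print(f"embedding: {embedding_bits} bits")
```

So under those assumptions the ID is a couple of bytes while the embedding it maps to is several kilobytes, which is why the two shouldn't be treated as the same thing.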
Have a good one