Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
DeepSeek-OCR demonstrated 10x+ compression by encoding text as visual embeddings, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted on the GitHub page, but it is a promising research direction.
[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...
And we're talking about images of text, not images that represent complex imagery such as a very detailed scene or what have you.
You are conflating tokens with embeddings.
A token ID fits in a single machine word. Modern GPT models use a vocabulary of about 200k possible values, which fits into 18 bits.
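For anyone who wants to check the arithmetic in this exchange, here is a rough sketch. It assumes float32 embeddings (the comment above only says "1024 float values"), and the helper names are made up for illustration:

```python
def bits_for_vocab(vocab_size: int) -> int:
    """Smallest number of bits that can index vocab_size distinct token IDs."""
    # IDs run from 0 to vocab_size - 1, so size the largest ID.
    return (vocab_size - 1).bit_length()


def embedding_bits(dim: int, bits_per_float: int = 32) -> int:
    """Raw storage for one embedding vector (float32 is an assumption)."""
    return dim * bits_per_float


if __name__ == "__main__":
    id_bits = bits_for_vocab(200_000)   # discrete token ID
    emb_bits = embedding_bits(1024)     # dense vector for the same token
    print(f"token ID:  {id_bits} bits")
    print(f"embedding: {emb_bits} bits")
```

The two numbers measure different things: the ID is the information needed to name the token, while the embedding is the representation the model computes with after a table lookup.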
Have a good one
The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be a random-looking colourful palette. It might not even need to use 8 bits per character, since ASCII is 7 bits/character.
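A minimal sketch of that idea, assuming the text is plain 7-bit ASCII and that each pixel stores three raw bytes (R, G, B). The function names are hypothetical, and no image library is involved; this only shows the bit packing:

```python
def pack_7bit(text: str) -> bytes:
    """Pack ASCII text at 7 bits per character into a byte string."""
    acc, nbits, out = 0, 0, bytearray()
    for b in text.encode("ascii"):  # raises UnicodeEncodeError on non-ASCII
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_7bit(packed: bytes, n_chars: int) -> str:
    """Inverse of pack_7bit; n_chars tells us where the padding starts."""
    acc, nbits, chars = 0, 0, []
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(chars) < n_chars:
            nbits -= 7
            chars.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(chars).decode("ascii")


def to_pixels(packed: bytes) -> list:
    """Group packed bytes into (R, G, B) tuples, zero-padding the last pixel."""
    padded = packed + b"\x00" * (-len(packed) % 3)
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


if __name__ == "__main__":
    msg = "hello world"
    packed = pack_7bit(msg)
    print(len(msg), "chars ->", len(packed), "bytes ->", len(to_pixels(packed)), "pixels")
    print(unpack_7bit(packed, len(msg)))
```

This saves one bit in eight over storing a byte per character, and the decoder still needs the character count, which is the sort of thing the OCR-able top line would have to carry.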
You can achieve this by changing the extension of an image file from .bmp to .txt
Guys, not to be mean, but maybe chill with the state of the art research and go back to studying fundamentals.