The size is indeed smaller. Text tokens and image tokens are embedded as vectors of the same size, but a text token typically covers only a few characters, while an image token typically covers a patch of many pixels, enough that the text rendered inside one patch can hold more characters than a text token does. So the same text takes up fewer tokens as an image than as text, and hence requires less time and memory to process.
You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.
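
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is an assumption picked for illustration (about 4 characters per text token, 32×32-pixel image patches, glyphs rendered at 6×10 pixels, 100 characters per line), and the function names are hypothetical; real tokenizers and vision encoders will differ.

```python
import math


def text_token_count(num_chars: int, chars_per_token: float) -> int:
    """Estimate the text tokens needed, given an assumed average token length."""
    return math.ceil(num_chars / chars_per_token)


def image_token_count(
    num_chars: int,
    chars_per_line: int,
    char_width_px: int,
    line_height_px: int,
    patch_px: int,
) -> int:
    """Estimate the image tokens needed when the text is rendered as a page.

    Assumes a monospaced font and one token per square patch of
    patch_px x patch_px pixels, with no further compression.
    """
    lines = math.ceil(num_chars / chars_per_line)
    width_px = chars_per_line * char_width_px
    height_px = lines * line_height_px
    patches_across = math.ceil(width_px / patch_px)
    patches_down = math.ceil(height_px / patch_px)
    return patches_across * patches_down


NUM_CHARS = 4000  # assumed length of the text

# Typical regime (assumed values): short text tokens, large image patches.
typical_text = text_token_count(NUM_CHARS, chars_per_token=4)
typical_image = image_token_count(
    NUM_CHARS, chars_per_line=100, char_width_px=6, line_height_px=10, patch_px=32
)

# Inverted regime (assumed values): long text tokens, tiny image patches.
inverted_text = text_token_count(NUM_CHARS, chars_per_token=20)
inverted_image = image_token_count(
    NUM_CHARS, chars_per_line=100, char_width_px=6, line_height_px=10, patch_px=4
)

print("typical:  text tokens =", typical_text, " image tokens =", typical_image)
print("inverted: text tokens =", inverted_text, " image tokens =", inverted_image)
```

Under the first set of assumptions the rendered page needs fewer tokens than the raw text; under the second set the relationship flips, as described above.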