The problem is using a language model to assess images.

Probably 80% of "LLMs are below expectations" complaints (from the general population) involve some form of image analysis.

Image tokenization is hard because, unlike language tokenization, where every token is extremely dense with meaning, image tokens tend to be meaningless or irrelevant but are processed all the same.
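To make that concrete, here is a rough sketch, assuming a ViT-style scheme that cuts the image into fixed 16x16 patches with one token per patch. That scheme, the function name `count_patches`, and the toy image are my own assumptions for illustration, not how any particular model actually does it.

```python
# Hypothetical illustration: assume one token per fixed-size square patch
# (a ViT-style scheme; not any specific model's real pipeline).

def count_patches(pixels, patch=16, background=0):
    """Return (total_patches, patches_with_content) for a 2D grid of ints."""
    height, width = len(pixels), len(pixels[0])
    total = informative = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total += 1
            block = (
                pixels[y][x]
                for y in range(top, min(top + patch, height))
                for x in range(left, min(left + patch, width))
            )
            # A patch counts as informative if any pixel differs from the background.
            if any(value != background for value in block):
                informative += 1
    return total, informative


# A blank 224x224 canvas with a single horizontal "toothpick" 100 pixels long.
image = [[0] * 224 for _ in range(224)]
for x in range(50, 150):
    image[112][x] = 1

total, informative = count_patches(image)
print(total, informative)  # 196 7
```

Under these assumptions, 189 of the 196 tokens describe empty background, yet every one of them is still processed.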

Give a SOTA LLM a picture of toothpicks and ask it to move one to make a square, and it will probably struggle and fumble it. But give a mid-size LLM from 2 years ago the same problem in verbal form, and it will nail it almost every time.

The takeaway: do everything you can to avoid making the LLM rely on images for the answer.
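For the toothpick case, "verbal form" could look something like the sketch below. The helper `describe_toothpicks` and the coordinate layout are made up for illustration; the point is only that the puzzle state goes into the prompt as text instead of pixels.

```python
def describe_toothpicks(segments):
    """Turn a list of ((x1, y1), (x2, y2)) segments into a text prompt.

    Hypothetical helper: the wording and coordinate convention are assumptions.
    """
    lines = [f"There are {len(segments)} toothpicks on a grid."]
    for i, ((x1, y1), (x2, y2)) in enumerate(segments, start=1):
        lines.append(f"Toothpick {i} runs from ({x1}, {y1}) to ({x2}, {y2}).")
    lines.append("Move exactly one toothpick so that the toothpicks form a square.")
    return "\n".join(lines)


# Three sides of a unit square plus one stray toothpick sticking out to the left.
layout = [
    ((0, 0), (1, 0)),
    ((1, 0), (1, 1)),
    ((1, 1), (0, 1)),
    ((0, 0), (-1, 0)),
]

prompt = describe_toothpicks(layout)
print(prompt)
```

The resulting string can be pasted into any text-only chat, so the model never has to recover the geometry from an image.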

I thought all the recent models are "multimodal"? Is the image part just sticking an image recognizer in front of the text model?
Most of those videos are ChatGPT voice mode, which still used GPT-4o last time I checked. It is far from SOTA.
As with coding or creating images or text, maybe the alternative of doing it yourself is too easy or enjoyable for you. Don't expect that to be true for everyone.
Did you reply to the wrong person? What are you even trying to say here?
You say you don't trust it, but what's your alternative, assuming you lost your vision?