I might be on board with LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit.
Also, even if they are better, you can have a flow where a cheap model handles most inputs and only the marginal cases go to something more complex (or a chain of these).
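A minimal sketch of that kind of flow, assuming each model returns a label plus a confidence score. The model functions and thresholds here are hypothetical stand-ins, not any real library's API:

```python
from typing import Callable, List, Tuple

# A stage maps an image to (label, confidence in [0, 1]).
Stage = Callable[[object], Tuple[str, float]]


def cascade(image, stages: List[Tuple[Stage, float]]) -> Tuple[str, float, int]:
    """Run stages cheapest-first and stop at the first confident answer.

    Each entry is (model, min_confidence). The last stage's answer is always
    accepted, since there is nothing left to escalate to.
    Returns (label, confidence, index of the stage that answered).
    """
    if not stages:
        raise ValueError("need at least one stage")
    for i, (model, min_conf) in enumerate(stages):
        label, conf = model(image)
        if conf >= min_conf or i == len(stages) - 1:
            return label, conf, i


# Hypothetical stand-ins: a small detector and a slower, bigger model.
def cheap_model(image):
    # Pretend the small model is unsure about "blurry" inputs.
    return ("widget", 0.55) if image == "blurry" else ("widget", 0.97)


def expensive_model(image):
    return ("defective widget", 0.90)


stages = [(cheap_model, 0.8), (expensive_model, 0.0)]
```

With these stand-ins, `cascade("sharp", stages)` is answered by stage 0 and never touches the expensive model, while `cascade("blurry", stages)` escalates to stage 1. How much this saves depends entirely on what fraction of real inputs the cheap stage can answer confidently.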
The YOLO models are really shockingly good for their cost, and they work well even without much training data.
Due to how simple they are to work with, they will become popular. Compare NLP before and after GPT-3. GPT-3 greatly brought down the complexity and skill needed for doing NLP tasks, even if traditional NLP is much, much faster. Ultimately ease of development will win out, and the industry will work towards optimizing such LLMs to make them cheap enough to run.
We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.
Even if they were an option, your 300ms latency requirement would exclude them anyway.
Some SBC with an industrial camera that is doing pick-and-place or go/no-go operations on a conveyor belt against a single object type doesn't need a huge image-gen/LLM model governing it.
I mean, have you even considered the kind of performance an OpenCV function can get with just mask matching? Even with a fancy YOLO model, these answers get thrown out in 1.5-50 ms; this is just a wholly different time scale.
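To put rough numbers on that time scale, here is a back-of-the-envelope sketch. It assumes frames are processed strictly one at a time, and it borrows the 300 ms latency requirement mentioned elsewhere in the thread as the budget; the function names are made up for illustration:

```python
def max_rate_hz(latency_ms: float) -> float:
    """Frames per second if each frame takes latency_ms, processed sequentially."""
    return 1000.0 / latency_ms


def budget_headroom(latency_ms: float, budget_ms: float = 300.0) -> float:
    """How many times over the inference would fit inside the latency budget."""
    return budget_ms / latency_ms


# The 1.5-50 ms range quoted above, against an assumed 300 ms budget.
for latency in (1.5, 50.0):
    print(f"{latency} ms -> {max_rate_hz(latency):.0f} fps, "
          f"{budget_headroom(latency):.0f}x inside a 300 ms budget")
```

So even the slow end of that range leaves several times the budget free for capture, actuation, and everything else on the line.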
It's a lot better, faster, and cheaper to use LLMs for initial labeling together with hand fine-tuning, and then train YOLO on the result.
Training YOLO takes a few hours, and the trained model is then very fast.
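A rough sketch of the labeling half of that workflow, assuming the LLM hands back pixel-space boxes with a confidence score (that proposal format, the threshold, and the function names are hypothetical). The output is the usual YOLO text label format, one `class x_center y_center width height` line per box with coordinates normalized to the image size:

```python
from typing import List, Tuple

# Hypothetical proposal shape: (class_id, x_min, y_min, x_max, y_max, confidence)
Proposal = Tuple[int, float, float, float, float, float]


def to_yolo_line(class_id: int, x_min: float, y_min: float,
                 x_max: float, y_max: float, img_w: int, img_h: int) -> str:
    """Convert a pixel-space box to a normalized YOLO label line."""
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_for_review(proposals: List[Proposal], min_conf: float = 0.9):
    """Keep confident proposals; queue the rest for a human to fix by hand."""
    accepted = [p for p in proposals if p[5] >= min_conf]
    needs_review = [p for p in proposals if p[5] < min_conf]
    return accepted, needs_review


# Two made-up proposals for a 640x480 image.
proposals = [(0, 100, 50, 300, 250, 0.96), (1, 10, 10, 60, 40, 0.41)]
accepted, needs_review = split_for_review(proposals)
lines = [to_yolo_line(*p[:5], img_w=640, img_h=480) for p in accepted]
```

Here only the first proposal is accepted automatically, and the low-confidence one lands in the hand-review queue. The resulting label files are what the YOLO training run would then consume.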
Like, the AI model tools already exist. All that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.
Dude, in business we think in terms of large numbers; internationally, that easily means processing images billions of times. This wouldn't cut it.
Also, do you buy the mega-expensive, individually designed shoes from the best shoemaker there is to march through some dirt, or do you simply stick to gumboots?
OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not an LLM anymore.
Is the image(text) function reversible? Or are they brute-force searching for a nearest neighbor, like word2vec or hash brute-forcing?