by leoncos 3 days ago

by nicolailolansen 14 hours ago

Once you can run a model like Nano Banana or another vision LLM with the same compute and time constraints as an OpenCV or YOLO call, you can make that comparison. Until then, I would not call YOLO and OpenCV outdated; that's simply wrong. There's a time and place for big V-LLMs, just as there is a time and place for more "traditional" computer vision methods.

by wongarsu 13 hours ago

I can get great results from a YOLO model with 30M to maybe 300M params. To get decent CV from an LLM, 8B params is the absolute minimum, and closer to 30B for interesting tasks.

I might be on board about LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit.

by IanCal 13 hours ago

They can, however, be extremely useful for curating training data. The same goes for things like SAM and the DINO / Grounding DINO models.

Also, if they are better, you can have a flow where a cheap model runs first and only the marginal cases go to a more complex one (or to a chain of these).

The YOLO models are really shockingly good for their cost, and for how well they work with not much training data.
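A minimal sketch of the cheap-model-first flow described above, in Python. The two model functions are hypothetical stand-ins (not real YOLO or vision-LLM calls), and the 0.8 confidence threshold is an arbitrary assumption:

```python
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])
Stage = Tuple[Callable[[Dict], Prediction], float]  # (model, accept threshold)


def cascade(image: Dict, stages: List[Stage]) -> Tuple[str, float, int]:
    """Run stages cheapest-first and stop at the first confident answer.

    Returns (label, confidence, index of the stage that answered).
    The last stage always answers, whatever its confidence.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for index, (model, threshold) in enumerate(stages):
        label, confidence = model(image)
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            return label, confidence, index
    raise AssertionError("unreachable")


# Hypothetical stand-ins: a small detector that is unsure on blurry input,
# and a slower, larger model used only as a fallback.
def cheap_detector(image: Dict) -> Prediction:
    return ("id_card", 0.95) if image["blur"] < 0.5 else ("id_card", 0.40)


def large_model(image: Dict) -> Prediction:
    return ("passport", 0.80)


stages = [(cheap_detector, 0.8), (large_model, 0.0)]
sharp_result = cascade({"blur": 0.1}, stages)   # answered by stage 0
blurry_result = cascade({"blur": 0.9}, stages)  # escalated to stage 1
```

Only the inputs the cheap model is unsure about pay the cost of the larger one, which is the point of chaining them.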

by charcircuit 10 hours ago

>for very limited benefit

Because they are so simple to work with, they will become popular. Compare NLP before and after GPT-3. GPT-3 majorly brought down the complexity and skill needed for doing NLP tasks, even if traditional NLP is much, much faster. Ultimately ease of development will win out, and the industry will work towards optimizing such LLMs until they are cheap enough to run.

by regularfry 13 hours ago

I've built hardware with a pi zero 2 + pi cam running a mildly fine-tuned YOLO doing local-only object detection as a USB-OTG device, in a use case where any off-device API calls would have been totally unacceptable, and where the object detection was part of the human interaction loop with a hard ceiling of 300ms on the total interaction time of which the object detection was only one process among many.

We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.

by Hendrikto 9 hours ago

> API calls just aren't on the menu

Even if they were an option, your 300ms latency requirement would exclude them anyway.

by mirsadm 13 hours ago

That is a very uninformed view. Real time CV is not going to be doing that anytime soon.

by sebmellen 13 hours ago

Great, let me know when those models can run on-server and process/analyze streams of ID images with less than 100ms of latency. You’ll need to make sure you have a massive set of training data including all manner of slightly blurred and slightly distorted ID cards.

by _the_inflator 10 hours ago

Exactly, and all on an embedded system with quite restrictive settings, not an overclocked latest-generation Intel CPU combined with NVIDIA's 10k graphics cards.

by charcircuit 10 hours ago

Embedded systems can make network calls to powerful, GPU-equipped servers.

by ceejayoz 8 hours ago

Sure. Claude does that. "Cogitated for 1m 50s" doesn't work for real-time applications.

by charcircuit 6 hours ago

You can submit many queries in parallel to increase throughput. Smaller models and faster hardware can reduce the time per query too.

by ceejayoz 5 hours ago

None of that gets you the 100ms response time the parent poster talked about, for something like "who is at my doorbell?" real-time uses.
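To make the throughput-versus-latency distinction concrete, here is a rough back-of-the-envelope sketch in Python. The 2000ms per-query time and the 64 parallel requests are made-up illustrative numbers; only the 100ms deadline comes from the thread:

```python
from typing import Tuple


def pipeline_stats(latency_ms: float, parallel_requests: int) -> Tuple[float, float]:
    """Return (throughput in requests per second, per-request latency in ms).

    Running requests in parallel multiplies throughput, but each individual
    request still takes the full latency_ms to come back.
    """
    throughput = parallel_requests * 1000.0 / latency_ms
    return throughput, latency_ms


def meets_deadline(latency_ms: float, deadline_ms: float) -> bool:
    """A real-time use only cares whether one answer arrives in time."""
    return latency_ms <= deadline_ms


single = pipeline_stats(2000.0, 1)     # (0.5, 2000.0)
parallel = pipeline_stats(2000.0, 64)  # (32.0, 2000.0)
in_time = meets_deadline(parallel[1], 100.0)  # False
```

Parallelism raises the first number 64-fold and leaves the second untouched, so the doorbell still waits two seconds.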

by sebmellen 5 hours ago

Ok. Claude will not work for this use case because none of the sample data (weirdly blurry ID images) is in the training data.

by Chu4eeno 8 hours ago

They really shouldn't, though.

by charcircuit 6 hours ago

It can offer a ton of user value. There is a whole industry built upon this idea: the Internet of Things.

by ceejayoz 5 hours ago

IoT wasn't built on "send all the data off to a hosted GenAI". It predated them by quite a few years.

by serf 3 days ago

do you realize how many edge or unconnected nodes do OpenCV work?

some SBC w/ an industrial camera that is doing pick-place or go/no-go operations on a conveyor belt against a singular object type doesn't need a huge image-gen/llm model governing it.

I mean, have you even considered the kind of performance an opencv function can get w/ just mask-matching? Even with a fancy YOLO model these answers get thrown out in 1.5-50ms; this is just a wholly different time scale.
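For a sense of how little logic a go/no-go mask check needs, here is a pure-Python sketch. It compares a binary mask against a reference by intersection-over-union; the tiny 4x4 masks and the 0.9 threshold are assumptions for illustration, and on real hardware this would be done with OpenCV array operations rather than Python loops:

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 pixels


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            intersection += 1 if (pixel_a and pixel_b) else 0
            union += 1 if (pixel_a or pixel_b) else 0
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def go_no_go(part: Mask, reference: Mask, threshold: float = 0.9) -> bool:
    """Accept the part if its mask overlaps the reference closely enough."""
    return mask_iou(part, reference) >= threshold


reference = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
undersized = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
accepted = go_no_go(reference, reference)   # identical masks pass
rejected = go_no_go(undersized, reference)  # IoU is 4/12, so this fails
```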

by Qhemlomo 11 hours ago

100,000 pictures take a lot of time with LLMs.

It's a lot better, faster, and cheaper to use LLMs for initial labeling, fix the labels up by hand, and then train YOLO on the result.

Training YOLO takes a few hours and is then very fast.
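The labeling step could look roughly like this in Python. It assumes the LLM returns pixel-space boxes with a confidence score (the dictionary layout and the 0.7 cutoff are hypothetical), sends low-confidence proposals to hand review, and writes the rest in the normalized `class x_center y_center width height` text format commonly used by YOLO tooling:

```python
from typing import Dict, List, Tuple


def to_yolo_line(class_id: int, x_min: float, y_min: float,
                 x_max: float, y_max: float,
                 img_w: int, img_h: int) -> str:
    """Convert a pixel-space box to a normalized YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def split_for_review(proposals: List[Dict],
                     min_confidence: float = 0.7) -> Tuple[List[Dict], List[Dict]]:
    """Split proposed labels into (auto-accepted, needs hand review)."""
    accepted = [p for p in proposals if p["confidence"] >= min_confidence]
    review = [p for p in proposals if p["confidence"] < min_confidence]
    return accepted, review


# Hypothetical LLM output for one 640x480 image.
proposals = [
    {"class_id": 0, "box": (100, 50, 300, 250), "confidence": 0.92},
    {"class_id": 1, "box": (10, 10, 40, 30), "confidence": 0.35},
]
accepted, review = split_for_review(proposals)
label_lines = [to_yolo_line(p["class_id"], *p["box"], 640, 480) for p in accepted]
```

The `review` list is where the hand finetuning happens; once corrected, those boxes go through `to_yolo_line` as well.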

by kryptiskt 13 hours ago

If I want to identify and measure the size of round things in my orange sorter machine, I shouldn't have to resort to an unnecessarily complicated solution just because some AI bros can't understand that not everything needs to be an AI model.

Like, the AI model tools already exist; all that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.

by _the_inflator 10 hours ago

"When I use..."

Dude, in business we think in terms of large numbers; internationally that easily means processing images billions of times. This wouldn't cut it.

Also, do you buy the mega expensive, individually designed shoes from the best shoemaker there is to march through some dirt, or do you simply stick to gumboots?

OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not just an LLM anymore.

by TZubiri 14 hours ago

I am confused, how can functions that output images help with functions that should take images as input?

by taneq 12 hours ago

They’re multimodal LLMs trained for image generation. Turns out that if you want to generate images you gotta know what things look like.

by TZubiri 11 hours ago

That's not helpful, my brother. If you have details, share them; if not, don't pretend you are more illuminated than me.

Is the image(text) function reversible? Or are they brute-force searching for a nearest neighbor, like word2vec or hash brute forcing?

by sorenjan 10 hours ago

Google recently released their paper "Image Generators are Generalist Vision Learners" about exactly this. They fine-tuned Nano Banana Pro into what they call Vision Banana, which can do segmentation etc.

https://arxiv.org/abs/2604.20329
