undefined

points

[-]

Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000, I’d hope Qwen3.6 didn’t loose this capability.

by withinrafael14 hours ago|

prev|

[-]

I've had lots of success with generating coordinates and answering questions using the UI-TARS model https://github.com/bytedance/UI-TARS.

by theturtletalks12 hours ago|

parent|

[-]

I’d also checkout midscene, you can set the model and UI-TARS works but you can also use qwen vision models and it works.

by cyanydeez16 hours ago|

prev|

[-]

This sounds a lot like another hacker news posted in the last few days. The same problem image generators have with a prompt like, produce numbers 1-50 in a spiral pattern and it can't count properly. But if you break it into a raster/vector where you have it first produce the visual content and then a SVG overlay, it's completely capable.

Have you tried doing a two step: review the image, then render a vector?

by julius16 hours ago|

parent|

[-]

Maybe there is a smart trick to get them to do the right thing, but the things I tried did not work.

At one point I had some smaller model draw bounding boxes around everything that looked interactable and labels like "e3" ... then asked the model to tell me "click on e3". Did not work in my tests was pretty much as bad as x,y.

by cyanydeez15 hours ago|

parent|

[-]

Yeah, I've held off on doing any kind of rag till there's models that properly handle layout detection and partitioning because it's so easy to generate shitty data if you're not properly attending to visual cues first before you slice up a document.