I think much of the visual gap exists because what to attend to in images is less structured. Anecdotally, small Qwen finetunes (i.e. under 10B parameters) take task accuracy from sub-30% on FMs to 90%. We have sold some of these for outcome-based back-office tasks.
I think we’ll see a lot of specialized VLMs that provide real value.