I have a lot of fun playing with local models and seeing what they can do.
I appreciate the SOTA models even more after my local experiments. The local models are really impressive these days, but the gap to SOTA is huge for complex tasks.
The economics of running SOTA-class models locally just don't make sense: you're not keeping the hardware busy 24/7 at 80%+ utilization, while cloud-based providers can.
The IQ2 quants that fit into 128GB machines are noticeably degraded compared to the full-precision weights.
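For intuition, here's a back-of-envelope sketch of why frontier-scale models only squeeze into 128GB after roughly 2-bit quantization. The ~2.06 bits/weight figure is an assumption for IQ2_XXS-style quants, and the 405B parameter count is just an illustrative frontier-scale size; real GGUF files add a few percent of overhead for quantization scales and metadata, and the KV cache is on top of this.

```python
# Rough memory estimate for a quantized model's weights.
# GB here means 10^9 bytes; actual files run slightly larger.
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

# A ~405B-parameter model at ~2.06 bits/weight (assumed IQ2_XXS-ish
# density) just fits under 128 GB of unified memory:
print(quant_size_gb(405, 2.06))  # ~104 GB
# ...while even 8-bit weights would need far more than 128 GB:
print(quant_size_gb(405, 8))     # 405 GB
```

So on a 128GB machine you're stuck near the 2-bit regime for models of that scale, which is exactly where quality falls off hardest.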
Of course, then it'll be "uhh, lemme know when Opus 6.8-level performance is available locally." People are never happy.
Gemma 4 and Qwen 3.6 are legit beast models that would steamroll every API offering from 2 years ago.