I'm still trying out all the models that dropped this month. I'm running qwen 3.6 35 a3b on an RTX 4060 Ti with 16 GB of VRAM.
I wish I had sprung for a 24 GB card, but I never thought the price difference would matter. It seems like it does, and I bet there will be more models at this size in the future, because this one is crazy.
It's not as good as Opus if you're doing completely hands-off programming, but it's completely fine for me. I mostly use it for autocomplete or templating out a class. Other people are using it for agentic workflows with success.
Check out /r/localllama for more experiences. My setup is not the best, but it's working for me and saving me money.
I've got a local setup too, but unless you consider the hardware to be free, there is really no way to save money. The class of model you can run on <$5k of hardware is dirt cheap to run in the cloud: generating tokens 24/7 non-stop costs a few dollars a day at most, possibly even less than the electricity to do it at home.
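To put rough numbers on that claim, here's a back-of-envelope sketch. Every figure in it is an assumption for illustration (throughput, cloud price per million tokens, GPU power draw, electricity rate), not anyone's actual bill:

```python
# Back-of-envelope: cloud API cost vs. home electricity cost for
# running a small open-weight model flat-out, 24/7.
# All constants below are assumed illustrative values.

TOKENS_PER_SEC = 40            # assumed sustained throughput on a mid-range GPU
SECONDS_PER_DAY = 24 * 3600
tokens_per_day = TOKENS_PER_SEC * SECONDS_PER_DAY  # ~3.46M tokens/day

CLOUD_PRICE_PER_MTOK = 0.30    # assumed $/1M output tokens for a small model
cloud_cost_per_day = tokens_per_day / 1e6 * CLOUD_PRICE_PER_MTOK

GPU_WATTS = 200                # assumed GPU power draw under load
PRICE_PER_KWH = 0.30           # assumed residential electricity rate, $/kWh
electricity_per_day = GPU_WATTS / 1000 * 24 * PRICE_PER_KWH

print(f"tokens/day:        {tokens_per_day / 1e6:.2f}M")
print(f"cloud cost/day:    ${cloud_cost_per_day:.2f}")
print(f"electricity/day:   ${electricity_per_day:.2f}")
```

With these particular numbers the cloud comes out around a dollar a day while the electricity alone is more, which is the point: the comparison is sensitive to your local power rate and the model's price tier, but it's never going to amortize a multi-thousand-dollar card.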
They are trying to go public and will get absolutely bitchslapped by SOX.