I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
The number of parameters needed for these open-weight models has mostly stabilized, so the actual memory requirements aren't likely to change all that much.
TPS ≈ your memory bandwidth in GB/s / active weights in GB.
That’s it for decode. That’s all.
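A quick sketch of that arithmetic, assuming decode is purely memory-bandwidth bound: every generated token has to read all the active weights once, so the ceiling on tokens per second is bandwidth divided by active weight size. The function name and the example numbers (32B active parameters, 4-bit weights, 800 GB/s) are made up for illustration, not taken from any particular model or GPU, and real throughput will land below this because of KV cache reads and other overhead.

```python
def decode_tps_upper_bound(active_params_billions: float,
                           bytes_per_param: float,
                           bandwidth_gb_per_s: float) -> float:
    """Rough ceiling on decode tokens/sec when memory bandwidth is the bottleneck."""
    # Size of the weights that must be read for every generated token, in GB.
    active_weights_gb = active_params_billions * bytes_per_param
    # Each token needs one full pass over those weights.
    return bandwidth_gb_per_s / active_weights_gb


# Hypothetical numbers: 32B active params, 4-bit weights (0.5 bytes each), 800 GB/s.
tps = decode_tps_upper_bound(32, 0.5, 800)
print(f"~{tps:.0f} tokens/s at best")
```

For a mixture-of-experts model, plug in the active parameter count per token rather than the total, since only those weights get read on each step.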