Some obvious reasons you'd want to spend the capital on this: you're building some kind of autonomous system that needs to be offline periodically, or you need complete confidentiality about what you're using the model for, etc.

To be cost-competitive with inference providers, you have to find some way to keep your own hardware in use 24/7.
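
A rough sketch of that break-even arithmetic, in case it helps. Every number here is made up for illustration, and the function name and cost model (amortized hardware cost plus power while busy, compared against a per-token API price) are my own assumptions, not anyone's real pricing:

```python
def break_even_utilization(hardware_cost, lifetime_hours,
                           power_cost_per_busy_hour,
                           tokens_per_busy_hour,
                           api_price_per_million_tokens):
    """Fraction of wall-clock time the box must be busy to match API pricing.

    Hypothetical model: the hardware cost is spread evenly over its lifetime
    (paid whether the box is busy or idle); power is only paid while busy.
    Returns None if the API is cheaper than your power bill alone.
    A result above 1.0 means even 24/7 use does not break even.
    """
    amortized_per_hour = hardware_cost / lifetime_hours
    # What the same hour of output would have cost at the provider.
    api_value_per_busy_hour = (
        tokens_per_busy_hour / 1_000_000 * api_price_per_million_tokens
    )
    margin = api_value_per_busy_hour - power_cost_per_busy_hour
    if margin <= 0:
        return None
    return amortized_per_hour / margin


# Illustrative numbers only: a $26,280 box amortized over 3 years,
# $0.50/hour power while busy, 2M tokens/hour, $1.25 per million tokens.
utilization = break_even_utilization(
    hardware_cost=26_280,
    lifetime_hours=3 * 365 * 24,
    power_cost_per_busy_hour=0.50,
    tokens_per_busy_hour=2_000_000,
    api_price_per_million_tokens=1.25,
)
print(f"break-even utilization: {utilization:.0%}")
```

The lower the provider's price per token, the closer the required utilization gets to (or past) 100%, which is the "24/7" point above.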

by Der_Einzige 5 hours ago

The ecosystem for inference is centralized around a few core projects, i.e. vLLM, SGLang, and llama.cpp.

If they decided to collude, they could absolutely say "from now on you no longer have access to model X because you're an asshole."

The commercial inference offerings are also downstream of one of those 3 projects (or TensorRT-LLM if they're NVIDIA). It would impact Ollama, Fireworks, Together, and everyone else.

Don't tempt fate.