Surely what most people want is just an LLM provider that doesn't store or sell their queries (including by national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.

by matheusmoreira 8 hours ago

> LLM provider that doesn't store or sell their queries

> As long as that is allowed to happen

It won't be. Only we can provide that, and only for ourselves.

by eventualcomp 20 hours ago

Where is $50k coming from again?

by stingraycharles 20 hours ago

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, it probably pays for itself in about a year.

Having said that, I don’t think it’s all that tempting for companies, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be in a year or two.

by cogman10 20 hours ago

The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities; if anything, this solution just gets better with time.

by chatmasta 19 hours ago

If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

by cogman10 18 hours ago

The newer models don't require more/better hardware. There's a small army of local LLM enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Their age isn't really that big of an issue, as the compute power needed is relatively low, all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

by dannyw 16 hours ago

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS ≈ your memory bandwidth in GB/s / active weights in GB.

That’s it for decode. That’s all.
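A rough sketch of that rule of thumb, in case it helps: each decoded token has to stream the active weights from memory once, so single-stream throughput is capped at roughly bandwidth divided by active weight size. The function name and the example numbers below are made up for illustration, and the result is a ceiling that ignores KV-cache reads, batching and other overheads.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough ceiling on decode tokens/second for a single stream.

    Assumes decode is purely memory-bandwidth bound: every token
    reads all active weights from memory exactly once.
    """
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb


# Hypothetical numbers, not measurements of any particular card or model:
# 1000 GB/s of memory bandwidth and 20 GB of active weights per token.
print(decode_tps_upper_bound(1000.0, 20.0))  # 50.0
```

Halving the active weights (smaller quantization, or fewer active parameters in an MoE model) doubles the ceiling, which is why bandwidth matters more here than raw compute.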

by Tepix 5 hours ago

$50K seems low if you want to run, say, GLM 5.2 4bit fast enough for a team of devs.

You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.
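Spelling out that arithmetic as a quick sketch (the prices are the ones quoted above, the helper name is just for illustration, and electricity is left out):

```python
def build_cost(gpu_count: int, gpu_price_usd: int, server_price_usd: int) -> int:
    """Up-front hardware cost: the GPUs plus the host server."""
    return gpu_count * gpu_price_usd + server_price_usd


# Figures from the comment above: 6 cards at $11,800 each plus a $10,000 server.
total = build_cost(gpu_count=6, gpu_price_usd=11_800, server_price_usd=10_000)
print(total)  # 80800
```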

by theYipster 1 hour ago

You don't need all of the model in VRAM. 1 or 2 RTX Pro 6000s will do. $50K will get you there very nicely, and on a 1600 watt PSU if you go for the Max-Q versions. (That's the same wattage as the PSU in the machine I'm typing this on, which I've been using for the last 5 years.)

by cogman10 20 hours ago

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": it's a ballpark estimate from what I know about part costs. Most of that money would go towards AI cards. About $5k is for the motherboard, CPU, power supply, etc. The other $45k would be for as much RAM and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it, which is probably what you'd be after.