Ollama Cloud has a $20-a-month subscription. They say they retain no data, and billing is by GPU time rather than per token.

by throwawayffffas 11 hours ago

Kind of. You need at least 256 GB of RAM and 24-40 GB of VRAM to run the 2-bit quantization. Because it's an MoE, you just need the active experts to fit in VRAM to get a significant improvement over a pure CPU setup. At 2 bits, though, expect significant quality loss.
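As a rough sanity check on the memory figures, weight storage scales as parameter count times bits per weight, divided by 8. The sketch below assumes the 768B parameter count quoted elsewhere in this thread, and it ignores KV cache, activations, and quantization overhead such as scales and zero points, so real footprints will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 768B is the parameter count mentioned in this thread; treat it as an assumption.
PARAMS_BILLION = 768

for bits in (2, 4, 8):
    print(f"{bits}-bit: ~{weight_memory_gb(PARAMS_BILLION, bits):.0f} GB")
```

At 2 bits the weights alone come to roughly 192 GB under these assumptions, which is in line with needing about 256 GB of total memory once runtime overhead is included.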

by Roark66 6 hours ago

2bits is a joke for serious work. You'd probably be better off with Qwen3.6 at under 30 GB.

But there are EU-only providers for GLM5.2, for example tensorx. Depending on your definition of "secure", that may be acceptable.

by throwawayffffas 5 hours ago

> 2bits is a joke for serious work.

I have not tried it, but I will take your word for it. I don't think Qwen3.6 cuts it for large-scale coding work. Reading issues and reading code, sure, but biting into large issues, no: it consistently goes off track.

Depending on budget, it may also be affordable to spin up servers to run it on demand.

by villish 12 hours ago

You'd need to multiply that $10k by 8 minimum.

by naasking 4 hours ago

Why? 4x DGX Spark should be enough. That's way less than $80k.

by throwawayffffas 3 hours ago

From a quick Google search, a DGX Spark seems to decode Llama 3.1 70B (FP8) at 2 tokens per second. I would expect the performance on a 768B parameter model spread across 4 of them to be significantly lower, even though it's a mixture of experts.

For real work, anything below 60 tokens per second is essentially unusable. And that's not taking prompt processing into account: Llama 3.1 70B on a DGX Spark processes prompts at about 800 tps, and at that speed filling a 512k context takes about 11 minutes.
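The arithmetic behind that estimate can be sketched as follows. The 800 tps and 2 tps figures are the ones quoted above; reading "512k" as 512 × 1024 tokens and using a 1,000-token reply for the decode comparison are assumptions made only for illustration.

```python
def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Time to push a given number of tokens through at a fixed rate."""
    return tokens / tokens_per_second


context_tokens = 512 * 1024   # "512k" read as 524,288 tokens (assumption)
prefill_tps = 800             # prompt-processing rate quoted in the comment
decode_tps = 2                # decode rate quoted in the comment
usable_tps = 60               # the comment's threshold for "real work"
reply_tokens = 1_000          # hypothetical reply length, for illustration only

prefill_minutes = seconds_to_process(context_tokens, prefill_tps) / 60
slow_reply_minutes = seconds_to_process(reply_tokens, decode_tps) / 60
fast_reply_seconds = seconds_to_process(reply_tokens, usable_tps)

print(f"prefill of full context: ~{prefill_minutes:.1f} min")
print(f"{reply_tokens}-token reply at {decode_tps} tps: ~{slow_reply_minutes:.1f} min")
print(f"{reply_tokens}-token reply at {usable_tps} tps: ~{fast_reply_seconds:.1f} s")
```

Under these assumptions the prefill alone takes just under 11 minutes, and a single 1,000-token reply at 2 tps takes over 8 minutes, compared with under 20 seconds at 60 tps.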
