What would be these additional vllm flags, if you don't mind sharing?
reply
This is from an example from my Nomad cluster with two A5000s, which is a bit different from what I have at work, but it should mostly apply to any modern 24 GB VRAM NVIDIA GPU.

"--tensor-parallel-size", "2" - spread the LLM weights over 2 GPU's available

"--max-model-len", "90000" - I've capped context window from ~256k to 90k. It allows us to have more concurrency and for our use cases it is enough.

"--kv-cache-dtype", "fp8_e4m3", - On an L4 cuts KV cache size in half without a noticeable drop in quality, does not work on a5000, as it has no support for native FP8. Use "auto" to see what works for your gpu or try "tq3" once vllm people merge into the nightly.

"--enable-prefix-caching" - Improves time to first output.

"--speculative-config", "{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}", - Speculative mutli-token prediction. Qwen3.5 specific feature. In some cases provides a speedup of up to 40%.

"--language-model-only" - does not load vision encoder. Since we are using just the LLM part of the model. Frees up some VRAM.

reply
> "--speculative-config",

Regarding that last option: speculation helps max concurrency when it replaces many memory-expensive serial decode rounds with fewer verifier rounds, and the proposer is cheap enough. It hurts when you are already compute-saturated or the acceptance rate is too low. It's a good idea to benchmark your workload with and without speculative decoding.
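The acceptance-rate dependence can be quantified with the standard speculative-decoding estimate: drafting k tokens with per-token acceptance probability p yields, on average, 1 + p + p² + ... + pᵏ tokens per verifier pass (this assumes acceptances are i.i.d., which real workloads only approximate):

```python
# Expected tokens produced per target-model verification pass when
# drafting k speculative tokens, assuming each draft token is
# accepted independently with probability p.
def expected_tokens(k: int, p: float) -> float:
    # Geometric series 1 + p + p^2 + ... + p^k: the accepted prefix
    # plus the one token the verifier emits itself.
    return sum(p**i for i in range(k + 1))

# With 2 draft tokens and a healthy ~70% acceptance rate, each
# verify pass yields ~2.19 tokens instead of 1.
print(expected_tokens(2, 0.7))
# At ~20% acceptance the drafting overhead buys very little.
print(expected_tokens(2, 0.2))
```

Whether that translates into wall-clock speedup still depends on how cheap the proposer is relative to the verifier, hence the advice to benchmark both ways.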

reply
Just curious, what's your setup like? How do the devs interact with the model?
reply
question: why not use something like Claude? is it for security reasons?
reply
Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, or tweaks harnesses and prompts in ways that cause the model to consume too many tokens, or too few to complete the task, etc.

I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user-hostile ways?

reply
I think we're seeing very clearly that the problem with the Cloud (as usual) is that it locks you into a service that only functions as long as the Cloud provides it.

But further, as we're seeing with Claude: your workflow, or backend, or both, aren't going anywhere if you're building on local models. They don't suddenly become dumb, stop responding, claim censorship, etc. Things are non-deterministic enough already that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.

So yeah, privacy, but also knowing you don't have to constantly upgrade to another model because a provider forces you to, when whatever you're doing is perfectly suitable: that's an untold amount of value. Imagine the early npm ecosystem, but driven now by AI model FOMO.

reply
We do make Claude and Mistral available to our developers too. But, like you said, security. I personally do not understand how people in tech put any amount of trust in businesses operating in such a cutthroat and corrupt environment. But developers want to try new things, and it is better to set up reasonable guardrails for when they want to use these things: an internal gateway and a set of reasonable policies.

And the other thing is that I want people to be able to experiment and get familiar with LLMs without being concerned about security, price, or any other factor.

reply
Because it's a great tool and the second it's not we can just do what you're doing :)
reply