upvote
> "--speculative-config",

Regarding that last option: speculation helps max concurrency when it replaces many memory-expensive serial decode rounds with fewer verifier rounds, and the proposer is cheap enough. It hurts when you are already compute-saturated or the acceptance rate is too low. Good idea to benchmark a workload with and without speculative decoding.

reply