They work better for coding workloads. Essentially, the more regular the output, the more tokens the faster draft model gets right, and the less the big model has to do.

Writing tends to have more false positives. I haven't tried this particular one, but that is the general trend.

by gcr 5 hours ago

Speculative decoding shouldn't actually change the accuracy of the response. The draft model drafts a couple of tokens, and the inference framework verifies that the larger model would have picked them, discarding any it would not have.
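To make the draft-then-verify loop concrete, here is a minimal sketch under greedy decoding. Everything in it is an assumption for illustration: `toy_target` and `toy_draft` are made-up stand-ins for real models, and `speculative_step` is a hypothetical helper, not any framework's API. A real framework scores all drafted positions in one batched forward pass of the large model; the per-position calls below are only for readability.

```python
from typing import Callable, List

# Hypothetical model type: maps a token context to the next token (greedy decoding).
NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Made-up stand-in for the large model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Made-up stand-in for the draft model: agrees with the target most of the time."""
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 3 == 0 else guess


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed token against its own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: emit the target's token and drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens with speculative decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def plain_generate(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Generate n_tokens with the target model alone, one token per call."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]
```

In this sketch every emitted token is one the target would have chosen itself, so `generate` and `plain_generate` return the same sequence; the draft only changes how many target passes are needed. Sampling-based variants use a probabilistic accept/reject rule instead of the exact-match check shown here.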

However, I've found that speculative decoders don't help much if you're running a model locally on limited hardware (for instance, my 32GB VRAM M1 Max from 2021). For one, you have to fit both the large and the small drafter model in memory. For another, if you're running a quantized model, the activation distribution is different enough that the draft model has a hard time guessing what's coming next.
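The memory constraint can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate only: the model sizes, bit widths, and the `overhead_gib` allowance are hypothetical numbers chosen for illustration, and it counts weights only (no KV cache or activations beyond that flat allowance).

```python
from typing import List, Tuple


def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB (weights only)."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 2**30


def fits(budget_gib: float, models: List[Tuple[float, float]],
         overhead_gib: float = 0.0) -> bool:
    """Check whether all models plus a flat overhead fit in the memory budget.

    models is a list of (parameters in billions, bits per weight) pairs.
    """
    total = sum(weight_gib(p, b) for p, b in models) + overhead_gib
    return total <= budget_gib


# Hypothetical setup: a 30B target plus a 1B 16-bit draft on a 32 GiB machine.
quantized_ok = fits(32, [(30, 4), (1, 16)], overhead_gib=4)    # 4-bit target
full_precision_ok = fits(32, [(30, 16), (1, 16)])              # 16-bit target
```

With these made-up numbers the 4-bit target and its draft fit in the budget, while the 16-bit target does not fit even before counting the draft.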

My take is that speculative decoding is most useful on _very expensive_ prosumer/hobbyist setups where you have 128GB of VRAM and are running your local models with full fidelity. It's also helpful for inference providers where they can send output tokens at a computational cost slightly higher than their input token cost.

by NitpickLawyer 5 hours ago

Your experience might be a bit dated, depending on when you last tried it. MTP (which is a flavor of spec decoding) is showing really solid improvements on local models, even on consumer hardware.

In fact, as the article mentions, you get the biggest gains at low concurrency (so local should apply), with diminishing returns at higher concurrency (if you think per unit of compute, it's probably better to serve more requests in parallel and get more throughput that way).

Eagle3 was great at low context tho, and this seems to improve things at high context. That's really cool, and hopefully it'll turn out to be useful at those lengths. Eagle3 is also training dependent, so you could try training your own if your use cases diverge enough that 3rd party "generalist" models don't suit your needs. (In general nvda, redhat, etc. have provided general eagle3 models for popular families.)

by tssge 2 hours ago

Speculative decoding shows diminishing returns in batched workloads because both rely on the same principle.

Speculative decoding predicts a group of tokens and verifies this group using the main model in one pass instead of decoding each token separately. That is, the weights are loaded from RAM once per group instead of once per token: roughly the same computation is performed, but with less memory movement (and less of other overhead like kernel launches).

Batching utilizes the same mechanism, so speculative decoding is essentially an attempt to batch a single stream using prediction. An attempt, because the verification may reject some tokens if the prediction was inaccurate.
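A small calculation shows how rejection limits the gain. It is a simplified model, not a benchmark: it assumes each drafted token is accepted independently with the same probability `accept_rate`, ignores the draft model's own cost, and uses hypothetical function names. Under that assumption, the expected number of tokens emitted per verification pass with `k` drafted tokens is (1 - a^(k+1)) / (1 - a).

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per main-model pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the pass always yields at least one token from the main model.
    """
    if accept_rate == 1.0:
        return float(k + 1)
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


def weight_loads_per_token_speculative(accept_rate: float, k: int) -> float:
    """Main-model weight loads per emitted token for a single speculative stream."""
    return 1.0 / expected_tokens_per_pass(accept_rate, k)


def weight_loads_per_token_batched(batch_size: int) -> float:
    """Weight loads per emitted token when batch_size streams share each pass."""
    return 1.0 / batch_size
```

With `accept_rate` at 0 this reduces to one weight load per token, the same as plain decoding; with `accept_rate` at 1 it matches a batch of `k + 1` streams. Real acceptance rates fall in between, which is the "attempt" part: a batch of real requests never has its tokens rejected.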

by nivekney 6 hours ago

I think so; the benchmark is on a coding dataset (SPEED-Bench).