undefined

points

[-]

As far as I know, speculative decoding still verifies that the proposed tokens are what the "big" model would generate, it just uses the guesses to make that process faster. Setting the probability threshold too low then shouldn't affect correctness, just speed (time will be wasted verifying bad guesses).

by lreeves5 hours ago|

parent|

[-]

But won't setting it to accept 100% of the proposed tokens will skip the verification?

by ac294 hours ago|

prev|

[-]

None of those settings set the speculative decoder to accept 100% of drafted token. I assume you are looking at --draft-p-min 0.0, if so, you are misunderstanding what it does.

by naasking5 hours ago|

prev|

[-]

It depends on the type of MTP. If you're using two models, draft + full, then arguably yes, the larger model isn't providing much benefit if you really are seeing 100% acceptance rates. There are other forms of speculative decoding that work within the larger model by itself though, eg. Qwen has additional speculative decoding attention heads, so there is no secondary drafting model.