undefined

points

by firefly20004 hours ago|

[-]

If the workload were perfectly parallelizable, your claim would be true. However, if it has serial dependency chains, it is absolutely worth it to compute it quickly and unreliably and verify in parallel

by magicalhippo2 hours ago|

parent|

[-]

This is exactly what speculative decoding for LLMs do, and it can yield a nice boost.

Small, hence fast, model predicts next tokens serially. Then a batch of tokens are validated by the main model in parallel. If there is a missmatch you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model and restart speculation from that.

If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.

by firefly20002 hours ago|

parent|

[-]

I have a question about what "validation" means exactly. Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft? Wondering if there is a better method that preserves the distribution of the main model.

by ant6n1 hours ago|

prev|

[-]

You can verify in 100-way parallel and without dependence, but you can’t do it with general computation.

by moffkalast6 hours ago|

prev|

[-]

Haha, well do you have a point there. I guess I had the P!=NP kind of verification in my head, where it's easy to check if something is right, but not as easy to compute the result. If one could make these verifiers on some kind of checksum basis or something it might still make sense, but I'm not sure if that's possible.