If so, then:
That seems like it would slow the ultimate computation to no more than the rate at which these computations can be verified.
That makes the verifier the ultimate bottleneck, and the other (fast, expensive -- like an NHRA drag car) pipeline becomes vestigial since it can't be trusted anyway.
So we have 20 verifiers running at 500MHz, and this stack of verifiers is trustworthy. It does reliably-good work.
We also have a single 10GHz CPU core, and this CPU core is not trustworthy. It does spotty work (hence the verifiers).
And both of these things (the stack of verifiers, the single CPU core) peak out at exactly the same computational speeds. (Because otherwise, the CPU's output can't be verified.)
Sounds great! Except I can get even better performance from this system by just skipping the 10GHz CPU core, and doing all the work on the verifiers instead.
("Even better"? Yep. Unlike that glitch-ass CPU core, the verifiers' output is trustworthy. And the verifiers accomplish this reliable work without that extra step of occasionally wasting clock cycles to get things wrong.
If we know what the right answer is, then we already know the right answer. We don't need to have Mr. Spaz compute it in parallel -- or at all.)
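(Back-of-the-envelope, using the numbers above and the analogy's own assumption that verifying a result costs the same clock cycles as computing it:)

    verifiers = 20
    verifier_clock_ghz = 0.5            # 500 MHz each
    cpu_clock_ghz = 10.0

    # Under that assumption, the aggregate verifier clock budget exactly
    # matches the single fast core, which is why the fast core looks redundant.
    aggregate_ghz = verifiers * verifier_clock_ghz
    print(aggregate_ghz, cpu_clock_ghz, aggregate_ghz == cpu_clock_ghz)   # 10.0 10.0 True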
A small, hence fast, model predicts the next tokens serially. Then a batch of tokens is validated by the main model in parallel. If there is a mismatch, you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model, and restart speculation from there.
If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.
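A toy sketch of that loop (illustrative only: draft_next_token and target_next_token below are made-up stand-ins for the small and main models, and a real system would check the whole draft in one batched forward pass of the main model rather than a Python loop):

    def draft_next_token(context):
        # Small, fast model: guesses the next token from the context.
        # Here it is just a cheap deterministic rule for demo purposes.
        return (sum(context) + 1) % 10 if context else 1

    def target_next_token(context):
        # Large, slow model: the ground truth we want to match.
        # Deliberately disagrees with the draft some of the time.
        return (sum(context) + 1) % 10 if sum(context) % 3 else (sum(context) + 2) % 10

    def speculative_decode(prompt, num_tokens, k=4):
        """Generate num_tokens tokens, speculating k at a time."""
        out = list(prompt)
        while len(out) - len(prompt) < num_tokens:
            # 1. Draft model proposes k tokens serially (cheap).
            draft = []
            ctx = list(out)
            for _ in range(k):
                t = draft_next_token(ctx)
                draft.append(t)
                ctx.append(t)

            # 2. Main model verifies the k positions "in parallel"
            #    (one batched forward pass in a real system).
            verified = []
            ctx = list(out)
            for t in draft:
                correct = target_next_token(ctx)
                if correct == t:
                    verified.append(t)          # accept the speculated token
                    ctx.append(t)
                else:
                    verified.append(correct)    # take the main model's token instead...
                    break                       # ...and discard the rest of the draft
            out.extend(verified)
        return out[len(prompt):len(prompt) + num_tokens]

    print(speculative_decode([3, 1, 4], num_tokens=8))

Running it, the first draft batch is accepted whole and a later one gets cut at the first mismatch, which is exactly the accept/reject behavior described above.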
The only problem here is that reliability is a statistical thing. You might be lucky, you might not.