Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM
so there is alwasy a maximum limit for how well MTP can do.
- persistent CUDA kernel
- tiled processing with overlapping read/writes
- model designed with specific constraints in mind