        GPU0    GPU1
GPU0     X      CNS
GPU1    CNS      X
I guess not; I use llama.cpp with:
--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1
and my generation speed is between 60 and 80 tk/s.
This weekend I'll test this uncensored model, with ngram speculation added as well.
BTW, I also set my power limit to 220 W per card (with nvidia-smi). That costs around 1 tk/s but saves you a LOT of power and heat :)
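If you want to script the power-limit step for both cards, here is a minimal sketch, assuming two GPUs with indices 0 and 1 and Python 3.8+. The helper names `power_limit_commands` and `apply_power_limits` are hypothetical; the only real interface used is `nvidia-smi -i <id> -pl <watts>`. By default it only prints the commands (a dry run), because actually changing the limit needs root and the value has to fall within the range your card supports.

```python
import shlex
import subprocess


def power_limit_commands(gpu_ids, watts):
    """Build one nvidia-smi command per GPU that sets its power limit to `watts`.

    Hypothetical helper for illustration; it only builds argument lists.
    """
    return [["nvidia-smi", "-i", str(gpu), "-pl", str(watts)] for gpu in gpu_ids]


def apply_power_limits(gpu_ids, watts, dry_run=True):
    """Print each command and, only when dry_run is False, execute it.

    Executing requires root and a watt value the card accepts.
    """
    cmds = power_limit_commands(gpu_ids, watts)
    for cmd in cmds:
        print(shlex.join(cmd))  # e.g. "nvidia-smi -i 0 -pl 220"
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if nvidia-smi rejects it
    return cmds


if __name__ == "__main__":
    # Assumed setup: two cards (GPU0, GPU1), each capped at 220 W.
    apply_power_limits([0, 1], 220)
```

Run it as root with `dry_run=False` to actually apply the limits. The setting is typically reset on reboot, so you may want to re-apply it from a startup script.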