For real work, anything below 60 tokens per second is essentially unusable. And that's before taking prompt filling into account: Llama 3.1 70B on a DGX Spark runs prompt fill at about 800 tps, and at that speed filling a 512k context takes around 11 minutes.
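The arithmetic behind that 11-minute figure, as a rough sketch. The function name `prefill_minutes` is made up for illustration, and I'm assuming "512k" means 512,000 tokens and that the 800 tps rate stays constant over the whole prompt (in practice it tends to drop as the context grows, so this is probably optimistic).

```python
def prefill_minutes(context_tokens: int, prefill_tps: float) -> float:
    """Minutes needed to process a prompt of `context_tokens` tokens
    at a constant prompt-processing rate of `prefill_tps` tokens/second."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    seconds = context_tokens / prefill_tps
    return seconds / 60


if __name__ == "__main__":
    # Assumed values: 512,000-token prompt, 800 tokens/second prefill.
    print(round(prefill_minutes(512_000, 800), 1))  # 10.7
```

If you read 512k as 524,288 tokens instead, it comes out to about 10.9 minutes, so "like 11 minutes" holds either way.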