How does this work with scaling?

I assume you can then somehow run several hundred prompts concurrently?

You can get 1M context with the lukealonso NVFP4 quant on 8x RTX6000s, and it remains coherent and useful through at least 400k. No real need to run 8x H200s unless you just want to, or unless you need to serve many concurrent users or agents on a regular basis.