It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it, llama.cpp is still the better fit.
The Spark/GX10 route is a genuinely different bet though, and I appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers people reported were very low.
...and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).
But mostly I wanted to make readers of your article aware that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.
Regardless - thanks for putting in the effort to share your findings! I keep postponing doing the same... there are tons of things everyone is re-discovering on their own.