You should run a multi-session batched decode on that DGX unless your 13 tok/s decode is already running into thermal or power limits, which I don't believe it is. (To be clear, this is a real issue on Apple Silicon machines: batched decode does not seem to unlock higher aggregate tok/s there, so it only pays off if you're specifically trying to mitigate the drawbacks of slow streamed inference. On the M5 laptops especially, thermal and power throttling put an early cap on your total compute.
The jury is still out on Strix Halo, but I think batched decode may turn out to be quite useful there, since its bandwidth bottleneck is even more constraining.)
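
For intuition on why batching helps a bandwidth-bound decode, here is a rough roofline-style sketch. Everything in it is an assumption for illustration: the function name is made up, the model ignores KV-cache traffic and scheduling overhead, and the hardware numbers are placeholders picked so that batch 1 lands near 13 tok/s, not measurements of any DGX, Mac, or Strix Halo box.

```python
def aggregate_tok_per_s(batch, weight_bytes, mem_bw, flops_per_token, peak_flops):
    """Rough estimate of aggregate decode tok/s across `batch` sessions.

    Simplifying assumptions: weights are read once per decode step and
    shared by every sequence in the batch; KV-cache reads are ignored;
    compute cost scales linearly with batch size.
    """
    t_mem = weight_bytes / mem_bw                    # time to stream the weights once
    t_compute = batch * flops_per_token / peak_flops # time for the matmuls
    step_time = max(t_mem, t_compute)                # the slower side sets the pace
    return batch / step_time                         # one token per session per step


# Placeholder numbers (hypothetical, not measured):
WEIGHT_BYTES = 20e9      # ~20 GB of quantized weights
MEM_BW = 260e9           # 260 GB/s memory bandwidth
FLOPS_PER_TOKEN = 80e9   # ~2 FLOPs per parameter for a ~40B-parameter model
PEAK_FLOPS = 50e12       # sustained compute; throttling would lower this

if __name__ == "__main__":
    for b in (1, 8, 64):
        tps = aggregate_tok_per_s(b, WEIGHT_BYTES, MEM_BW, FLOPS_PER_TOKEN, PEAK_FLOPS)
        print(f"batch={b:3d}  aggregate ~ {tps:.0f} tok/s")
```

In this toy model, aggregate throughput grows roughly linearly with batch size until the compute term overtakes the memory term. Thermal or power throttling amounts to a lower `PEAK_FLOPS`, which pulls that crossover to a smaller batch and is one way to read the Apple Silicon caveat above.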