Even if you eliminate every other source of non-determinism, you still won't get consistent results, because of floating-point rounding and instruction scheduling on the GPU. Floating-point addition is not associative, so the order in which partial results get combined changes the low-order bits of the answer. And there is no way to guarantee that the GPU pipelines will execute your instructions in exactly the order you want them executed: GPUs are now essentially equivalent to sufficiently smart compilers and perform all sorts of clever instruction reordering behind the scenes. Expecting complete reproducibility at scale is a pipe dream.
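To make the rounding point concrete, here is a minimal sketch in plain Python on the CPU (not GPU code). It assumes IEEE 754 double precision, which is what Python floats use on common platforms; `reduce_sum` and the sample values are made up for illustration. It adds the same three numbers in two different orders, which is roughly what happens when a parallel reduction combines partial sums in a different sequence from run to run.

```python
def reduce_sum(values):
    """Sum left to right; every addition rounds to the nearest double."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical inputs chosen so the rounding is easy to see:
# near 1e16 the spacing between adjacent doubles is 2.0, so adding 1.0 is lost.
values = [1e16, 1.0, -1e16]

forward = reduce_sum(values)                                # (1e16 + 1.0) + -1e16
reordered = reduce_sum([values[0], values[2], values[1]])   # (1e16 + -1e16) + 1.0

print(forward, reordered)    # same inputs, different order
print(forward == reordered)  # False
```

The two sums differ by 1.0 even though the inputs are identical. On real workloads the discrepancy is usually in the last few bits, but it can compound across many steps.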