Is this a thing? I read an article about how due to some implementation detail of GPUs, you don't actually get deterministic outputs even with temp 0.
But I don't understand that, and haven't experimented with it myself.
The difference mainly comes from the order in which parallel reductions accumulate their partial sums: floating-point addition isn't associative, so a different summation order can round differently.
It only makes a small difference, unless you have an unstable floating-point algorithm; and if you have an unstable floating-point algorithm running on a GPU at low precision, you were doomed from the start.
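A minimal sketch of the underlying effect (not from the thread, just an illustration): summing the same numbers in a different order can produce slightly different results, which is what a GPU reduction hits when the accumulation order changes between runs.

```python
import random

# Float addition is not associative, so summation order affects rounding.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(xs)            # accumulate left to right
backward = sum(reversed(xs)) # same values, opposite order

print(forward)
print(backward)
print(forward == backward)   # often False: the results differ in the low-order bits
```

The discrepancy is typically only in the last few bits of the mantissa, which is why it stays negligible for numerically stable computations.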