The main difference comes from rounding order of reduction difference.
It does make a small difference. Unless you have an unstable floating point algorithm, but if you have an unstable floating point algorithm on a GPU at low precision you were doomed from the start.