upvote
Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.

Compilers can also reorder operations but in practice this is rarely an issue because kernels typically synchronize frequently and this limits the ability for compilers to reorder things. This isn’t to say it doesn’t happen, but even if it does happen it’s likely because the compiler changed because the code they generate is generally run-to-run identical.

reply
You can prevent reordering with sufficient amounts of compiler abuse.

With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.

reply
deleted
reply
> will not re-order the operations behind their back?

Valid point. Floating point summation is not always commutative.

reply
Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
reply
Not only that but heterogeneous clusters (inevitable at a large enough scale) will also have non-deterministic outputs. So it's great that they wrote kernels to make the forward pass deterministic but getting rid of it entirely at data center scale would mean that they'd also have to do this type of work across cluster nodes as well to maintain "cluster" invariance & not just batch invariance.
reply