> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs; it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications in a fixed, reproducible way and always pick the highest-probability output, you will have a deterministic LLM.
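
To make that concrete, here is a minimal CPU-only sketch (plain Python, no GPU involved) of why the order of the additions matters. The helper name `dot_in_order` is made up for illustration. The same dot product summed in two different orders gives two slightly different floats, while repeating one fixed order gives identical bits every time.

```python
def dot_in_order(a, b, order):
    """Accumulate a[i] * b[i] in the index order given by `order`."""
    total = 0.0
    for i in order:
        total += a[i] * b[i]
    return total

a = [0.1, 0.2, 0.3]
b = [1.0, 1.0, 1.0]

forward = dot_in_order(a, b, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = dot_in_order(a, b, [2, 1, 0])  # (0.3 + 0.2) + 0.1
again = dot_in_order(a, b, [0, 1, 2])     # same order as `forward`

print(forward == backward)  # False: floating-point addition is not associative
print(forward == again)     # True: a fixed order is reproducible
```

The difference is only in the last bit here, but in a long generation a last-bit difference in a logit can occasionally flip which token comes out on top.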

reply
It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.
reply
Actually, Google's TPUs are also deterministic!
reply
You can tell GPUs what order to do math instructions in.
reply
You don't have to sample uniformly among tied maxima. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature has no effect on that.
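
A rough sketch of that tie-breaking rule, in plain Python. The function name `greedy_pick` is hypothetical and not taken from any library. The strict `>` comparison means a later, equal value never replaces an earlier one, so ties always resolve to the lowest index.

```python
def greedy_pick(logits):
    """Return the index of the largest logit; ties go to the lowest index."""
    best = 0
    for i in range(1, len(logits)):
        # Strict '>' keeps the earliest maximum when values are equal.
        if logits[i] > logits[best]:
            best = i
    return best

print(greedy_pick([0.1, 0.7, 0.7, 0.2]))  # 1
```

This removes the sampling step as a source of randomness, but it does nothing about logits that differ from run to run in the first place.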
reply
> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...

reply
Under these settings, the only matrix multiplication the docs list as deterministic is the sparse-dense product:

> torch.bmm() when called on sparse-dense CUDA tensors

And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.

reply
Oh, thanks, that’s interesting, I thought it covered that too!
reply