> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If two or more logits are tied for the maximum, the tied tokens stay equally likely at any temperature, so even at zero I sample uniformly at random among them. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly between the first and the last element.
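
A minimal sketch of that limit, using only the Python standard library. The function name `softmax_with_temperature` is made up for illustration and is not taken from any particular library; it just evaluates the temperature-scaled softmax at smaller and smaller positive temperatures.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; subtracting the max keeps exp() from overflowing."""
    if temperature <= 0:
        raise ValueError("T = 0 is only defined as a limit, so require T > 0 here")
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Two logits are tied for the maximum, as in the example above.
logits = [1.0, 0.0, 1.0]
for t in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 4) for p in probs]}")
```

As the temperature shrinks, the middle entry vanishes while the two tied entries stay equal to each other, so the mass splits evenly between them instead of collapsing onto a single token.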

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
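
The underlying floating-point effect can be reproduced on a CPU in plain Python. This is only a sketch of why summation order matters, not a model of what any GPU kernel actually does; the helper `sum_in_order` is hypothetical.

```python
def sum_in_order(values, order):
    """Add the same numbers, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Small case: regrouping three ordinary decimals changes the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

# Larger magnitude gap: the 1.0 is absorbed by rounding in one ordering only.
values = [1e16, 1.0, -1e16]
print(sum_in_order(values, [0, 1, 2]))
print(sum_in_order(values, [0, 2, 1]))
```

A matrix multiplication is many such sums, so if the accumulation order differs between runs, the resulting logits can differ in their low bits, and a near-tie between two tokens can flip.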

by jstanley, 1 day ago

> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.

by vbarrielle, 1 day ago

It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.

by 317070, 1 day ago

Actually, Google's TPUs are also deterministic!

by Dylan16807, 1 day ago

You can tell GPUs what order to do math instructions in.

by EvgeniyZh, 1 day ago

You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing about it.
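
The lowest-index tie-break could look like the sketch below. The function name `lowest_index_argmax` is hypothetical; the only point is that a strict comparison keeps the earliest maximum, so ties are resolved the same way every time.

```python
def lowest_index_argmax(logits):
    """Return the lowest index among all entries equal to the maximum."""
    if not logits:
        raise ValueError("logits must not be empty")
    best = 0
    for i in range(1, len(logits)):
        # Strict '>' means a later entry that merely ties never replaces 'best'.
        if logits[i] > logits[best]:
            best = i
    return best

print(lowest_index_argmax([1.0, 0.0, 1.0]))
```

This removes the randomness that comes from ties, but only if the logits themselves are reproducible from run to run.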

by DougBTX, 1 day ago

> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...

by vbarrielle, 1 day ago

The matrix multiplication is only deterministic for sparse-dense products under these settings:

> torch.bmm() when called on sparse-dense CUDA tensors

And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.

by DougBTX, 1 day ago

Oh, thanks, that’s interesting, I thought it covered that too!

by easygenes, 1 day ago

There are. If the kernels are nondeterministic (e.g. because of timing issues), there are minor changes between runs on a single system, even with greedy decoding enabled (which is typically what temperature=0 achieves).

by IshKebab, 1 day ago

Setting the temperature to 0 should give deterministic results, but that's not any better: it just hides the huge variance by taking only one sample.

by croes, 1 day ago

So you would always get the same result, but it could be the wrong one.

by srdjanr, 1 day ago

Of course, nothing can guarantee the right answer from LLMs.

by valzam, 1 day ago

I mean, the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.

by aesthesia, 1 day ago

No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.
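
A sketch of what such a special case might look like, assuming a toy sampler written against the Python standard library. `sample_token` is a hypothetical name and does not reproduce any specific inference engine's code.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Hypothetical sampler: greedy at T == 0, softmax sampling otherwise."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Special case: skip the division by zero and take the largest logit.
        # max() returns the first maximal index, so ties go to the lowest index.
        return max(range(len(logits)), key=lambda i: logits[i])
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sampled run is repeatable
logits = [2.0, 1.5, 0.1]
greedy = {sample_token(logits, 0, rng) for _ in range(1000)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(1000)}
print(greedy, sampled)
```

With the temperature-0 branch, every call returns the same index for the same logits; at temperature 1 the same logits yield more than one distinct token across repeated calls.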

by dvt, 1 day ago

This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into like 1e-10 or some super tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding" which makes the decoder about as deterministic as it can get.
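
To illustrate the ordering point for top_p, here is a small sketch with invented helper names (`softmax`, `nucleus`) and made-up logits. It assumes a plain nucleus filter and is not meant to match how any particular sampler is implemented.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the max subtracted for stability."""
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def nucleus(probs, top_p):
    """Smallest set of highest-probability indices whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.0]
top_p, temperature = 0.8, 0.5

# Order A: filter on the unscaled distribution, then apply temperature.
kept_before = nucleus(softmax(logits), top_p)
# Order B: apply temperature first, then filter.
kept_after = nucleus(softmax(logits, temperature), top_p)
print(kept_before, kept_after)
```

In this toy case the two orders keep different candidate sets, so the same temperature and top_p settings can produce different sampling behaviour depending on the implementation.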