Well, in theory, temperature 0 doesn't really exist. Mathematically, as temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).
However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a Dirac delta function, i.e. in a discrete setting (like for LLMs with a finite set of output tokens), the probability P becomes 1 for the argmax and 0 for everything else. Only in coding practice is it easier to implement T=0 as a simple if check that directly chooses the argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero, in both theory and practice, turns the usual probability function into greedy sampling.
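For anyone who wants to see the "separate if branch" concretely, here is a minimal sketch in plain Python. The function names (`softmax_with_temperature`, `sample_token`) and the example logits are made up for illustration; real inference engines will differ in the details.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Return a token index; temperature == 0 is handled as a separate greedy branch."""
    if temperature == 0:
        # Greedy decoding: skips the formula below, which would divide by zero.
        # max() returns the first maximal index, so exact ties go to the lowest index.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical logits
for t in (1.0, 0.1, 0.01):
    # The distribution gets closer to one-hot as t shrinks.
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
print(sample_token(logits, 0, random.Random(0)))  # 0
```

With a unique maximum, the small-temperature rows approach the one-hot vector that the T=0 branch returns directly.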
In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
Any two tokens ending up with the exact same logit is very unlikely, but not impossible; and as the number of output tokens grows, the odds that it will eventually happen get higher and higher.
I suppose, to ensure determinism, rank by logit then token ID, so you still have a deterministic winner even if occasionally two tokens get precisely identical logits.
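A minimal sketch of that tie-break, assuming the logits are a plain Python list indexed by token ID (`greedy_pick` is a made-up name, not any library's API):

```python
def greedy_pick(logits):
    """Pick the highest logit; on exact ties, prefer the smallest token ID."""
    # Sorting key: larger logit first (hence the negation), then smaller token ID.
    return min(range(len(logits)), key=lambda token_id: (-logits[token_id], token_id))

print(greedy_pick([0.5, 1.25, 1.25, -3.0]))  # 1
```

Making the token ID part of the key means the winner does not depend on the order in which candidates happen to be visited.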
This is exceedingly unlikely, as training will only push one of them up for any individual sample. There are likely some pathological situations that could end up with that situation, maybe, but it is pretty unlikely in a general case.
If there's one counterexample, it's not really deterministic.
My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.
If your argument is that the danger of equal values being selected inconsistently breaks determinism, that's a trivial problem to solve.
Any non-infinite precision numbering system is by definition at the limits of its precision when equal values occur. If you need to order such values you can extend the precision and add on a deterministically unique tiny value (position, order encountered, etc.). Your original values stay in the same precision range but they are now unique.
It's usually more likely that you want to sacrifice a little precision for determinism, so you can quantise to free up the range where you apply the unique ID.
For example, if you had an array of 256 fp32 values but you required them to be unique, you can lop off 8 bits of mantissa and replace them with each value's index in the array. Every value is then unique.
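A rough sketch of the 256-value fp32 trick using only the standard library. The helper names are invented for illustration, and it assumes finite, nonzero inputs (overwriting mantissa bits of an infinity would turn it into a NaN, and for negative values a larger index means a slightly more negative number).

```python
import struct

def float_to_bits(x):
    """Bit pattern of x rounded to IEEE 754 binary32, as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Inverse of float_to_bits: interpret 32 bits as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def tag_with_index(values):
    """Overwrite the low 8 mantissa bits of each fp32 value with its index (0..255)."""
    if len(values) > 256:
        raise ValueError("8 bits can only encode 256 distinct indices")
    return [bits_to_float((float_to_bits(v) & 0xFFFFFF00) | i)
            for i, v in enumerate(values)]

values = [1.5, 1.5, 1.5, 2.0]  # three exact duplicates
tagged = tag_with_index(values)
print(tagged)
print(len(set(tagged)) == len(tagged))  # True
```

The cost is the one the comment describes: values that originally differed only in those low 8 bits are now ordered by index rather than by their true magnitude.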
Granted token dictionaries make for some fairly hefty indexes now, but the principle applies in general, it's easily solvable if you are prepared to spend some precision or do some extra calculation.
In one thinking trace of 10k tokens, with fp16 or bf16 logits, I don't reckon a collision is rare? There are only 65k floating point numbers with that accuracy. And an agent can quickly rack up 100k tokens, so while not every token will have such a collision of equiprobable logits, it is not rare.
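To make the collision point concrete, here is a small sketch that rounds some made-up logits to fp16 using the standard library's half-precision `struct` format. It only shows how nearby values collapse onto the same number; it says nothing about how often that happens in a real model.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE 754 binary16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical logits: the top two differ slightly at higher precision.
logits = [10.001, 10.003, 9.5]
rounded = [to_fp16(x) for x in logits]
print(rounded)  # [10.0, 10.0, 9.5]

top = max(rounded)
tied = [i for i, x in enumerate(rounded) if x == top]
print(tied)  # [0, 1]
```

Between 8 and 16 the fp16 grid spacing is 2^-7 (about 0.0078), so any two logits closer than that near the top can end up exactly tied.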
We just chose to treat this function as a "staircase function" where f(0) = lim t->0 f(t), with the general formula used for f(t != 0).
I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".
What you can do in math is talk about the limit of a series of fractions as the denominator approaches 0, and that's where you get some relation to infinity or -infinity. But the limit can also be any other number, if the numerator also gets closer to 0; or it can not exist, if the function oscillates.
For example, if you accepted that n/0 = inf just like n/1 = n, then you'd conclude that n/0 + 3 = inf + 3 = inf, so n/0 + 3 = n/0, so 3 = 0. Or you'd want to do weird things like asking what is sin(inf).
Which is the case with softmax function, as for T=0 you end up with a fraction that either becomes 0/0 or inf/inf [0]. So you do need branching as floating point arithmetic is not gonna get you there.
[0] except for weights that are exactly 0
edit: thinking more about it, one could always express the softmax formula in ways that this could work with floating point arithmetic but it would be very inefficient and sort of pointless
That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to".
Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)
Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
I did this with several model apis.
GPU processing is not going to be the same from what I read but also the AI backend is doing a lot of fancy batching resulting in another layer of randomness.
But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.
They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.
(same for GPU/TPU, ...)
You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input.
But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.
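A toy sketch of the seeded-sampler idea. The per-step distributions are hard-coded here, which quietly assumes the model reproduces them bit-for-bit on every run; `sample_sequence` is a made-up helper, not a real API.

```python
import random

def sample_sequence(probs_per_step, seed):
    """Sample one token index per step using a sampler seeded by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in probs_per_step]

# Hypothetical next-token distributions for three decoding steps.
steps = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [0.4, 0.4, 0.2]]
print(sample_sequence(steps, seed=42) == sample_sequence(steps, seed=42))  # True
```

The sampler is the easy half: same seed plus same distributions gives the same tokens. Whether the distributions themselves are identical across runs is a separate question.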
The implementation does not often differ run by run.
If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. a newer GPU offers some feature usable to speed up calculations which an older model lacked; the same code can use the feature when it is available and fall back to a slower alternative if it isn’t). The larger your scale, the greater the odds it will happen.
Provided:
* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)
* Your kernels are deterministic
* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)
Upshot:
Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.
To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?
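A quick way to put numbers on that, as a sketch with made-up logits (`top_probability` is a name invented here): compute how much probability mass the argmax token gets at each temperature.

```python
import math

def top_probability(logits, temperature):
    """Probability assigned to the argmax token by softmax(logits / temperature)."""
    m = max(logits)
    weights = [math.exp((x - m) / temperature) for x in logits]
    return max(weights) / sum(weights)

logits = [2.0, 1.5, 1.0]  # hypothetical logits with a modest lead for token 0
for t in (0.9, 0.5, 0.1):
    print(t, round(top_probability(logits, t), 4))
```

For these particular logits the top token gets a bit over half the mass at 0.9 and more than 99% at 0.1, so low-temperature samples mostly match the greedy answer, though not always.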
E.g:
“Where is the Eiffel Tower Located? One word only.”
“Where is the Effel Tower located? One word only.”
“Where is the Eiffel Tower located? One wor only.”
I’d be very surprised if those got different answers from even a small local model at temp 0.
But for anything else I wouldn't.
The entire chain will be affected from the different tokenization on down. Even if it lands in roughly the same semantic area, it doesn't mean it will land there with anything like the same syntactic selections. Anywhere there were multiple near-tokens could easily select a different route based on even minor fluctuations in the starting conditions. It's chaotic.
"Score this resumé. Applicant: Jim ..."
"Score this resumé. Applicant: Greg..."
Is it obvious to anyone that these will have the same modal response?
Give it a try. 4 letter difference. Add a few hundred tokens describing the task, such that the change becomes a tiny fraction of the input.
Discontinuities everywhere.
I don't buy the story that the old AI died primarily due to the cost of knowledge base maintenance [1], but rather the lack of a universal system of reasoning over uncertainty.
For me it's a running gag that Spock was always saying things like "Captain, we have a 21% probability of surviving this mission" when Bayes teaches us your probability distribution has a probability distribution, "we have a β(5,1) chance of surviving this mission" is more like it.
To that end it wouldn't be too crazy to run a resume through that machine 100 times and look at the probability distribution of the score.
[1] then again I am the kind of maniac who will sort images on a tablet lying in bed until my visual system malfunctions
It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly at random from them at any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.
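The [1, 0, 1] example is easy to check numerically. A minimal sketch, with `softmax_t` as an ad hoc helper that shifts by the max so small temperatures don't overflow:

```python
import math

def softmax_t(logits, temperature):
    """softmax(logits / temperature), shifted by the max to avoid overflow."""
    m = max(logits)
    w = [math.exp((x - m) / temperature) for x in logits]
    s = sum(w)
    return [x / s for x in w]

for t in (1.0, 0.1, 0.01, 0.001):
    print(t, softmax_t([1, 0, 1], t))
# The last row is exactly [0.5, 0.0, 0.5]: the mass splits evenly between the tied maxima.
```

So the limit is a uniform distribution over the tied maxima, not a single deterministic winner. Any determinism at T=0 comes from how the implementation breaks the tie.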
Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs accumulate the sums in matrix multiplications in arbitrary order, and because floating-point addition is not associative, this has a huge impact on the logits coming out of the neural network.
But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
That’s user-controlled too, not an inherent property of GPUs:
https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
> torch.bmm() when called on sparse-dense CUDA tensors
And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.
Several of my claimed AI-expert colleagues repeat this as though it's gospel. I've heard "set the temperature to 0 so we get consistent results" more times than I can count.
Yeah, it can work, but it is subject to so many potential pitfalls that you can't casually assume it will. It's a property you have to actively design for and rigorously test to be sure the system can deliver it for some particular scenario.
Not that I'm defending AI, but HR departments rarely knew how their ATS ranked and sorted applicants before they were AI powered.
It's somewhat ironic that this "in depth" piece was written by an LLM as well.
You're correct. The confusion arises because we use the word "non-deterministic" when we mean "probabilistic".
I tried to explain it better: https://www.lelanthran.com/chap15/content.html
So “purely stochastic” overstates it a bit: the distribution is computed deterministically, and you choose whether to sample from it or not.
IEEE 754 only specifies precision requirements for certain operations, not precise bit patterns (e.g. for exponentials). So, at least in principle, the same hardware performing the same operation could produce different results at different times, as long as they are close enough to the theoretical answer. I'm not sure if any hardware actually works like this.
Under IEEE 754, basic arithmetic operations such as addition and multiplication are also not associative, so any reordering (which is common when batching multiple queries at the same time) will introduce indeterminacy from the perspective of your own query (that is, the result for your query will change depending on what other query happens to be processed at the same time, which is not under your control).
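The non-associativity is easy to see even on a CPU with ordinary doubles. A tiny illustration (plain Python floats, nothing GPU-specific):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

A parallel reduction that groups the same terms differently from run to run can therefore return slightly different sums, and a near-tie between two logits can flip as a result.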
Finally, even if we take the case when a query is processed alone, and even if one particular hardware is completely deterministic, the result will be different on different hardware - which can again look like non-determinism if you're sending your query to a load balancer.
So, the math for LLMs is deterministic in theory, but implemented with non-deterministic approximations & optimizations in practice, and their results are then normally used only as a probability distribution to be sampled from.
We expect computers to be consistent despite running programs that are not designed to be consistent.
This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.
But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.
If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.
Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical tween goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".
Far worse would be different humans having the same weights.
but moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. people generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. it is assumed the people have some static knowledge of a topic in this scenario.
If you want to consider a statistical examination of how people answer tests and how we assess knowledge and other things in people through surveying you can read about item response theory and rasch analysis.
using low temperature is more deterministic, but the cost is the model becomes "dumber"
max nats = max entropy + energy / temperature
Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats. After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.
That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
the variance is caused by the bad evaluation prompt
if you ask "what is the capital of France" you'll always get Paris, with any (non-extreme) temperature
I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.
Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking
*scores are generated with AI, mistakes may be made, use only as a guide and verify results
nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect".