by dvt 1 day ago

by miki123211 1 day ago

In theory, temperature 0 does make the LLM deterministic.

Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).
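A minimal sketch of that branch, in Python. The function name `sample_token` and the example logits are made up for illustration; real inference engines differ in the details.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (illustrative sketch only)."""
    if temperature == 0:
        # Special case: greedy decoding. Returning the argmax directly
        # avoids the division by zero in the general formula below.
        # max() returns the first maximum, so exact ties go to the lowest index.
        return max(range(len(logits)), key=lambda i: logits[i])
    # General case: softmax over logits / temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
greedy = sample_token(logits, 0, rng)                        # always index 0
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]  # varies with the RNG state
```

The `temperature == 0` path never touches the random number generator, which is why it is often described as deterministic given identical logits.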

However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

by sigmoid10 1 day ago

>in theory theory, temperature 0 doesn't really exist.

It very much does exist, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero turns the usual probability function into greedy sampling, in both theory and practice.

by 317070 1 day ago

> Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.

In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
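The tie case can be written out explicitly. The notation here is my own, not from the thread: logits $z_1, \dots, z_n$, their maximum $m = \max_j z_j$, and the set of maximisers $M = \{ i : z_i = m \}$.

```latex
p_i(T) = \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}}
       = \frac{e^{(z_i - m)/T}}{\sum_{j} e^{(z_j - m)/T}}
\qquad\Longrightarrow\qquad
\lim_{T \to 0^+} p_i(T) =
\begin{cases}
  1/|M| & \text{if } i \in M, \\
  0     & \text{otherwise.}
\end{cases}
```

Every term with $z_j < m$ has a negative exponent and vanishes as $T \to 0^+$, while each of the $|M|$ maximal terms equals 1. With a unique maximum this is the one-hot distribution; with logits $(1, 0, 1)$ it is $(1/2, 0, 1/2)$.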

by sigmoid10 1 day ago

That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically.

by skissane 1 day ago

> That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal

Any two tokens ending up with the exact same logit is very unlikely, but not impossible; and as the number of output tokens grows, the odds that it will eventually happen get higher and higher.

I suppose, to ensure determinism, rank by logit then token ID, so you still have a deterministic winner even if occasionally two tokens get precisely identical logits.
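Something like this, as a rough Python sketch (the name `greedy_pick` is just for illustration):

```python
def greedy_pick(logits):
    """Return the index of the largest logit; exact ties go to the lowest token ID."""
    # Compare by logit first, then prefer the smaller token ID on a tie.
    return max(range(len(logits)), key=lambda token_id: (logits[token_id], -token_id))

print(greedy_pick([0.25, 1.5, 1.5, -3.0]))  # 1: tokens 1 and 2 tie, the lower ID wins
```

This only makes the choice reproducible for a given set of logits. It does nothing about the logits themselves differing between runs.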

by spott 1 day ago

You aren't looking for a random set of tokens that have the exact same logit, you are looking for the largest n tokens to have the exact same probability.

This is exceedingly unlikely, as training will only push one of them up for any individual sample. There are likely some pathological situations that could end up with that situation, maybe, but it is pretty unlikely in a general case.

by StilesCrisis 1 day ago

"Makes unlikely" is very different from "prevents."

If there's one counterexample, it's not really deterministic.

by rkozik1989 1 day ago

Exactly. Consider the scenario where laws are at play and violating them could cost companies thousands. Recently my father received a 'request for address' letter addressed to me at his nursing home; the building has always been a nursing home, and he's in his mid-70s. That's very obviously a violation of the Fair Debt Collection Practices Act. Imagine the implications if the law firm in question used an AI-assisted data-enrichment product to find this information. That SaaS company is liable not only to that one law firm but to every law firm that uses its software. It's potentially a federal class action lawsuit.

My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

by Lerc 1 day ago

>My point is, deterministic logic matters in certain circumstances 100% of the time. Forcing the LLM to make something unlikely is not good enough because a series of mistakes could very quickly bankrupt the company.

If your argument is that the danger of equal values being selected inconsistently breaks determinism, that's a trivial problem to solve.

Any non-infinite precision numbering system is by definition at the limits of its precision when equal values occur. If you need to order such values you can extend the precision and add on a deterministically unique tiny value (position, order encountered, etc.). Your original value stays in the same precision range but they are now unique.

It's usually more likely that you want to sacrifice a little precision for determinism, so you can quantise to allocate the range where you apply the unique ID.

For example, if you had an array of 256 fp32 values but you required them to be unique, you could lop off 8 bits of mantissa and replace them with each value's index in the array. Every value is then unique.

Granted token dictionaries make for some fairly hefty indexes now, but the principle applies in general, it's easily solvable if you are prepared to spend some precision or do some extra calculation.
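A sketch of the 256-value example in Python, using `struct` to get at the binary32 bit pattern. The helper names are hypothetical, and the sketch assumes finite inputs (overwriting mantissa bits of an infinity would turn it into a NaN).

```python
import struct

def f32_bits(x):
    """Bit pattern of x rounded to IEEE 754 binary32, as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def tag_with_index(values):
    """Overwrite the low 8 mantissa bits of each value with its array index."""
    assert len(values) <= 256
    return [bits_f32((f32_bits(v) & 0xFFFFFF00) | i) for i, v in enumerate(values)]

tagged = tag_with_index([0.75] * 256)   # 256 identical inputs
unique_count = len(set(tagged))         # 256 distinct outputs
```

For positive values a larger index gives a slightly larger float, so an argmax over tagged values resolves ties toward the highest index; for negative values the direction flips. A vocabulary with more than 256 entries needs correspondingly more index bits.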

by 317070 1 day ago

> for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal.

In one thinking trace of 10k tokens, with fp16 or bf16 logits, I don't reckon a collision is rare? There are only 65k floating point numbers with that precision. And an agent can quickly rack up 100k tokens, so while not every token will have such a collision of equiprobable logits, it is not rare.
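To make that concrete, here is a small Python sketch for fp16 only (the standard library has no bf16 format). It shows two distinct float64 values collapsing to the same fp16 value, and counts the finite fp16 bit patterns.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE 754 binary16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a, b = 12.341, 12.342                 # distinct as float64
collide = to_fp16(a) == to_fp16(b)    # both round to 12.34375

# Count finite binary16 values by walking all 2**16 bit patterns.
finite = 0
for bits in range(2 ** 16):
    exponent = (bits >> 10) & 0x1F
    if exponent != 0x1F:              # an all-ones exponent encodes inf or NaN
        finite += 1
```

How often the top two logits actually collide depends on the model and on where in the pipeline the values are rounded, so this only illustrates the mechanism, not the rate.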

by miki123211 16 hours ago

There's a difference between f(0) and lim t->0 f(t).

We just chose to treat this function as a "staircase function": f(0) is defined as lim t->0 f(t), and the general formula is used for f(t != 0).

by thaumasiotes 1 day ago

> It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easer to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.

I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".

by sigmoid10 1 day ago

The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction.

by StilesCrisis 1 day ago

Floating point defines n/0 the same as math. It's infinity as long as n isn't zero.

by simiones 1 day ago

In almost all forms of math, the value n/0 is undefined. It's definitely not infinity, for two reasons: depending on the value of n, it can be negative; and neither inf nor -inf are numbers, so they can't be the result of an equation (unless you look at transfinite equations).

What you can do in math is talk about the limit of a series of fractions as the denominator approaches 0, and that's where you get some relation to infinity or -infinity. But the limit can also be any other number, if the numerator also gets closer to 0; or it can not exist, if the function oscillates.

by StilesCrisis 1 day ago

I explicitly didn't say "infinity or negative infinity" because I didn't think that level of pedantry would be needed here on HN. I guess I was wrong.

by jdiff 1 day ago

It's not positive or negative infinity. It is simply undefined. Math has many conventions, and you can define your own convention that it does equal some flavor of infinity, but that is only a convention, and not a universal one.

by throw-the-towel 1 day ago

All discussions of mathematics assume maximal possible pedantry.

by simiones 1 day ago

That's not the problem, and this is not just pedantry. It's just not correct to say that n/0 = inf, nor even to say that positive_n / 0 = inf, in any normal math context.

For example, if you accepted that n/0 = inf just like n/1 = n, then you'd conclude that n/0 + 3 = inf + 3 = inf, so n/0 + 3 = n/0, so 3 = 0. Or you'd want to do weird things like asking what is sin(inf).

by freehorse 1 day ago

> as long as n isn't zero

Which is the case with softmax function, as for T=0 you end up with a fraction that either becomes 0/0 or inf/inf [0]. So you do need branching as floating point arithmetic is not gonna get you there.

[0] except for weights that are exactly 0

edit: thinking more about it, one could always express the softmax formula in ways that this could work with floating point arithmetic but it would be very inefficient and sort of pointless
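For what it's worth, a rough Python sketch of one such rewrite: subtract the maximum before dividing by T, so every exponent is at most zero. The function name `softmax_t` is mine, and this is only an illustration, not how any particular engine does it.

```python
import math

def softmax_t(logits, temperature):
    """Temperature softmax computed as exp((z - max) / T) / sum(...)."""
    m = max(logits)
    # Every exponent is <= 0, so exp() can underflow to 0.0 but never overflow.
    weights = [math.exp((z - m) / temperature) for z in logits]
    total = sum(weights)   # >= 1.0 because the maximum contributes exp(0) = 1
    return [w / total for w in weights]

nan_result = math.inf / math.inf               # IEEE 754: inf/inf is NaN
near_zero = softmax_t([3.0, 1.0, 2.0], 1e-6)   # [1.0, 0.0, 0.0]
tie = softmax_t([1.0, 0.0, 1.0], 1e-6)         # [0.5, 0.0, 0.5]
```

This works for any tiny positive T, but at exactly T=0 the maximum's exponent becomes 0/0, which is NaN under IEEE 754 (Python raises `ZeroDivisionError` instead), so the explicit branch is still needed there.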

by teiferer 1 day ago

> Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0.

That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to".

Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)

by sobellian 1 day ago

Even if it's deterministic that doesn't mean it isn't arbitrary. I can achieve determinism at any temperature by saving the seed. But that wouldn't make rejects feel much better knowing that if a bit was flipped in an arbitrary seed they would be scored differently.

by msdz 1 day ago

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

by johnsmith1840 1 day ago

I ran large-scale tests at temp 0 and there was still randomness with the same prompt inputs coming in.

I did this with several model APIs.

From what I've read, GPU processing is not going to be identical from run to run, and the AI backend is also doing a lot of fancy batching, which adds another layer of randomness.

by pmarreck 1 day ago

It is not deterministic because the order of computations in a typical multithreaded system is not deterministic and also because when combined with the devil that is IEEE754, it gets even less deterministic.

by lelandbatey 1 day ago

As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap that out for something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).

But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.

They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.

by toolslive 1 day ago

It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.

by rightbyte 1 day ago

How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b, etc.

by StilesCrisis 1 day ago

IEEE-754 doesn't mandate exact results for functions like exp(x). It mandates things like "within 2 ULP of the true answer." Hardware vendors are free to implement these functions in any way that meets the error tolerance.
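As a rough illustration of why a last-bit difference matters, here is a Python sketch where two "implementations" of the same logit differ by exactly one ULP and the greedy pick changes. The setup is contrived on purpose, and the `argmax` helper is my own.

```python
import math

x = math.exp(1.0)
neighbor = math.nextafter(x, math.inf)   # next representable double above x
gap = neighbor - x                       # equals math.ulp(x)

def argmax(values):
    """Index of the largest value; the first index wins an exact tie."""
    return max(range(len(values)), key=lambda i: values[i])

run_a = argmax([x, x])          # exact tie: index 0
run_b = argmax([x, neighbor])   # second value is 1 ULP larger: index 1
```

Both results for the second logit are within one ULP of each other, yet they select different tokens. `math.nextafter` and `math.ulp` need Python 3.9 or newer.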

by hansvm 1 day ago

Batching order, as you mentioned, matters a lot, and for any heavily optimized kernels it will change from one machine to the next. You also have the choice of backend numerical library from, e.g., different OS versions. There are floating-point bugs from time to time, especially in GPUs. Many operations (like transcendentals) are usually given a couple bits of wiggle room in the result. Another program executing could have changed the floating-point rounding mode on one device. More aggressive ML optimizers might automatically apply various forms of reduced precision to the requested high-level operation. If you have enough optimizations enabled, you might non-deterministically get compiled instructions like fmadd so that any one build of your library is deterministic (excluding other ideas mentioned above) but different machines with different builds (because of a staged rollout, different architectures, engineering mistakes, etc) can have different outputs. And so on.

by toolslive 1 day ago

While the IEEE 754 standard ensures that individual basic operations are deterministic and strictly bounded, it does not guarantee that an entire program will yield bit-identical results on all CPUs.

CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.

(same for GPU/TPU, ...)

by vlovich123 1 day ago

Parent is correct - the math is very deterministic if you can guarantee it’s running repeatedly on the same machine and you’re not processing “random” requests in parallel. The compiler is irrelevant because once the code is generated it’s not getting recompiled and thus isn’t a source of non determinism (and generally if you don’t touch the math the compiler will consistently emit the same underlying machine code).

by simiones 1 day ago

This sub-thread was about cloud environments, where different requests may be served by different hardware. And it's in fact very likely that there will be a mix of different hardware from different vendors, in any particular LLM cloud for now.

by throwaway173738 1 day ago

It is, after all, a fundamentally voltage-based process, and the logical “no-man’s land” is chosen to limit the likelihood of a weak component producing faulty logic, but it’s impractical to run through the set of all possible starting states and to verify that after an unbounded number of clock steps the machine reaches a predictable end state on all of the devices being manufactured.

by microtonal 1 day ago

Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.
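A tiny Python demonstration of the non-associativity, with values picked to make the effect obvious. A real reduction has thousands of terms and the differences are usually in the last few bits, but the mechanism is the same.

```python
import math
from itertools import permutations

values = [1e16, 1.0, -1e16, 1.0]

def left_to_right(xs):
    """Plain sequential float64 accumulation in the given order."""
    total = 0.0
    for x in xs:
        total += x
    return total

# Sum the same four numbers in every possible order.
results = {left_to_right(order) for order in permutations(values)}
exact = math.fsum(values)   # correctly rounded sum: 2.0
```

Here `results` contains more than one value even though the inputs never change, because 1e16 + 1.0 rounds back to 1e16 in float64. Fixing the accumulation order removes the variation, which is the "deterministic but generally less efficient" trade-off.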

by vlovich123 1 day ago

They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.

by nok22kon 1 day ago

that's incorrect in the presence of batching. it's tough work making it truly deterministic:

https://x.com/FireworksAI_HQ/status/2069873437217276015

by vidarh 1 day ago

It's not that hard. What is hard is making it truly deterministic and retain high throughput.

by gaflo 1 day ago

PRNG is deterministic.

by nullc 1 day ago

If you make an exact integer implementation and run with temp=0 it's deterministic.

You don't even need temperature 0: just make a random seed for the sampler part of the input, and then it's deterministic as a function of the input.

But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain, so it's prone to feedback on its own noise.
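A sketch of the seed-as-input idea in Python. The fixed `weights` list is a stand-in for a model's next-token probabilities, and `sample_sequence` is a hypothetical helper; in a real model the distribution changes at every step and would itself have to be bit-reproducible.

```python
import random

def sample_sequence(weights, seed, length):
    """Draw `length` token indices from a fixed distribution using a seeded PRNG."""
    rng = random.Random(seed)   # private generator, no shared global state
    tokens = range(len(weights))
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(length)]

weights = [0.6, 0.3, 0.1]       # stand-in for next-token probabilities
first = sample_sequence(weights, seed=1234, length=20)
second = sample_sequence(weights, seed=1234, length=20)   # identical to first
other = sample_sequence(weights, seed=4321, length=20)    # a different draw sequence
```

Sampling stays stochastic in the statistical sense, but the output is a pure function of (weights, seed).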

by chrisjj 1 day ago

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run

The implementation does not often differ run by run.

by skissane 1 day ago

> The implementation does not often differ run by run.

If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. a newer GPU offers some feature usable to speed up calculations that an older model lacked; the same code can use the feature when it is available and fall back to a slower alternative if it isn’t). The larger your scale, the greater the odds it will happen.

by vessenes 1 day ago

To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices.

Provided:

* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)

* Your kernels are deterministic

* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)

Upshot:

Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.

To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?

by Dylan16807 1 day ago

Even then it's deterministic in the way a hash function is deterministic. Change one letter and you can get a completely different output. What people actually want is something continuous.
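To illustrate the analogy with an actual hash function (SHA-256 from Python's standard library; the input strings are made up):

```python
import hashlib

a = hashlib.sha256(b"Score this resume").hexdigest()
b = hashlib.sha256(b"Score this resumf").hexdigest()   # last letter changed

# Positions at which the two 64-character hex digests differ.
differing = sum(1 for x, y in zip(a, b) if x != y)
repeat = hashlib.sha256(b"Score this resume").hexdigest()
```

`repeat` always equals `a`, so the function is perfectly deterministic, yet `a` and `b` are unrelated. Whether an LLM at temperature 0 behaves this way for small input changes is the open question in this subthread, not something the sketch settles.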

by vessenes 1 day ago

Agreed on the desire for continuous behavior. That said, in a modern LLM, is this hash analogy accurate? I would be surprised if a single letter changed most zero temp force ranked outputs.

E.g:

“Where is the Eiffel Tower Located? One word only.”

“Where is the Effel Tower located? One word only.”

“Where is the Eiffel Tower located? One wor only.”

I’d be very surprised if those got different answers from even a small local model at temp 0.

by knome 1 day ago

For a single word response, perhaps.

But for anything else I wouldn't.

The entire chain will be affected from the different tokenization on down. Even if it lands in roughly the same semantic area, it doesn't mean it will land there with anything like the same syntactic selections. Anywhere there were multiple near-tokens could easily select a different route based on even minor fluctuations in the starting conditions. It's chaotic.

by sobellian 1 day ago

I don't know about single letters, but single words?

"Score this resumé. Applicant: Jim ..."

"Score this resumé. Applicant: Greg..."

Is it obvious to anyone that these will have the same modal response?

by vessenes 23 hours ago

I believe there's some data that they will have different responses if the names signify different cultural / race / gender affiliations. Here be dragons.

by forlorn_mammoth 1 day ago

"Your are a helpful/less assistant"

Give it a try. 4 letter difference. Add a few 100 tokens describing the task, such that the change becomes a tiny fraction of the input.

Discontinuities everywhere.

by vessenes 23 hours ago

But those are VERY different types of assistant. It is correct behavior that you would get different outputs in this case.

by guhcampos 1 day ago

This is it. People mistake deterministic for precise/exact/correct. It's not.

by PaulHoule 1 day ago

The whole problem of text understanding is a problem of reasoning under uncertainty, that is, you can't really be sure which witch people are talking about all the time. A person you might hire might be successful or unsuccessful at the role, no matter what hiring process you use. Two people might look at the same resume and come to different conclusions. Two patients with the same symptoms and clinical presentation might have different diseases, etc.

I don't buy the story that the old AI died primarily due to the cost of knowledge base maintenance [1], but rather the lack of a universal system of reasoning over uncertainty.

For me it's a running gag that Spock was always saying things like "Captain, we have a 21% probability of surviving this mission" when Bayes teaches us your probability distribution has a probability distribution, "we have a β(5,1) chance of surviving this mission" is more like it.

To that end it wouldn't be too crazy to run a resume through that machine 100 times and look at the probability distribution of the score.
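A sketch of what that could look like in Python. `score_resume` is a hypothetical placeholder that just draws a noisy number; in practice it would be one stochastic LLM call per invocation.

```python
import random
import statistics

def score_resume(resume_text, rng):
    """Hypothetical stand-in for one stochastic LLM scoring call (0 to 100)."""
    base = 60 + (len(resume_text) % 10)   # arbitrary placeholder, not a real model
    return max(0, min(100, round(rng.gauss(base, 8))))

rng = random.Random(42)
scores = [score_resume("example resume text", rng) for _ in range(100)]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)
deciles = statistics.quantiles(scores, n=10)   # 9 cut points
```

Reporting the spread alongside the mean makes it visible when a single score sits close to whatever cut-off is being applied.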

[1] then again I am the kind of maniac who will sort images on a tablet lying in bed until my visual system malfunctions

by fatnoah 7 hours ago

My favorite recent example was submitting a resume for a job that was almost a word-for-word description of my current title and job at a similarly sized company. Within 24 hours, I got the rejection, and several days later, a recruiter reached out to let me know that my profile looked like a great match for the role and wanted to schedule an intro call.

by aesthesia 1 day ago

A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.

by 317070 1 day ago

> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly at random from them at any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.

by jstanley 1 day ago

> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.

by vbarrielle 1 day ago

It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.

by 317070 1 day ago

Actually, Google's TPUs are also deterministic!

by Dylan16807 1 day ago

You can tell GPUs what order to do math instructions in.

by EvgeniyZh 1 day ago

You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it

by DougBTX 1 day ago

> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...

by vbarrielle 1 day ago

The matrix multiplication is only deterministic for sparse-dense products under these settings:

> torch.bmm() when called on sparse-dense CUDA tensors

And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.

by DougBTX 1 day ago

Oh, thanks, that’s interesting, I thought it covered that too!

by easygenes 1 day ago

There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with greedy decoding enabled (typically what temperature=0 achieves).

by IshKebab 1 day ago

Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.

by croes 1 day ago

So you would always get the same result, but it could be the wrong one.

by srdjanr 1 day ago

Of course, nothing can guarantee the right answer from LLMs

by valzam 1 day ago

I mean, the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.

by aesthesia 1 day ago

No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.

by dvt 1 day ago

This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into like 1e-10 or some super tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding" which makes the decoder about as deterministic as it can get.
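To show that the ordering can matter, here is a toy Python sketch comparing a top_p cut applied before and after the temperature step. The helpers `softmax` and `nucleus` are my own simplified versions and do not reproduce any specific library's semantics.

```python
import math

def softmax(logits, temperature=1.0):
    m = max(logits)
    weights = [math.exp((z - m) / temperature) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def nucleus(probs, top_p):
    """Smallest set of token indices, most likely first, whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

logits, temperature, top_p = [2.0, 1.0, 0.0], 0.5, 0.8

# Order A: truncate on the T=1 distribution, then apply temperature to the survivors.
filter_then_temp = nucleus(softmax(logits), top_p)               # [0, 1]
# Order B: apply temperature first, then truncate.
temp_then_filter = nucleus(softmax(logits, temperature), top_p)  # [0]
```

With these numbers, order A leaves two candidate tokens and order B leaves one, so the same request parameters can yield a sampler that is or is not effectively greedy.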

by mywittyname 1 day ago

> This is not correct

Several of my claimed AI-expert colleagues repeat this as though it's gospel. I've heard "set the temperature to 0 so we get consistent results" more times than I can count.

by Terr_ 1 day ago

I imagine it's much like game-developers saying: "Set a fixed seed so the player gets consistent results."

Yeah, it can work, but it is subject to so many potential pitfalls that you can't casually assume it will. It's a property you have to actively design-for and rigorously test to be sure the system can deliver it for some particular scenario.

by thesuitonym 1 day ago

> resumes are just dumped in some LLM black hole and no one really knows how it works.

Not that I'm defending AI, but HR departments rarely knew how their ATS ranked and sorted applicants before they were AI powered.

by margalabargala 1 day ago

> I'm happy to see in-depth pieces like this

It's somewhat ironic that this "in depth" piece was written by an LLM as well.

by lelanthran 1 day ago

> temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).

You're correct. The confusion arises because we use the word "non-deterministic" when we mean "probabilistic".

I tried to explain it better: https://www.lelanthran.com/chap15/content.html

by make3 1 day ago

A more spiky distribution does make the distribution closer to deterministic. That's not the point though. Even in greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.

by fluoridation 1 day ago

Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain.

by Nimitz14 1 day ago

He said it nudges it to be more deterministic. Your comment is not correct.

by bhanu786 1 day ago

Agree

by mtharrison 1 day ago

Small refinement: the underlying model isn’t stochastic at all. The forward pass is a deterministic function of the weights and input, it just produces a probability distribution over the next token. The stochasticity is an optional sampling step layered on top, not something inherent to LLMs. Greedy/argmax decoding (or temperature 0) makes the whole thing deterministic.

So “purely stochastic” overstates it a bit: the distribution is computed deterministically, and you choose whether to sample from it or not.

by simiones 1 day ago

There are more layers to this problem, if we want to get into the details. The LLM is defined in terms of floating point operations, and those are not actually fully deterministic, on most hardware and in most performant implementations.

IEEE 754 only specifies precision requirements for certain operations, not precise bit patterns (e.g. for exponentials). So, at least in principle, the same hardware performing the same operation could produce different results at different times, as long as they are close enough to the theoretical answer. I'm not sure if any hardware actually works like this.

Floating point arithmetic under IEEE 754 is also not associative for basic operations such as addition and multiplication, because each intermediate result is rounded. So any reordering (which is common when batching multiple queries at the same time) will introduce indeterminacy from the perspective of your own query (that is, the result for your query will change depending on what other queries happen to be processed at the same time, which is not under your control).

Finally, even if we take the case when a query is processed alone, and even if one particular hardware is completely deterministic, the result will be different on different hardware - which can again look like non-determinism if you're sending your query to a load balancer.

So, the math for LLMs is deterministic in theory, but implemented with non-deterministic approximations & optimizations in practice, and their results are then normally used only as a probability distribution to be sampled from.
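
The non-associativity point is easy to check on any machine with IEEE 754 doubles. A small sketch follows; the values are contrived to make the effect obvious, and the "rival logit" is purely hypothetical.

```python
def sum_in_order(values):
    """Plain left-to-right accumulation (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


# Classic case: regrouping the same three numbers changes the last bit.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False on IEEE 754 doubles

# Contrived case: the same terms accumulated in two orders.
terms = [1e16, 1.0, -1e16]
order_a = sum_in_order(terms)                           # the 1.0 is absorbed
order_b = sum_in_order([terms[0], terms[2], terms[1]])  # the 1.0 survives

# If this sum were a logit competing with a rival logit of 0.5,
# greedy decoding would pick a different token depending on the order.
rival = 0.5
pick_a = 0 if order_a > rival else 1
pick_b = 0 if order_b > rival else 1
print(order_a, order_b, pick_a, pick_b)
```

Real kernels differ by far smaller amounts, but a near-tie between two tokens is all it takes for the argmax to flip.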

by spwa4 1 day ago

[flagged]

by mahogany 1 day ago

Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?

by efromvt 1 day ago

I think it is often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable for agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc. It's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.

by castlecrasher2 1 day ago

It may seem trite, but the point is that if separate humans were assigned the same task the LLM was here, the results would be similarly non-deterministic.

by spwa4 1 day ago

Indeed: LLMs do tasks that would otherwise be assigned to humans. So when pointing out deficiencies in LLM performance they should be compared to the alternative, which also isn't perfect.

by smusamashah 1 day ago

We expect computers to be consistent, on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.

by vidarh 1 day ago

And this lies at the heart of the problem.

We expect computers to be consistent despite running programs that are not designed to be consistent.

This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.

by chrisjj 1 day ago

> This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.

by vidarh 1 day ago

The average user is familiar with games.

by chrisjj 1 day ago

Clocks too.

by newswasboring 1 day ago

Yeah, but daily tools have lots of complexity which appears as non-determinism (if we are thinking only UX, not actual determinism). For example, try moving an image in a Word doc. I have been using MS Word my entire life it seems, still don't know what the rules are lol.

by chrisjj 1 day ago

You're using a mouse? I have no problem getting reliable output from reliable input - through keyboard.

by miki123211 1 day ago

What's even worse, different humans have different weights.

If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".

by chrisjj 1 day ago

> What's even worse, different humans have different weights.

Far worse would be different humans having the same weights.

by thisisit 1 day ago

The same person is not going to give you three different answers within a span of minutes, especially when nothing fundamental has changed. People might or might not update their views depending on their biases.

by rkuodys 1 day ago

I'm pretty sure personality tests are created specifically because a single person can hold fundamentally different (or conflicting) beliefs about themselves within a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie", and both cannot be true for an average person.

by mnky9800n 1 day ago

Test retest reliability is a thing in psychometrics.

by spwa4 1 day ago

[flagged]

by mnky9800n 1 day ago

There is evidence that children will oscillate between understanding and not understanding while learning topics. Philip Sadler at Harvard published about this, but I can't find the paper I'm thinking of on his Google Scholar. Too many papers!

But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. It is assumed the people have some static knowledge of a topic in this scenario.

If you want to consider a statistical examination of how people answer tests and how we assess knowledge and other things in people through surveying you can read about item response theory and rasch analysis.

by cyanydeez 1 day ago

a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.

by ThrowawayR2 1 day ago

That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803

by WhrRTheBaboons 1 day ago

how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks.

by nok22kon 1 day ago

it's a bad idea in general to use non-1.0 temperature. there is a reason labs are strongly recommending using 1.0.

using low temperature is more deterministic, but the cost is the model becomes "dumber"

by tipsytoad 1 day ago

1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default

by programjames 1 day ago

1.0 is "natural units". If your energy corresponds to nats, you should be using temperature 1.0. If your energy corresponds to bits, you should be using temperature ln(2) ~= 0.7. The optimization pressure is

     max nats = max entropy + energy / temperature

Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats.

by 317070 1 day ago

If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher.

After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.

That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
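
For a feel of what "lower" and "higher" do to entropy, here is a single-step sketch with made-up logits. It only shows how temperature reshapes one next-token distribution; it does not simulate RL training or the compounding effect described above.

```python
import math


def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [3.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
entropies = {t: entropy_nats(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}
for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} nats")
```

Sampling below 1.0 yields a sharper distribution than the one the model emitted and sampling above 1.0 a flatter one, which is the direction of the collapse and explosion mentioned above.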

by zipy124 1 day ago

It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :)

by embedding-shape 1 day ago

Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically.

by nullc 1 day ago

If you use a model in a configuration far from where it was RLed you get no warranty. (you also get no warranty the other way, however)

by codeflo 1 day ago

It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.

by jldugger 1 day ago

Would 1.0 have fixed the wide variance in scoring?

by nok22kon 1 day ago

temperature is the wrong tool

the variance is caused by the bad evaluation prompt

if you ask "what is the capital of France" you'll always get Paris, with any (non-extreme) temperature
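
A quick way to see why: compare the top-token probability for a peaked distribution against a nearly flat one. Both logit vectors below are invented for illustration; they are not measured from any model or prompt.

```python
import math


def top_prob(logits, temperature):
    """Probability of the most likely token after temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    return max(exps) / sum(exps)


peaked = [12.0, 2.0, 1.0]  # hypothetical: one clearly right answer
flat = [1.2, 1.0, 0.9]     # hypothetical: several near-equal candidates

for t in (0.3, 1.0, 1.5):
    print(f"T={t}: peaked={top_prob(peaked, t):.4f} flat={top_prob(flat, t):.4f}")
```

With the peaked logits the top token keeps almost all of the mass across that whole range, while with the flat logits no temperature in it makes the outcome stable.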

by vidarh 1 day ago

Plenty of setups default to lower values than 1.0.

by bluechair 1 day ago

Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.

by thayne 1 day ago

I would expect that to depend on jurisdiction.

I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.

by small_scombrus 1 day ago

They don't need to actually filter/blackhole to have virtually the same effect.

Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking

*scores are generated with AI, mistakes may be made, use only as a guide and verify results

by ivan_gammel 1 day ago

In situations where you get hundreds of applications for one open position (the real market now), whatever reduces your pool to a size a human can handle works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but an LLM as a first filter can definitely do the job. You may burn fewer tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in the trash.

by 369548684892826 1 day ago

Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit.

by TeMPOraL 1 day ago

LLMs are DEI-aware: over the past few years, their vendors have all had various high-profile news stories about their models and their default biases, so it's more likely they'll heavily discriminate in favor of minority candidates, not against them. Still, in both cases it would indicate whoever is operating the system is doing a really, really lazy job. It's really not hard to test and supervise LLMs on tasks where they give you a mere 2-10x leverage, and prompt adherence today is much better than it was 3 years ago.

by BigTTYGothGF 1 day ago

Just last month: https://hai.stanford.edu/news/ai-hiring-tools-can-yield-raci...

by ivan_gammel 1 day ago

What „not so smart“ person would filter minority groups out of the process in 2026? It's more likely that a 90/10 gender imbalance will be converted to 60/40 or even 50/50. Diverse teams are more fun and stable.

by cyanydeez 1 day ago

this happened a decade ago when a US court tried to make sentencing decisions via ML. it was easily demonstrated that the training data was flawed because the justice system was flawed: the data it was trained on was weighted against minorities because it oversampled them, because, you know, police routinely oversample and poverty forces oversampling.

nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect".

by elric 1 day ago

Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.

by dgellow 1 day ago

Illegal where?
