Discussions of this type are eventually going to morph into a better understanding of how to accept ambiguity and randomness in language, and how to shape it further with other, larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.
import torch

# Same inputs, same kernel, same GPU: repeated calls give bitwise-identical results.
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0
I’m sort of surprised that Torch doesn’t have some kind of lazy evaluation thing to avoid computing anything here. I thought that was one of the nice things about all these fancy frameworks (if I wanted the computer to actually do silly things when I asked it to, I would use BLAS directly, right?).

Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.
If you were to obtain exactly the same output for a given input prompt regardless of context, that would mean the context is being ignored, which is indistinguishable from the session not maintaining any context at all, such that each prompt lands in a brand-new, empty context.
Now, what some people want are requirements like:
- Different wordings of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital", the answer should be verbatim identical.
- Prior context should not change responses in ways that don't have any interaction with that context. For instance, if the prompt is "what is 2 + 2", the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.
These kinds of requirements betray a misunderstanding of what these LLMs are.
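To make the first of those requirements concrete, here is a minimal sketch of what merely testing it would look like, assuming the OpenAI v1 Python client (the model name is a placeholder, and even with temperature 0 and a fixed seed the API only promises best-effort determinism):

# Sketch: probing "paraphrase invariance" against a chat API (hypothetical setup).
from openai import OpenAI

client = OpenAI()

def answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # greedy decoding
        seed=0,                # best-effort determinism only
    )
    return resp.choices[0].message.content

a = answer("What is the capital of France?")
b = answer("What is France's capital?")
print(a == b)  # frequently False: paraphrases are different token sequences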
For many applications, non-determinism implies "useless". This has been a long-standing issue with LDA topic models. In particular, in the legal, financial and regulatory domains, if a method is not deterministic, it may be illegal to use it, or it may lead to follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved so you can go back and reconstruct what exactly happened to a particular user in a particular second).
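For LDA specifically, the usual mitigation is to pin the seed; a minimal sketch assuming scikit-learn's LatentDirichletAllocation, with reproducibility only expected on the same library version and machine:

# Sketch: reproducible LDA topics by fixing random_state (assumes scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["rates and bonds", "bonds and credit risk", "cats and dogs", "dogs chase cats"]
X = CountVectorizer().fit_transform(docs)

def doc_topics(seed):
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    return lda.fit_transform(X)

# Same seed, same library version, same machine -> the runs should agree.
print(np.allclose(doc_topics(0), doc_topics(0)))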
Nondeterminism is what currently keeps me from working with other developers.
As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".
It is similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images, because most image models will not create the same image when given the same prompt and parameters.
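Locally run pipelines are the partial exception, since you can pin the sampler noise; a minimal sketch assuming Hugging Face diffusers with a Stable Diffusion checkpoint (the checkpoint name is just an example), and even then reproducibility only holds on the same GPU, driver, and library versions:

# Sketch: pinning the noise seed in a diffusers pipeline (assumed setup).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

gen = torch.Generator(device="cuda").manual_seed(1234)
image = pipe("a lighthouse at dusk", generator=gen).images[0]
# Re-running with a fresh generator seeded to 1234 should reproduce this image
# on the same GPU/driver/library versions; across machines, all bets are off.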
A deterministic LLM isn't going to behave appreciably differently from a non-deterministic one if your input or context varies by even a tiny bit (pun intended) each time.
I'm hoping that it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).
> For example, you might observe that asking ChatGPT the same question multiple times provides different results.
Even with 0.0 temperature, due to MoE models routing at a batch level, you're very unlikely to get a deterministic batch.
> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
The router also leaks batch-level information across sequences.
I don’t think this is correct. MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance expert load within a batch, but that's a performance optimization (just like everything else in the blog post), not the way the models are trained to work.
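For intuition, here is a toy sketch of standard top-k routing (not any particular model's implementation): the expert choice is a function of each token's own hidden state, and batch dependence only creeps in once you bolt on batch-level mechanics such as per-expert capacity limits.

# Toy top-k MoE router: expert choice depends only on the token's own hidden state.
import torch

def route(hidden, router_weights, k=2):
    # hidden: [num_tokens, d_model]; router_weights: [d_model, num_experts]
    logits = hidden @ router_weights             # per-token router logits
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_experts = probs.topk(k, dim=-1)
    return topk_experts, topk_probs              # same token -> same experts, regardless of batch

# Batch dependence appears only with extra machinery on top, e.g. a per-expert
# capacity limit that drops or reroutes overflow tokens depending on which
# other tokens happen to share the batch.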
Great post.
Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn't to say it never happens, but when it does, it's usually because the compiler itself changed; for a fixed compiler, the generated code is generally run-to-run identical.
Across revisions, you're trying to ensure a consistent floating-point environment where the operations used are deterministic and applied in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.
Valid point. Floating point addition is commutative but not associative, so the order in which partial sums are combined changes the result.
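The classic two-liner, in plain Python floats:

# Floating point addition is not associative:
print((0.1 + 1e20) - 1e20)   # 0.0  (the 0.1 is absorbed by the huge intermediate)
print(0.1 + (1e20 - 1e20))   # 0.1
# So a parallel reduction that groups the partial sums differently gives a different answer.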
TL;DR
Seed your PRNGs and call torch.use_deterministic_algorithms(True) to get the deterministic kernels. They may be slightly slower, but in practice, you probably will not notice.
Note that results will still differ between different drivers and GPUs. It would be great if NVIDIA tried harder in that regard.
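Concretely, the knobs look something like this; note that CUBLAS_WORKSPACE_CONFIG has to be set before any CUDA work happens, and some ops will simply raise an error if no deterministic implementation exists:

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS (CUDA >= 10.2)

import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)                      # seeds the CPU and all CUDA devices

torch.use_deterministic_algorithms(True)  # error out on kernels without a deterministic variant
torch.backends.cudnn.benchmark = False    # keep autotuning from picking different kernels per run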
I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.
Run two of these with the same prompts and same seed and you get the same results.
Obviously in GPU clusters with different hardware things get more complicated.
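For the single-process llama.cpp case, that is roughly the following, assuming the llama-cpp-python bindings (the model path is a placeholder):

# Sketch: reproducible sampling with llama-cpp-python on one machine.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", seed=42)   # placeholder path

out1 = llm("Write a haiku about determinism.", temperature=0.0, max_tokens=64)
out2 = llm("Write a haiku about determinism.", temperature=0.0, max_tokens=64)
print(out1["choices"][0]["text"] == out2["choices"][0]["text"])  # expected True on the same build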
"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
Your situation isn't really comparable.
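The difference is easy to probe even without an inference server; a small sketch in the spirit of the post (whether the difference is nonzero depends on your GPU and which kernels get picked):

# Probe batch (in)variance: does row 0's result depend on how many rows are in the batch?
import torch

A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

out_full = torch.mm(A, B)[:1]    # row 0 computed as part of the full batch
out_row  = torch.mm(A[:1], B)    # row 0 computed on its own
print((out_full - out_row).abs().max())  # often nonzero: different shape -> different kernel/reduction order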
This is literally one of the most knowledgeable people on the topic. I think you are the one who hasn't peeled back enough layers to connect with what they are saying.