I suggest you look up the name of the main author of TFA before assuming they don’t know what they are talking about.
This is literally one of the most knowledgeable person on the topic. I think you are the one that hasn’t peeled enough layers to connect with what they are saying.
Hold on a second. A transformer produces deterministically a probability distribution over the token alphabet from the context. Then one samples from this distribution. This is random and meant to be random.
The sampling process isn't random. If you sample with identical sampling parameters and identical values for said parameters, you will always get same results. You only start getting "non deterministic" behavior when you start using more complex systems outside the scope of your control like multi gpu systems and batch processing. One llm sampled with cash prompting off and and batch processing off will always generate same results if all values are same.
It's possible to deterministically sample from a probability distribution. For example, just seed your RNG with a constant, or with the SHA256 hash of the context.
Well yes, you can "hack" the pseudorandom number generator, but... that's not really the point when talking about determinism in LLMs is it? I mean the mathematical idea of the standard LLM is certainly truly random.
I suppose I have issues with the way "determinism" is used in the title of this article. It can mean different things to different people and in my mind stating that "Defeating Nondeterminism in LLM Inference" frames it as an actual issue with LLM inference. But its not, its an issue with LLM inference when you start using large scale inference with more complex parts such as systems which use multi gpu inference systems or batching processes and other mechanisms. It is not an issue when using an LLM without those more complex parts. Stating it this way muddies the signal and gives a false sense that this is a fundamental issue with architecture, where its an issue of the systems at scale...