The model isn't trained, it isn't differentiable (read carefully to the end: they say their model might still work if they made it differentiable, but they don't know), and it isn't clear, IMO, that it could ever be made trainable (what loss function scores a "partially correct" program / compiler, and where would such training data come from?).

You need non-linearity in self-attention because it encodes feature / embedding similarities / correlations (e.g. self-attention is kernel smoothing) and/or multiplicative interactions; it has nothing to do with determinism or nondeterminism. Also, LLMs are not really nondeterministic in any serious way: what nondeterminism there is comes from implementation tweaks and optimizations (batching, floating-point reduction order, sampling temperature), none of which are core to the architecture.
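To make the kernel-smoothing point concrete, here's a minimal NumPy sketch (my own illustration, not from the article): single-head attention is a Nadaraya-Watson-style weighted average of values, where the softmax over query/key similarity supplies the non-linear kernel weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Similarity scores between queries and keys (scaled dot product).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax turns scores into non-linear kernel weights; each row is a
    # convex combination (weights >= 0, sum to 1) keyed on similarity.
    weights = softmax(scores, axis=-1)
    # Output = kernel-smoothed average of the values.
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, embedding dim 8
out, weights = attention(X, X, X)    # self-attention: Q = K = V = X
```

Without the softmax non-linearity, `weights @ V` would collapse into a fixed linear map of the inputs; it is precisely the non-linearity that lets the mixing weights depend on feature similarity.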
