The softmax output is the probability of each candidate next token, conditioned on the inputs, as estimated from the training data. The author apparently doesn't know that and thinks it was an arbitrary choice.

The author's essay on the sigmoid has the same problem: it misses that the sigmoid comes from somewhere and isn't an arbitrary choice.
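
To make that concrete, here is a minimal sketch (mine, not the author's code) of softmax turning raw scores into a probability distribution, plus a check that the sigmoid is what you get from a two-way softmax when one score is fixed at 0. The logit values and the three-token vocabulary are made up for illustration.

```python
import math

def softmax(logits):
    """Map real-valued scores to probabilities that are positive and sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical scores for a 3-token vocabulary (illustrative values only).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print(probs, sum(probs))

# Two-class case: softmax over [x, 0] gives e^x / (e^x + 1) = sigmoid(x).
x = 1.7
two_class = softmax([x, 0.0])[0]
print(two_class, sigmoid(x))
```

So the sigmoid is the two-outcome special case of the same construction, not a separate arbitrary pick.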

IIRC, you need a bunch of formal machinery to define probability distributions when there are infinitely many outcomes (e.g., what is the probability that a random real number between 0 and 10 is less than 3?).
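
For that specific example, a rough sketch, assuming "random" means uniformly distributed on [0, 10] (the question doesn't say so; that is my assumption). Under that assumption the answer is a ratio of lengths, 3/10, and a simulation agrees. The function names here are hypothetical, just for illustration.

```python
import random

def uniform_cdf(x, a=0.0, b=10.0):
    """P(X < x) for X uniform on [a, b]: length of [a, x] over length of [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def monte_carlo_estimate(x, a=0.0, b=10.0, n=100_000, seed=0):
    """Estimate P(X < x) by drawing n uniform samples and counting hits."""
    rng = random.Random(seed)            # seeded so the run is reproducible
    hits = sum(rng.uniform(a, b) < x for _ in range(n))
    return hits / n

exact = uniform_cdf(3.0)
approx = monte_carlo_estimate(3.0)
print(exact, approx)
```

The part that needs the machinery is that every individual real number has probability zero, so you can't just add up the probabilities of single outcomes the way you do with a finite vocabulary; you assign probability to intervals instead.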