This is exactly what I was looking for, thanks!
In this particular case the embedding wouldn't tell you anything about a river bank vs. any other kind of bank. That information comes from context, which is computed in the attention layers: "bank" would attend strongly to one or more other tokens that pin down its meaning in the current context, and together they would form a feature vector in an intermediate embedding space somewhere in the deeper layers of the model. The embedding space being discussed here is just the input/output matrix that compresses a huge, highly sparse input (essentially a collection of one-hot vectors) into something denser and more compact.
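A minimal numpy sketch of both points, with a toy vocabulary and a deliberately simplified self-attention (no learned Q/K/V projections, no positional encoding, random embeddings): the static embedding of "bank" is identical in both sentences, and only after an attention step does its vector depend on its neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real model would have tens of thousands of tokens.
vocab = ["the", "river", "bank", "money", "deposit"]
V, d = len(vocab), 8

# Embedding matrix: maps a sparse V-dim one-hot onto a dense d-dim vector.
E = rng.normal(size=(V, d))

def embed(tokens):
    # one_hot @ E is just a row lookup, so we index E directly.
    ids = [vocab.index(t) for t in tokens]
    return E[ids]

def attention(X):
    # Bare-bones self-attention for illustration: similarity scores,
    # softmax over each row, then a weighted mix of the token vectors.
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

s1 = embed(["the", "river", "bank"])
s2 = embed(["the", "money", "bank"])

# The input embedding of "bank" is the same vector in both sentences...
print(np.allclose(s1[2], s2[2]))   # True

# ...but after one attention step its representation mixes in "river"
# vs. "money", so the two context vectors differ.
c1, c2 = attention(s1)[2], attention(s2)[2]
print(np.allclose(c1, c2))         # False
```

The softmax weights are what the comment calls the "strong connection" between tokens; in a real transformer these projections are learned and stacked over many layers, but the mechanism is the same.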