In this particular case the embedding wouldn't tell you anything about "river bank" vs. any other kind of bank. At that stage of the computation, this info simply isn't encoded yet. It comes later, from context, computed in the attention matrix, i.e. the only place where tokens are cross-computed along the sequence dimension. "Bank" would form a strong connection to another token (or several) that pins down its exact meaning in the current context, and together they would create a feature vector in an intermediate embedding space somewhere in the deep layers of the model.

The embedding space being talked about here is just the input/output matrix that compresses a huge, highly sparse input matrix (essentially an array of one-hot vectors glued together) into something smaller and denser. There's no real theoretical need for it; it just so happens that GPUs suck at multiplying huge sparse matrices. If we ever get LLMs designed to run on CPUs or analog circuits, you might be able to get rid of it entirely.
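To make the "compression" point concrete, here's a minimal numpy sketch (toy sizes; real models use vocabularies of ~50k–100k tokens and embedding widths in the thousands). It shows that multiplying a one-hot row by the embedding matrix is mathematically a dense matmul, but in practice it's just a row lookup:

```python
import numpy as np

vocab_size, d_model = 8, 4  # toy sizes for illustration
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model))  # embedding matrix

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# The "sparse matrix multiply" view: one_hot @ E
via_matmul = one_hot @ E
# ...which just selects row 3 of E, so real implementations do a plain lookup:
via_lookup = E[token_id]

assert np.allclose(via_matmul, via_lookup)
```

That equivalence is why the input layer can be implemented as a table lookup instead of a giant sparse matmul, and why hardware that handles sparsity well could in principle skip the dense embedding step.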