> The model cares about what you're saying, not what language you're saying it in.
How many languages is the model trained on, and how many training sentences are there per language? I suspect these numbers differ vastly, and that the cosine similarity is overwhelmingly biased by the number of sentences. What if we equalized the number of languages and the number of sentences per language in the training set? A galaxy-wise LLM, so to speak.
Also, the model can't help but care about language, because your work shows the cosine similarity diverging at the decoding (output) stage(s).
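To make the "biased by number of sentences" worry concrete, here's a toy sketch (my own illustration, not from the post): if you mean-pool sentence embeddings over an imbalanced bilingual corpus, the corpus centroid ends up dominated by whichever language contributes more sentences, so any cosine-similarity comparison against that centroid inherits the imbalance. The cluster directions and counts below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: each language's sentence vectors cluster around its own direction.
dim = 64
en_dir = rng.normal(size=dim)
es_dir = rng.normal(size=dim)

en = en_dir + 0.1 * rng.normal(size=(10_000, dim))  # 10k "English" sentences
es = es_dir + 0.1 * rng.normal(size=(100, dim))     # only 100 "Spanish" sentences

# A corpus-wide centroid is a sentence-count-weighted average,
# so it sits almost on top of the majority language's centroid.
centroid = np.vstack([en, es]).mean(axis=0)
print(cosine(centroid, en.mean(axis=0)))  # dominated by the 10k English sentences
print(cosine(centroid, es.mean(axis=0)))  # Spanish barely moves the centroid
```

Equalizing sentence counts per language (as suggested above) removes exactly this weighting effect.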
Would this be sort of like saying that the way embeddings of equivalent primitives across languages end up distributed in a vector space all follows the same principles and "laws"?
For example, if I train on a large corpus of English and, separately, on a large corpus of Spanish, will language constructs that are equivalent across the two end up represented with the same vector-space patterns in both cases?
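One standard way this "same patterns in separately trained spaces" idea gets tested is by learning a single orthogonal map between the two spaces from a handful of translation pairs and checking whether it aligns held-out words too (the orthogonal Procrustes problem). Here's a toy sketch under an invented assumption: the "Spanish" space is a hidden rotation of the "English" space plus noise, standing in for two monolingual spaces that share geometry.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: pretend the Spanish embedding space is an unknown rotation of
# the English one, plus a little noise. If two independently trained spaces
# really share structure, one linear map should align them everywhere.
dim, n_words = 32, 200
en = rng.normal(size=(n_words, dim))

Q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # hidden rotation
es = en @ Q_true + 0.01 * rng.normal(size=(n_words, dim))

# Solve the orthogonal Procrustes problem using only the first 50 word pairs:
# Q = U V^T, where U S V^T is the SVD of A^T B.
train = slice(0, 50)
U, _, Vt = np.linalg.svd(en[train].T @ es[train])
Q = U @ Vt

# Check alignment on the held-out pairs the map never saw.
test = slice(50, None)
mapped = en[test] @ Q
cos = np.sum(mapped * es[test], axis=1) / (
    np.linalg.norm(mapped, axis=1) * np.linalg.norm(es[test], axis=1))
print(cos.mean())  # high when the two spaces share geometry
```

If real monolingual spaces behaved like this toy, a map fit on a small dictionary would translate unseen words; where the spaces' geometries genuinely differ, the held-out cosines drop.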