> It does exactly the same, predicts tokens,

That is an absolutely wild claim you've made. You're being way too presumptuous.

reply
LLMs are “concept-based” too, if you can call statistical patterns that. In a multimodal model, the embeddings for text, image, and audio exist in the same high-dimensional space.
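
To make "same space" a bit more concrete, here is a toy sketch. The three vectors are made up for illustration (they are not from any real model, and real embeddings have hundreds or thousands of dimensions), and `cosine` is just a helper defined here. The point is only that once everything lives in one space, a text vector and an image vector can be compared with the same similarity measure:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d embeddings, invented for this example.
text_dog = [0.9, 0.1, 0.0]     # the word "dog"
image_dog = [0.8, 0.2, 0.1]    # a photo of a dog
audio_siren = [0.0, 0.1, 0.9]  # a recording of a siren

# Cross-modal comparisons use the exact same function.
sim_same = cosine(text_dog, image_dog)
sim_diff = cosine(text_dog, audio_siren)
print(sim_same > sim_diff)
```

With these invented numbers the text and image of the same thing end up closer together than the text and an unrelated sound, which is roughly what "concept" would have to mean here.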

We don’t yet seem to have any clue whether this is how our brains work.
