I’d recommend my method: draw out the block diagram, then dig into the math at each step! I’m the kind of person who needs to take time to ask lots of questions before stuff clicks, and if you are too, I strongly recommend it.

I picked it up from trying to teach myself that SLAM stuff. The papers are very short but highly information-dense, and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me; I used them instead as references for my drawings.

The same happened when I tried learning this GPT stuff. Karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture on my whiteboard by drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.

Not OP but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1] books.

I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...

[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...

>I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

If you're up for it, I would love to know how and why positional encodings work.

Learn about superposition and then you will see that nobody really knows why this stuff works. It's actually a good interview question to set the bar...

Well, as I suggested, working through the implementation yourself will give you that intuition. That said, I think the simplest way to explain why positional encodings are useful is that they give the transformer just enough order information to make attention meaningful, without getting in the way of the parallel, content-based comparisons it performs.

A vanilla self-attention layer treats its input as a set of token vectors. Without positional info, reordering the tokens just reorders the outputs in the same way (permutation equivariance), so attention on its own cannot tell "dog bites man" from "man bites dog". We can "fix" this problem by adding positional encodings. Text that has meaning isn't just a bag of tokens; the location and order of those tokens is what provides meaning.
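
If it helps, here is a rough sketch of that point in plain Python. It is a toy, not how any real library does it: the attention uses identity Q/K/V projections (my simplification), the token vectors are random, and the helper names (`self_attention`, `add_positions`, `close`) are made up for this example. The sinusoidal encoding is the one from the original Transformer paper; learned or rotary encodings would make the same point.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    # Single-head scaled dot-product attention.
    # Assumption: Q, K and V are the inputs themselves (identity projections).
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def sinusoidal_encoding(position, d):
    # sin on even dimensions, cos on odd ones, with geometrically spaced frequencies.
    enc = []
    for i in range(d):
        angle = position / (10000 ** (2 * (i // 2) / d))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_positions(x):
    # Add the encoding for slot `pos` to whatever token sits in that slot.
    d = len(x[0])
    return [
        [v + p for v, p in zip(tok, sinusoidal_encoding(pos, d))]
        for pos, tok in enumerate(x)
    ]


def close(a, b, tol=1e-9):
    # Element-wise comparison of two lists of vectors.
    return all(abs(u - v) < tol for ra, rb in zip(a, b) for u, v in zip(ra, rb))


random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
perm = [2, 0, 3, 1]  # a reordering that moves every token to a new slot
shuffled = [tokens[i] for i in perm]

# Without positions: shuffling the input only shuffles the output rows.
out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
no_pos_equivariant = close(out_shuffled, [out[i] for i in perm])

# With positions: a token gets a different vector in a different slot,
# so the two outputs are no longer just reorderings of each other.
out_pos = self_attention(add_positions(tokens))
out_pos_shuffled = self_attention(add_positions(shuffled))
with_pos_equivariant = close(out_pos_shuffled, [out_pos[i] for i in perm])

print(no_pos_equivariant, with_pos_equivariant)
```

The first flag tells you attention alone can't see order; the second tells you that adding a position-dependent vector to each token is enough to change that, while the attention computation itself stays exactly the same.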
