Kaparthy's notes on improving nanochat [1] are one of my favorite blog-like things to read. Really neat to see which features have how much influence, and how the scaling laws evolve as you improve the architecture
There's also modded-nanogpt which turns the same kind of experimentation into a training speedrun (and maybe loses some rigor on the way) [2]
1 https://github.com/karpathy/nanochat/blob/master/dev/LOG.md