This is a great question. You definitely aren't training this to use it; you're training it to understand how things work. It's an educational project: if you're interested in experimenting with things like distributed training techniques in JAX, or preference optimisation, it gives you a minimal, hackable library to build on.
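To give a flavour of the kind of distributed-training experiment this enables, here's a minimal sketch of the standard data-parallel pattern in JAX (per-device gradients averaged with an all-reduce). This is my own toy example with a made-up linear model, not code from any particular library:

```python
import functools
import jax
import jax.numpy as jnp

# Hypothetical tiny linear model for illustration.
def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# pmap runs one copy of the step per device; pmean all-reduces the
# gradients across devices so every replica applies the same update.
@functools.partial(jax.pmap, axis_name="devices")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")
    return w - 0.1 * grads

n_dev = jax.local_device_count()
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (n_dev, 8, 4))  # one data shard per device
y = jnp.zeros((n_dev, 8, 1))
w = jnp.zeros((n_dev, 4, 1))               # params replicated per device
w = train_step(w, x, y)
```

On a single CPU this still runs (with an axis of size one), which is what makes a small library like this convenient for poking at the mechanics before paying for real hardware.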
It's also a great base for experimentation. If you have an idea for an architectural improvement, you can try it for $36 on the 20-layer nanochat setting, then for another $200 see how it holds up on the "full scale" nanochat.

Karpathy's notes on improving nanochat [1] are one of my favorite blog-like things to read. It's really neat to see how much influence each feature has, and how the scaling laws evolve as the architecture improves.

There's also modded-nanogpt, which turns the same kind of experimentation into a training speedrun (and maybe loses some rigor along the way) [2].

[1] https://github.com/karpathy/nanochat/blob/master/dev/LOG.md

[2] https://github.com/kellerjordan/modded-nanogpt
