A huge percentage of that is just drivers. The core kernel is probably what someone would care about here; moreover, much of that is architecture-specific. IIRC the x86 kernel is under 1M lines, though probably not under 1M tokens.
The AMDGPU driver alone is 5 million lines, out of about 37 million lines total. Over 10% of the codebase is a driver for a single vendor, although most of it is auto-generated per-product headers.
For some languages you can use the AST to identify smaller modular components that fit into the 1M window.
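A rough sketch of the AST idea, using Python's standard `ast` module on Python source purely as an illustration. For C code like the kernel you would need a different parser (libclang or tree-sitter, for example). The helper names `top_level_units`, `estimate_tokens`, and `pack_units` are made up for this example, and the 4-characters-per-token estimate is a crude assumption, not a real tokenizer.

```python
import ast


def top_level_units(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text of the node
            # (decorators are not included in the segment).
            text = ast.get_source_segment(source, node)
            units.append((node.name, text))
    return units


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer
    # if you need accurate counts.
    return max(1, len(text) // 4)


def pack_units(units: list[tuple[str, str]], budget: int) -> list[list[str]]:
    """Greedily group consecutive units so each group stays within budget."""
    chunks, current, used = [], [], 0
    for name, text in units:
        cost = estimate_tokens(text)
        # Start a new chunk when adding this unit would exceed the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

A unit that is larger than the budget on its own still ends up in a chunk by itself, so it would need further splitting (per method, say) before it fits.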
The first path would be the most interesting, especially if it can be automated.