To be clear, I use AI for editing all the time. Actually, diagrams are nice.
It's just that some pieces like this one look like copy-paste (I mean, empty lines before, code gets no special typography, etc.):
If we write the boundary information for a packed batch as:
B = { lengths, cu_seqlens, max_seqlen, mask structure }
then every transformer layer in that forward pass consumes the same B.
If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again.
In other words, the useful work is:
build B once, use it L times.
The wasteful version is:
build B + build B + ⋯ + build B (L times)