Right. If the dynamics of training are governed by renormalization group (RG) flow, then the best optimization path should remove redundant directions, as specified by the RG operator(s).
Yes, there must be a connection. While adaptive truncation may prove impractical, it should be possible to measure spectral statistics on sample data and then specify a different fixed truncation order per layer, per head, etc. The GitHub repository lists many other possible improvements: https://github.com/glassroom/sata_attention#proof-of-concept
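To make the idea concrete, here is a rough sketch of what "measure spectral statistics, then fix an order per head" could look like. Everything in it is an assumption for illustration: the function names, the `energy` threshold, the `max_order` cap, and above all the heuristic that maps a singular-value spectrum to a truncation order. None of it is taken from the repository.

```python
import numpy as np

def pick_truncation_order(samples, energy=0.99, max_order=8):
    """Hypothetical heuristic: choose a fixed order for one head.

    samples: (n_samples, dim) array of sample activations for that head.
    Returns the number of leading singular directions needed to capture
    `energy` of the total spectral energy, capped at `max_order`.
    """
    centered = samples - samples.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    total = np.sum(s ** 2)
    if total == 0.0:
        return 1  # degenerate case: no variance, use the smallest order
    cumulative = np.cumsum(s ** 2) / total
    # First index whose cumulative energy reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return max(1, min(k, len(s), max_order))

def orders_per_head(activations, energy=0.99, max_order=8):
    """activations: dict mapping (layer, head) -> (n_samples, dim) array."""
    return {
        key: pick_truncation_order(x, energy=energy, max_order=max_order)
        for key, x in activations.items()
    }
```

The orders are computed once, offline, on sample data, so nothing adaptive happens at inference time. Whether spectral energy is the right statistic to drive the order is exactly the open question.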