
Right. If the dynamics of training are governed by RG flow, then the best optimization path should remove redundant directions, as specified by the RG operator(s).

by fheinsen 7 hours ago

Yes, there must be a connection. While adaptive truncation may prove impractical, it should be possible to measure spectral statistics on sample data and specify a different fixed truncation order per layer, per head, etc. The GitHub repository lists many other possible improvements: https://github.com/glassroom/sata_attention#proof-of-concept