upvote
it seems to be something that’s similar to the class of optimizations associated with with linear or state space attention when things models often do is once they figure out an optimization like this they create a ratio between full resolution blocks and blocks that have the optimization implemented.
reply