That is an interesting idea. I suspect that if we relax the constraint that most of the layers in a loop stay in order, we run into a combinatorial explosion, since the number of possible orderings grows factorially with the number of layers.

But we could still try it out: randomize the order in which we call the transformer blocks and see whether it affects performance. If it doesn't, that’s extremely interesting.
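
For what it's worth, here is a rough sketch of what that experiment could look like. It is a toy under loud assumptions: `make_block`, `run`, and `order_sensitivity` are hypothetical names, each "block" is just a small random residual map rather than a real transformer block, and the metric is output distance from the in-order run rather than actual task performance. On a real model you would swap in the trained blocks and compare validation loss instead.

```python
import math
import random

def make_block(seed, dim):
    """Toy stand-in for a transformer block: residual update x + tanh(W x)."""
    rng = random.Random(seed)
    w = [[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        return [xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
                for xi, row in zip(x, w)]

    return block

def run(blocks, order, x):
    """Apply the blocks to x in the given order of indices."""
    for i in order:
        x = blocks[i](x)
    return x

def order_sensitivity(blocks, x, trials=20, seed=0):
    """Mean Euclidean gap between the in-order output and shuffled-order outputs."""
    rng = random.Random(seed)
    baseline = run(blocks, range(len(blocks)), x)
    gaps = []
    for _ in range(trials):
        order = list(range(len(blocks)))
        rng.shuffle(order)  # sample one random ordering; we cannot try them all
        gaps.append(math.dist(baseline, run(blocks, order, x)))
    return sum(gaps) / len(gaps)

blocks = [make_block(seed=i, dim=4) for i in range(6)]
x0 = [1.0, -0.5, 0.25, 2.0]
mean_gap = order_sensitivity(blocks, x0)
n_orders = math.factorial(len(blocks))  # size of the full search space
print(f"{n_orders} possible orders, mean output gap {mean_gap:.4f}")
```

Sampling a handful of random orders sidesteps the explosion: even 6 blocks already give 720 orderings, so exhaustive search is out for any realistic depth, but a few dozen samples should be enough to tell whether order matters at all.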
