Here is one: an adjustment to weight updates that makes it more likely for the weights to stay uniformly distributed.

The first graph reports ~257.5 teraflops for normally distributed data versus ~268 teraflops for uniformly distributed data, a gap of roughly 4%.

I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.

And for actual runs, pick the clock speed from a curve sampled just before the run.
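
To make the idea concrete, here is a minimal sketch of choosing the clock from a pre-run sweep. Everything in it is an assumption for illustration: the function name `pick_peak_clock`, the `(clock, throughput)` sample format, and the numbers, which are invented and not taken from any graph.

```python
def pick_peak_clock(samples):
    """Return the (clock, throughput) pair with the highest throughput.

    `samples` is a list of (clock, throughput) pairs gathered in a short
    sweep before the real run. Units are whatever the sweep recorded.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # max() with a key compares on throughput; ties keep the earliest sample.
    return max(samples, key=lambda s: s[1])


# Invented sweep values in arbitrary units, one sweep per data distribution.
sweeps = {
    "uniform": [(1.0, 90.0), (1.1, 100.0), (1.2, 95.0)],
    "normal": [(1.0, 88.0), (1.1, 93.0), (1.2, 86.0)],
}

for name, samples in sweeps.items():
    clock, throughput = pick_peak_clock(samples)
    print(f"{name}: run at clock {clock} (sampled throughput {throughput})")
```

In practice the sweep would have to use data with the same statistics as the real workload, since the point is that the best clock depends on the data.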

And there's at least one more level of inception at the data center level, where they use AI to optimize power usage (particularly by predictively controlling cooling, and adaptively rescheduling tasks).