What would you hope to achieve by making this case lazy? If you wanted these to run in parallel on a multi-GPU system, you would use the appropriate parallel interface.
.abs().max().item() of something that can be identified as definitionally zero. The parallel interface, which is async, is probably what you're looking for.
If evaluation is lazy, then the subtraction operator gets fed two unevaluated matrix multiplies.
If it's a dumb subtraction operator, this gives us no benefit. Eventually it evaluates both and then subtracts. And it has some extra overhead like you said.
But if it's a smart subtraction operator, it can realize that both operands are the same expression, and then it can return all 0s without evaluating anything.
And even better than just skipping the matrix math, "all 0s" can be a stub object that takes O(1) time to set up. And then .abs().max() will be instant too.
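To make that concrete, here is a rough sketch in plain Python of what such a "smart" lazy subtraction could look like. None of this is a real library's API: the class names (Lazy, Leaf, MatMul, Sub, Zeros) and the same_as check are made up for illustration, matrices are plain lists of lists, and shape tracking is left out. It also assumes the inputs are not mutated between building the expression and evaluating it.

```python
class Lazy:
    """Base class for unevaluated expressions (hypothetical, for illustration)."""

    def __matmul__(self, other):
        # Building the node does no arithmetic.
        return MatMul(self, other)

    def __sub__(self, other):
        # "Smart" subtraction: structurally identical operands cancel
        # without either side ever being evaluated.
        if self.same_as(other):
            return Zeros()
        return Sub(self, other)

    def abs(self):
        # Generic fallback: this is the point where evaluation is forced.
        return Leaf([[abs(v) for v in row] for row in self.evaluate()])

    def max(self):
        return max(v for row in self.evaluate() for v in row)


class Leaf(Lazy):
    """A concrete matrix stored as a list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def same_as(self, other):
        # Identity check: the same input object, not merely equal values.
        return self is other

    def evaluate(self):
        return self.rows


class MatMul(Lazy):
    evaluations = 0  # counts how many matrix products were actually computed

    def __init__(self, a, b):
        self.a, self.b = a, b

    def same_as(self, other):
        return (isinstance(other, MatMul)
                and self.a.same_as(other.a)
                and self.b.same_as(other.b))

    def evaluate(self):
        MatMul.evaluations += 1
        a, b = self.a.evaluate(), self.b.evaluate()
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


class Sub(Lazy):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def same_as(self, other):
        return (isinstance(other, Sub)
                and self.a.same_as(other.a)
                and self.b.same_as(other.b))

    def evaluate(self):
        a, b = self.a.evaluate(), self.b.evaluate()
        return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class Zeros(Lazy):
    """O(1) stub standing in for an all-zero result."""

    def same_as(self, other):
        return isinstance(other, Zeros)

    def abs(self):
        return self  # |0| is still all zeros, so nothing to do

    def max(self):
        return 0.0


a = Leaf([[1.0, 2.0], [3.0, 4.0]])
b = Leaf([[5.0, 6.0], [7.0, 8.0]])

diff = (a @ b) - (a @ b)
print(diff.abs().max(), MatMul.evaluations)  # prints: 0.0 0
```

With the "dumb" path, e.g. (a @ b) - (b @ a), the operands don't match, so both products get evaluated when .abs() is called and you pay the full cost plus the bookkeeping.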