I've had the pleasure of working with some truly fast pieces of code written by experts. It's always both. You have to have a good sense of what's generally fast and what's not in order to design a system that doesn't contain intractable bottlenecks. And once you have a good design you can profile and optimize the remaining constraints.

But if, for example, you want to do fast math, you really need to design your pipeline around cache efficiency from the beginning; it's very hard to retrofit. Reducing memory allocations to make parallel algorithms faster, on the other hand, is something you can usually do after profiling.
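
To make the "hard to retrofit" point concrete, here is a minimal sketch of the same data in two layouts. The particle fields and every name in it (`ParticleAoS`, `ParticlesSoA`, `total_mass_aos`, `total_mass_soa`) are made up for illustration. In CPython the interpreter overhead will mostly hide the cache effect, so treat it as a picture of the design decision rather than a benchmark.

```python
from array import array
from dataclasses import dataclass


@dataclass
class ParticleAoS:
    """Array-of-structs element: each particle is a separate heap object."""
    x: float
    y: float
    mass: float


class ParticlesSoA:
    """Struct-of-arrays: one contiguous buffer of C doubles per field."""

    def __init__(self, xs, ys, masses):
        self.x = array("d", xs)
        self.y = array("d", ys)
        self.mass = array("d", masses)


def total_mass_aos(particles):
    # Visits every particle object just to read one field from each.
    return sum(p.mass for p in particles)


def total_mass_soa(particles):
    # Streams through a single contiguous buffer of masses.
    return sum(particles.mass)
```

Note that the layout shows up in every function signature: code written against a list of `ParticleAoS` objects has to be rewritten to work on `ParticlesSoA`, which is roughly why this kind of choice is easier to make up front than after the fact.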

Yeah, the latency numbers put a ceiling on how fast your algorithm can be. The actual performance depends on the implementation, code generation, runtime hazards, small dependencies one may have overlooked, etc.

I mean... you should always design with speed in mind (in that Jeff Dean sense :) ), but what 'premature optimization' refers to is more like localized speed optimizations/hacks. Don't do those until a) you know you'll need them and b) you know where they will help.
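
For the "know where it will help" part, a rough sketch of measuring before touching anything, using the standard-library `cProfile` and `pstats` modules. The workload (`slow_step`, `fast_step`, `pipeline`) and the `profile_report` helper are hypothetical stand-ins, not anything from a real codebase.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Deliberately quadratic: the hot spot in this toy pipeline.
    return sum(i * j for i in range(n) for j in range(n))


def fast_step(n):
    # Linear and cheap: not worth hand-tuning.
    return sum(range(n))


def pipeline(n):
    return slow_step(n) + fast_step(n)


def profile_report(func, *args, top=10):
    """Run func under cProfile; return (result, text report by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Calling `profile_report(pipeline, 300)` and printing the second element of the returned tuple should show `slow_step` near the top of the cumulative-time listing, which is the signal that a local hack there might pay off and one in `fast_step` would not.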