Single socket doesn't necessarily get you away from NUMA anyway. AMD server sockets are 4-way NUMA (you can set them to interleave memory, but you could do better with NUMA-aware software), and I think Intel is doing NUMA within a server socket as well.
A lot of people like to take one big machine and partition it into several smaller virtual machines. In that case, it shouldn't be too hard to partition VMs into NUMA zones? Only VMs that are too big to fit in one zone (or that need to be repacked into a different zone) have to worry about it.
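To make the packing idea concrete, here is a rough sketch, not how any particular hypervisor does it. It assumes each NUMA zone has the same number of cores and amount of memory, and uses a simple first-fit, largest-first heuristic. The names (`place_vms`, `Node`) and the example sizes are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Remaining capacity of one NUMA zone (hypothetical model)."""
    node_id: int
    free_cores: int
    free_mem_gb: int
    vms: list = field(default_factory=list)


def place_vms(vms, num_nodes, cores_per_node, mem_per_node_gb):
    """Pack VMs into NUMA zones, first-fit, largest VM first.

    vms maps a VM name to a (cores, mem_gb) tuple.
    Returns (placement, spanning): placement maps VM name to zone id,
    spanning lists the VMs that did not fit in any single zone.
    """
    nodes = [Node(i, cores_per_node, mem_per_node_gb) for i in range(num_nodes)]
    placement = {}
    spanning = []
    # Place the biggest VMs first so small ones fill the leftover gaps.
    ordered = sorted(vms.items(), key=lambda kv: kv[1], reverse=True)
    for name, (cores, mem_gb) in ordered:
        for node in nodes:
            if cores <= node.free_cores and mem_gb <= node.free_mem_gb:
                node.free_cores -= cores
                node.free_mem_gb -= mem_gb
                node.vms.append(name)
                placement[name] = node.node_id
                break
        else:
            # No single zone can hold this VM; it has to deal with NUMA itself.
            spanning.append(name)
    return placement, spanning


# Assumed example: 4 zones of 16 cores / 64 GB each.
example_vms = {
    "a": (8, 32),
    "b": (8, 32),
    "c": (16, 64),
    "d": (24, 96),  # larger than any one zone
    "e": (4, 16),
}
placement, spanning = place_vms(example_vms, num_nodes=4,
                                cores_per_node=16, mem_per_node_gb=64)
```

In this example only `d` ends up in `spanning`; everything else lands inside a single zone and never has to think about NUMA.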
I would rather it be the other way around: never allow a single process to do anything cross-NUMA unless it asks for that, and maybe even be stricter and require a process to opt into using anything but node0. These machines are big enough that you're not going to saturate node0 with random tasks, and you're only going to saturate the whole machine with a more deliberate workload.
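As a sketch of what that opt-in policy could look like (purely illustrative, not an existing scheduler feature): the function below only computes which CPUs a process would be allowed on. The `allowed_cpus` name and the 4-node, 4-CPUs-per-node topology are assumptions for the example.

```python
def allowed_cpus(topology, opted_in_nodes=None):
    """Return the CPU ids a process may run on under an opt-in policy.

    topology maps a NUMA node id to the set of CPU ids on that node.
    A process that asks for nothing is confined to node 0; a process
    that opts in gets exactly the nodes it asked for.
    """
    if not opted_in_nodes:
        # Default: random tasks stay on node 0.
        return set(topology[0])
    unknown = set(opted_in_nodes) - set(topology)
    if unknown:
        raise ValueError(f"unknown NUMA nodes: {sorted(unknown)}")
    cpus = set()
    for node in opted_in_nodes:
        cpus |= topology[node]
    return cpus


# Assumed topology: 4 nodes with 4 CPUs each (CPU ids 0..15).
topology = {n: set(range(n * 4, n * 4 + 4)) for n in range(4)}

default_mask = allowed_cpus(topology)          # node 0 only
wide_mask = allowed_cpus(topology, [1, 2])     # explicit opt-in to nodes 1 and 2
```

On Linux the resulting set could be handed to `os.sched_setaffinity(pid, mask)`, but that only pins CPUs. Keeping memory on the same node needs a separate memory policy, for example launching under `numactl --cpunodebind=0 --membind=0`.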
I think you can over-analyze this stuff and lose your sanity. On these multicore systems there are also hot cores in the center of the mesh and cold ones at the edges and theoretically you could be doing temperature-aware scheduling, gaining a bit more efficiency in doing so. But it's just easier to adopt the black box model of spherical frictionless CPUs.