Another bit of evidence for that: while they merged all the audio chips into a single S-APU chip, and both PPUs and the CPU into the 1CHIP, they never took the final step of merging the APU, PPU and CPU into a single chip. Nor did they ever shrink the PCB to move the two chips closer together.
------------
My other theory is that if the audio clock were derived from the video clock, it would have a different sample rate on NTSC and PAL consoles. By giving it an independent crystal, they could make sure both models have the same audio sample rate.
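To put rough numbers on that, here is a quick back-of-the-envelope sketch. The master clock figures are the commonly cited approximate values for NTSC and PAL consoles, and 24.576 MHz / 768 = 32 kHz is the usual figure given for the audio side. The shared divider is purely hypothetical (no such circuit exists as far as I know); it is just the integer that would get an NTSC-derived clock closest to 32 kHz.

```python
# Commonly cited approximate clock figures, in Hz.
NTSC_MASTER_HZ = 21_477_272   # about 6 x the NTSC colorburst frequency
PAL_MASTER_HZ = 21_281_370    # about 4.8 x the PAL colorburst frequency
APU_CRYSTAL_HZ = 24_576_000   # the independent audio clock
APU_DIVIDER = 768             # 24.576 MHz / 768 = 32 kHz

TARGET_RATE_HZ = 32_000


def sample_rate(clock_hz, divider):
    """Sample rate produced by dividing a clock by a fixed integer."""
    return clock_hz / divider


# HYPOTHETICAL: the single integer divider that would bring an
# NTSC-derived audio clock closest to 32 kHz. Not a real design.
HYPOTHETICAL_DIVIDER = round(NTSC_MASTER_HZ / TARGET_RATE_HZ)

dedicated = sample_rate(APU_CRYSTAL_HZ, APU_DIVIDER)
ntsc_derived = sample_rate(NTSC_MASTER_HZ, HYPOTHETICAL_DIVIDER)
pal_derived = sample_rate(PAL_MASTER_HZ, HYPOTHETICAL_DIVIDER)

print(f"dedicated crystal: {dedicated:.2f} Hz on both regions")
print(f"derived, NTSC:     {ntsc_derived:.2f} Hz")
print(f"derived, PAL:      {pal_derived:.2f} Hz")
print(f"NTSC/PAL mismatch: {ntsc_derived - pal_derived:.2f} Hz")
```

With the same divider on both regions, the derived rates end up a few hundred Hz apart, so the same sample data would play back at a slightly different pitch and speed on a PAL console. Using a different divider per region would narrow the gap, but an integer divider can't land both on exactly 32 kHz.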
It's probably a combination of many of these small factors that prevented them from ever going to the effort of making it work from a single crystal.