Why can't Waymo ALSO develop the same smarts and just also solve the sensor fusion issue such that they can use the right set of sensors in the right environmental conditions, and then leapfrog Tesla's capabilities?
Because this part is really hard, and that's why Tesla abandoned the fusion approach. You cannot possibly foresee all the conditions in which LIDAR or any active sensor will malfunction/return wrong data/return data that's only slightly off for that ONE specific time. And even if it doesn't, you need to trust it to not return noise. And when it does return noise, how do you classify it as noise?
Cameras are passive sensors: they take whatever light comes in and turn it into an image. If the camera is capturing shapes that make sense to the neural nets, it's working. If the image is all black/white/red, or no shapes can be made out, the camera is not working: exclude it from the currently used set of sensors, or weigh it less when making decisions, because it's returning no signal (and yes, neural nets have their own set of problems).
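To make the "exclude it or weigh it less" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names `camera_weight` and `fuse`, the `min_std` threshold, and the use of pixel standard deviation as a stand-in for "can the net see any shapes". A real system would presumably use something far richer than this.

```python
from statistics import pstdev

def camera_weight(pixels, min_std=5.0):
    """Trust weight in [0, 1] for one frame (hypothetical heuristic).

    pixels: flat list of grayscale values (0-255).
    A frame that is all one color has near-zero variation, so it gets
    a weight near 0; a frame with real structure saturates at 1.
    """
    return min(pstdev(pixels) / min_std, 1.0)

def fuse(estimates, weights):
    """Weighted average of per-camera estimates (e.g. distance to an object)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no camera is returning a usable signal")
    return sum(e * w for e, w in zip(estimates, weights)) / total

# One healthy camera, one that sees only black.
healthy = [i * 16 for i in range(16)]   # varied pixel values
blacked_out = [0] * 16                  # no signal at all

weights = [camera_weight(healthy), camera_weight(blacked_out)]
fused = fuse([10.0, 50.0], weights)     # the dead camera's 50.0 is ignored
```

The point of the sketch is only that a dead camera is easy to spot from its own output, which is the property the comment is leaning on.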
EDIT: cameras also provide more continuous context: if 1 pixel is off, say clearly bright red in a mostly-green scene where no poles can be identified, the neural net will average it out and discard it as noise. If 1 point says "object" in LIDAR, do you trust it to be correct? Perhaps the ray just hit a bird or a fly, but you only see a point; it's a lossy summary of the information you need.
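A toy illustration of the "one bad pixel gets discarded" argument, sketched in Python. Note the swap: this uses a plain 3x3 median filter, not a neural net, and `median_filter_3x3` is a hypothetical helper written for this example. It only shows why dense neighboring context makes a lone outlier easy to reject.

```python
from statistics import median

def median_filter_3x3(img):
    """Replace each pixel with the median of its 3x3 neighborhood.

    img: 2D list of numbers. Edge pixels use whatever neighbors exist.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            neighborhood = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = median(neighborhood)
    return out

# A uniform 5x5 scene with a single wildly wrong pixel in the middle.
scene = [[10] * 5 for _ in range(5)]
scene[2][2] = 255

cleaned = median_filter_3x3(scene)  # the outlier is voted down by its neighbors
```

A single isolated LIDAR return has no such ring of neighbors to vote it down, which is the asymmetry the comment is pointing at.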
As is, Waymo's playing it smarter than Cruise did, but they're not all in on AI yet. So I don't expect them to "leapfrog Tesla" in that dimension - and it's the key dimension to self-driving.
Tesla trains its models on actual drivers, purely based on vision (input) and actuators (output): brake, steering, accelerator.
Human output is based on what they and the camera see. So it's a 1:1 match.
If Waymo were to do that, it would muddle the training set. The Lidar input may override the camera input.
I always struggled when Musk said Lidar would make things ambiguous. It didn't make any sense to me why having a secondary fallback sensor would mess things up. But if you put it in the training data context, it absolutely makes sense.
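For anyone unfamiliar with the setup being described (learn a mapping from what the camera sees to what the driver did), here is a deliberately tiny sketch in Python. It is a toy under loud assumptions: a single made-up vision feature (`lane_offset`), a single output (steering), and a one-parameter least-squares fit in place of a neural net. Nothing here is Tesla's actual pipeline, which the thread doesn't describe beyond the input/output framing.

```python
def fit_steering_gain(lane_offsets, steering_angles):
    """Least-squares fit of steering = gain * lane_offset (no intercept).

    lane_offsets: hypothetical per-frame feature extracted from vision.
    steering_angles: what the human driver actually did on that frame.
    """
    num = sum(x * y for x, y in zip(lane_offsets, steering_angles))
    den = sum(x * x for x in lane_offsets)
    if den == 0:
        raise ValueError("need at least one nonzero lane offset")
    return num / den

def predict_steering(gain, lane_offset):
    """Imitate the driver: apply the learned gain to a new observation."""
    return gain * lane_offset

# Logged (vision feature, human action) pairs: drivers steer back toward center.
offsets = [-1.0, 0.0, 1.0, 2.0]
steering = [0.5, 0.0, -0.5, -1.0]

gain = fit_steering_gain(offsets, steering)
command = predict_steering(gain, 2.0)
```

The "1:1 match" claim is that the model's input is the same signal the human was reacting to; whether adding a second input the human never had really hurts the fit is exactly what the replies dispute.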
Just because the human in the scenario only took vision as input, why does that matter to the training data and the model? The actions are the same.
To put it another way, what about all the cultural context the human had, or the sounds, smells, past experiences at the same intersection, etc? Even Tesla can't record this, but I'm not sure that matters.
I'm working on a similar problem in computer vision, and we're quickly approaching the point where our pure-vision work is better than our Lidar-supported track, because we've had to deal with the constraints instead of having a crutch to lean on.
Tesla wants to make EVs that look like normal cars (Cybertruck being the oddball here, admittedly).
You can have even more intelligence with both.