You're listening to the road and car sounds around you. You're feeling vibration from the road. You're feeling feedback through the steering wheel. You're using a combination of monocular and binocular depth perception - plus, your eyes are not fixed-focal-length "cameras". You're moving your head to change the perspective you see the road from. Your inner ear is telling you about your acceleration and orientation.
* someone parking carefully misjudges depth and bumps an object
* a person driving at night fails to perceive a poorly lit feature of the road/markings/obstacles
* a person driving is suddenly blinded by a bright object (the sun, bright lights at night)
* a person pulling out into traffic misjudges depth and therefore misjudges the speed of approaching traffic
* people can only focus their eyes at one distance at a time, and it takes time to refocus at a different distance. It is neither unsafe nor unexpected for humans to check their instruments while driving, but it can take the human eye hundreds of milliseconds to focus under normal circumstances. If you look down, focus, look back up, and focus again, as quickly as you can at highway speeds, you will have travelled quite a long distance.
These types of failures can happen not as a result of poor decision making, but of poor perception.
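To put rough numbers on the refocusing point above, here's a back-of-the-envelope sketch. The speed and the per-refocus time are my own illustrative assumptions (the comment only says "hundreds of milliseconds" and "highway speeds"), and it ignores the time spent moving your eyes and actually reading the instrument.

```python
def distance_travelled_m(speed_kmh: float, seconds: float) -> float:
    """Distance covered at a constant speed, in metres."""
    return speed_kmh / 3.6 * seconds  # km/h -> m/s, then multiply by time


# Illustrative assumptions, not measurements:
SPEED_KMH = 110.0  # assumed highway speed
REFOCUS_S = 0.3    # assumed time per refocus ("hundreds of milliseconds")
N_REFOCUS = 2      # once down to the instruments, once back up to the road

refocus_time_s = REFOCUS_S * N_REFOCUS
print(f"{distance_travelled_m(SPEED_KMH, refocus_time_s):.1f} m travelled while refocusing")
```

With those assumed numbers it comes out to a bit over 18 m, so several car lengths go by just on focusing.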
However, there is also a lot of interaction between our perceptual system and cognition. Just for depth perception, we're doing a lot of temporal analysis. We track moving objects and infer distance from assumptions about scale and object permanence. We don't just repeatedly make depth maps from 2D imagery.
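As a toy illustration of "inferring distance from assumptions about scale" plus tracking over time, here's a minimal sketch using the pinhole relation distance = focal_length * real_size / image_size. The focal length, the assumed car width, the frame interval, and the pixel widths are all made-up values for illustration, and this is of course nothing like what the brain actually does.

```python
def distance_from_size(focal_px: float, real_width_m: float, pixel_width: float) -> float:
    """Pinhole-camera distance estimate for an object of assumed real width."""
    return focal_px * real_width_m / pixel_width


# Hypothetical values, chosen only to make the arithmetic easy to follow:
FOCAL_PX = 1000.0    # assumed focal length in pixels
CAR_WIDTH_M = 1.8    # assumed width of a typical car (the "scale" prior)
FRAME_DT_S = 0.1     # assumed time between observations

# Apparent width of the same tracked car in three consecutive observations.
widths_px = [36.0, 40.0, 45.0]

distances_m = [distance_from_size(FOCAL_PX, CAR_WIDTH_M, w) for w in widths_px]

# Closing speed from consecutive distance estimates (positive = approaching).
closing_mps = [(a - b) / FRAME_DT_S for a, b in zip(distances_m, distances_m[1:])]

for d in distances_m:
    print(f"estimated distance: {d:.1f} m")
for v in closing_mps:
    print(f"estimated closing speed: {v:.1f} m/s")
```

The point is that the estimate is only as good as the size assumption: if the "car" is really a truck, every distance is off by the same factor.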
The brute-force approach is something like training visual language models (VLMs). E.g. you could train on lots of movies and be able to predict "what happens next" in the imaging world.
But, compared to LLMs, there is a bigger gap between the model and the application domain with VLMs. It may seem like LLMs are being applied to lots of domains, but most are just tiny variations on the same task of "writing what comes next", which is exactly what they were trained on. Unfortunately, driving is not "painting what comes next" in the same way as all these LLM writing hacks. There is still a big gap between that predictive layer, planning, and executing. Our giant corpus of movies does not really provide the ready-made training data to go after those bigger problems.
We often greatly underestimate the role of our ears relative to vision. As my film director friend says, 80% of the impact in a movie is in the sound.
https://waymo.com/blog/2024/08/meet-the-6th-generation-waymo...
This company claims their LIDAR works conservatively at 250 m, and up to 750 m depending on reflectivity:
https://www.cepton.com/driving-lidar/reading-lidar-specs-par...
Sufficient to build something close to human performance. But self-driving cars will be held to a much higher standard by society, a standard only achievable by having sensors like LiDAR.
Whether that's worth completely throwing away LiDAR is a different question, but your argument is just obviously false.
Deciding to crash faster, or "tell human to take over" really fast is NOT better.
It's not only failing, it's causing false positives.
They also have several cameras all around providing constant 360° vision.
Now you might say "use a depth model to estimate metric depth". Spend 5 minutes thinking about why a magic math box that pretends to recover real depth from a single 2D image is a very, very sketchy proposition when you need it to be correct for emergency braking, rather than for some TikTok bokeh filter, and you will see that this also doesn't get you far.
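To make the single-image problem concrete, here's a minimal sketch of the scale ambiguity under an idealized pinhole camera. The focal length and the 3D points are hypothetical values I picked for illustration: a small nearby object and one four times larger and four times farther away land on exactly the same pixels.

```python
import math


def project(point, focal_px=1000.0):
    """Project a 3D point (x, y, z) in metres onto the image plane, in pixels."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


# Hypothetical scene: the corner of a small object 10 m ahead...
near_small = (0.5, 0.25, 10.0)
# ...and the same corner of an object 4x larger and 4x farther away (40 m).
far_large = tuple(4.0 * c for c in near_small)

u1, v1 = project(near_small)
u2, v2 = project(far_large)

# The two projections coincide, so this one image cannot tell 10 m from 40 m.
same_pixel = math.isclose(u1, u2) and math.isclose(v1, v2)
print(same_pixel)
```

A depth model has to resolve that ambiguity from learned priors about what things usually look like, which is fine for a bokeh filter and much less comfortable for braking decisions.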
The reports that Tesla submits on Austin Robotaxis include several of them hitting fixed objects. This is the same behavior, Teslas not seeing objects, that has been reported for prior versions of their software, including in the incident for which a $250M verdict against them was reaffirmed this past week. That this is occurring in an extensively mapped environment and with a safety driver on board leads me to the opposite conclusion from the one you have reached.