I think I'm misunderstanding - they're converting video into their representation which was bootstrapped with LIDAR, video and other sensors. I feel you're alluding to Tesla, but Tesla could never have this outcome since they never had a LIDAR phase.

(edit - I'm referring to deployed Tesla vehicles, I don't know what their research fleet comprises, but other commenters explain that this fleet does collect LIDAR)

reply
They can and they do.

https://youtu.be/LFh9GAzHg1c?t=872

They've also built it into a full neural simulator.

https://youtu.be/LFh9GAzHg1c?t=1063

I think what we are seeing is that they both converged on the correct approach, one of them decided to talk about it, and it triggered disclosure all around since nobody wants to be seen as lagging.

reply
I watched that video around both timestamps and didn't see or hear any mention of LIDAR, only of video.
reply
Exactly: they convert video into a world model representation suitable for 3D exploration and simulation without using LIDAR (except perhaps for scale calibration).
reply
My mistake - I misinterpreted your comment, but after re-reading more carefully, it's clear that the video confirms exactly what you said.
reply
tesla is not impressive, I would never put my child in one
reply
Tesla does collect LIDAR data (people have seen them doing it, it's just not on all of the cars) and they do generate depth maps from sensor data, but from the examples I've seen it is much lower resolution than these Waymo examples.
reply
Tesla does it to build high-definition maps of the areas where their cars try to operate.
reply
Tesla uses lidar to train their models to generate depth data out of camera input. I don’t think they have any high definition maps.
reply
The purpose of lidar is to provide error correction when you need it most, i.e. when camera accuracy degrades.

Humans do this, just in the sense of depth perception with both eyes.

reply
Human depth perception uses stereo out to only about 2 or 3 meters, after which the distance between your eyes is not a useful baseline. Beyond 3m we use context clues and depth from motion when available.
reply
Thanks, saved some work.

And I'll add that in practice it is not even that much unless you're doing some serious training, like a professional athlete. For most tasks, accurate depth perception from stereo fades beyond roughly arm's length.

reply
ok, but a car is a few meters wide, isn't that enough to give driving depth perception similar to humans?
reply
The depths you are trying to estimate are to the other cars, people, turnings, obstacles, etc. Could be 100m away or more on the highway.
reply
ok, but the point being made is based on human depth perception, while a car's baseline is limited only by the width of the vehicle, so there's missing information if you're trying to figure out whether a car can use cameras to do what human eyes/brains do.
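
Rough numbers as a sketch, using the usual stereo relation z = f*B/d, so a disparity error of dd pixels gives a depth error of roughly z^2 * dd / (f*B). The baseline, focal length, and noise figures below are assumptions for illustration, not real specs of any car or camera:

    # Back-of-the-envelope stereo depth error. All numbers are assumptions.
    def depth_error(z_m, baseline_m, focal_px=1000, disparity_noise_px=0.25):
        """Approximate depth uncertainty at range z_m for a stereo pair."""
        return (z_m ** 2) * disparity_noise_px / (focal_px * baseline_m)

    for baseline_m, label in [(0.065, "human eyes"), (1.5, "car-width cameras")]:
        for z in (10, 50, 100):
            err = depth_error(z, baseline_m)
            print(f"{label:18s} z={z:3d} m -> ~{err:5.1f} m depth error")

With those made-up numbers, a car-width baseline still gives meter-level depth at 100 m while the eye baseline is hopeless out there, which is roughly the disagreement in this subthread.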
reply
(Always worth noting, human depth perception is not just based on stereoscopic vision, but also with focal distance, which is why so many people get simulator sickness from stereoscopic 3d VR)
reply
In fact there are even more depth perception cues. Maybe the most obvious is size (retinal versus assumed real-world size). Further examples include motion parallax, linear perspective, occlusion, shadows, and light gradients.

Here is a study on how these cues rank when it comes to (hand) reaching tasks in VR: https://pubmed.ncbi.nlm.nih.gov/29293512/

reply
> Always worth noting, human depth perception is not just based on stereoscopic vision, but also with focal distance

Also subtle head and eye movements, which is something a lot of people like to ignore when discussing camera-based autonomy. Your eyes are always moving around which changes the perspective and gives a much better view of depth as we observe parallax effects. If you need a better view in a given direction you can turn or move your head. Fixed cameras mounted to a car's windshield can't do either of those things, so you need many more of them at higher resolutions to even come close to the amount of data the human eye can gather.

reply
I keep wondering about the focal depth problem. It feels potentially solvable, but I have no idea how. I keep wondering if it could be as simple as a Magic Eye Autostereogram sort of thing, but I don't think that's it.

There have been a few attempts at solving this, but I assume that for some optical reason actual lenses need to be adjusted and it can't just be a change in the image? Meta had "Varifocal HMDs" being shown off for a bit, which I think literally moved the screen back and forth. There were a couple of "Multifocal" attempts with multiple stacked displays, but that seemed crazy. Computer Generated Holography sounded very promising, but I don't know if a good one has ever been built. A startup called Creal claimed to be able to use "digital light fields", which basically project stuff right onto the retina, which sounds kinda hogwashy to me but maybe it works?

reply
Actually the reason people experience simulator sickness in VR is not focal depth but the dissonance between what their eyes are telling them (vection) and what their inner ear and tactile senses are telling them.

It's possible they get headaches from the focal length issues but that's different.

reply
My understanding is that contextual clues are a big part of it too. We see the pitcher wind up and throw a baseball at us more than we stereoscopically track its progress from the mound to the plate.

More subtly, a lot of depth information comes from how big we expect things to be, since everyday life is full of things we intuitively know the sizes of: frames of reference in the form of people, vehicles, furniture, etc. This is why the forced perspective of theme park castles is so effective: our brains want to see those upper windows as full sized, so we see the thing as 2-3x bigger than it actually is. And in the other direction, a lot of buildings in Las Vegas are further away than they look, because hotels like the Bellagio have large black boxes on them that each group a 2x2 block of the actual room windows.

reply
Another way humans perceive depth is by moving our heads and perceiving parallax.
reply
How expensive is their lidar system?
reply
Hesai has driven the cost into the $200 to $400 range now. That said, I don't know what the units needed for driving cost. Either way, we've gone from thousands or tens of thousands of dollars down into the hundreds.
reply
Looking at prices, I think you are wrong and automotive Lidar is still in the 4 to 5 figure range. HESAI might ship Lidar units that cheap, but automotive grade still seems quite expensive: https://www.cratustech.com/shop/lidar/
reply
Those are single-unit prices. The AT128 for instance, which is listed at $6250 there and is widely used by several Chinese car companies, was around $900 per unit in high volume, and over time they lowered that to around $400.

The next generation of that, the ATX, is the one they have said would be half that cost. According to regulatory filings in China, BYD will be using it on entry-level $10k cars.

Hesai got the price down for their new generation by several optimizations. They are using their own designs for lasers, receivers, and driver chips which reduced component counts and material costs. They have stepped up production to 1.5 million units a year giving them mass production efficiencies.

reply
That model only has a 120 degree field of view so you'd need 3-4 of them per car (plus others for blind spots, they sell units for that too). That puts the total system cost in the low thousands, not the 200 to 400 stated by GP. I'm not saying it hasn't gotten cheaper or won't keep getting cheaper, it just doesn't seem that cheap yet.
reply
Waymo does their LiDAR in-house, so unfortunately we don’t know the specs or the cost
reply
Otto and Uber and the CEO of https://pronto.ai do though (tongue-in-cheek)

> Then, in December 2016, Waymo received evidence suggesting that Otto and Uber were actually using Waymo’s trade secrets and patented LiDAR designs. On December 13, Waymo received an email from one of its LiDAR-component vendors. The email, which a Waymo employee was copied on, was titled OTTO FILES and its recipients included an email alias indicating that the thread was a discussion among members of the vendor’s “Uber” team. Attached to the email was a machine drawing of what purported to be an Otto circuit board (the “Replicated Board”) that bore a striking resemblance to – and shared several unique characteristics with – Waymo’s highly confidential current-generation LiDAR circuit board, the design of which had been downloaded by Mr. Levandowski before his resignation.

The presiding judge, Alsup, said, "this is the biggest trade secret crime I have ever seen. This was not small. This was massive in scale."

(Pronto connection: Levandowski got pardoned by Trump and is CEO of Pronto autonomous vehicles.)

https://arstechnica.com/tech-policy/2017/02/waymo-googles-se...

reply
We know Waymo reduced their LiDAR price from $75,000 to ~$7500 back in 2017 when they started designing them in-house: https://arstechnica.com/cars/2017/01/googles-waymo-invests-i...

That was 2 generations of hardware ago (4th gen Chrysler Pacificas). They are about to introduce 6th gen hardware. It's a safe bet that it's much cheaper now, given how mass produced LiDARs cost ~$200.

reply
Less than the lives it saves.
reply
Cheaper every year.
reply
Exactly.

Tesla told us their strategy was vertical integration and scale to drive down all input costs in manufacturing these vehicles...

...oh, except lidar, that's going to be expensive forever, for some reason?

reply
> Humans do this, just in the sense of depth perception with both eyes.

Humans do this with vibes and instincts, not just depth perception. When I can't see the lines on the road because there's too much snow, I can still interpret where they would be based on my familiarity with the roads and my implicit knowledge of how roads work, for example. We do similar things for heavy rain or fog, although sometimes those situations truly necessitate pulling over, or slowing down and turning on your 4-ways - lidar might genuinely give an advantage there.

reply
That’s the purpose of the neural networks
reply
Yes and no - vibes and instincts aren't just thought, they're real senses. Humans have a lot of senses, dozens of them, including balance, pain, the sense of the passage of time, and body orientation. Not all of these senses are represented in autonomous vehicles, and it's not really clear how the brain mashes together all these senses to make decisions.
reply
deleted
reply
That is still important for safety reasons in case someone uses a LiDAR jamming system to try to force you into an accident.
reply
It’s way easier to “jam” a camera with bright light than a lidar, which uses both narrow band optical filters and pulsed signals with filters to detect that temporal sequence. If I were an adversary, going after cameras is way way easier.
reply
Oh yeah, point a q-beam at a Tesla at night, lol. Blindness!
reply
If somebody wants to hurt you while you are traveling in a car, there are simpler ways.
reply
I think there are two steps here: converting video into sensor data, and using that sensor data to drive. Only the second step will be handled by cars on the road; the first one is purely for training.
reply
Autonomous cars need to be significantly better than humans to be fully accepted, especially when an accident does happen. Hence limiting yourself to only cameras is futile.
reply
They may be trying to suggest that, but that claim does not follow from the quoted statement.
reply
I've always wondered... if Lidar + Cameras is always making the right decision, you should theoretically be able to take the output of the Lidar + Cameras model and use it as training data for a Camera only model.
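
A minimal sketch of that idea (teacher-student distillation with toy tensors; the shapes, names, and training details are made up and don't reflect what Waymo or Tesla actually do):

    import torch
    import torch.nn as nn

    # Toy stand-ins so the sketch runs end to end; real perception stacks are
    # obviously far bigger. "Features" here are just random vectors.
    teacher = nn.Linear(16 + 8, 4)   # pretend: camera + lidar features -> output
    student = nn.Linear(16, 4)       # camera features only
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    teacher.eval()
    for _ in range(100):
        cam = torch.randn(32, 16)     # fake camera features
        lidar = torch.randn(32, 8)    # fake lidar features
        with torch.no_grad():
            target = teacher(torch.cat([cam, lidar], dim=1))  # teacher's output as the label
        loss = nn.functional.mse_loss(student(cam), target)   # distill into the camera-only net
        opt.zero_grad()
        loss.backward()
        opt.step()

The catch, as others note below, is that the student can only ever approximate the teacher on inputs where the cameras carry enough information in the first place.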
reply
That's exactly what Tesla is doing with their validation vehicles, the ones with Lidar towers on top. They establish the "ground truth" from Lidar and use that to train and/or test the vision model. Presumably more "test", since they've most often been seen in Robotaxi service expansion areas shortly before fleet deployment.
reply
Is that exactly true though? Can you give a reference for that?
reply
I don't have a specific source, no. I think it was mentioned in one of their presentations a few years back that they use various techniques to get "ground truth" for vision training; among those were time series consistency (depth change over time should be continuous, etc.) and iirc also "external" sources for depth data, like LiDAR. And their validation cars equipped with LiDAR towers are definitely being seen everywhere they are rolling out their Robotaxi services.
reply
> are definitely being seen everywhere they are rolling out their Robotaxi services

So...nowhere?

reply
deleted
reply
> you should theoretically be able to take the output of the Lidar + Cameras model and use it as training data for a Camera only model.

Why should you be able to do that, exactly? Human vision is frequently tricked by its lack of depth data.

reply
"Exactly" is impossible: there are multiple Lidar samples that would map to the same camera sample. But what training would do is build a model that could infer the most likely Lidar representation from a camera representation. There would still be cases where the most likely Lidar for a camera input isn't a useful/good representation of reality, e.g. a scene with very high dynamic range.
reply
No, I don't think that will be successful. Consider a day where the temperature and humidity is just right to make tail pipe exhaust form dense fog clouds. That will be opaque or nearly so to a camera, transparent to a radar, and I would assume something in between to a lidar. Multi-modal sensor fusion is always going to be more reliable at classifying some kinds of challenging scene segments. It doesn't take long to imagine many other scenarios where fusing the returns of multiple sensors is going to greatly increase classification accuracy.
reply
Sure, but those models would never have online access to information only provided in lidar data…
reply
No, but if you run a shadow or offline camera-only model in parallel with a camera + LIDAR model, you can (1) measure how much worse the camera-only model is so you can decide when (if ever) it's safe enough to stop installing LIDAR, and (2) look at the specific inputs for which the models diverge and focus on improving the camera-only model in those situations.
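
A hedged sketch of what that shadow comparison could look like (the models, threshold, and per-frame output are all placeholders, not anyone's actual pipeline):

    import torch

    # Run the camera-only model alongside the camera+lidar model on the same
    # logged frames and flag the frames where they disagree the most.
    def find_divergent_frames(frames, fused_model, camera_model, threshold=0.5):
        hard_cases = []
        for frame_id, (cam, lidar) in enumerate(frames):
            with torch.no_grad():
                gap = (fused_model(cam, lidar) - camera_model(cam)).abs().max().item()
            if gap > threshold:
                hard_cases.append((frame_id, gap))  # route these to review / retraining
        return hard_cases

    # toy usage with stand-in "models"
    frames = [(torch.randn(8), torch.randn(8)) for _ in range(20)]
    print(find_divergent_frames(frames,
                                fused_model=lambda c, l: c + 0.3 * l,
                                camera_model=lambda c: c))

The flagged frames give you both the safety metric in (1) and the targeted training set in (2).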
reply