It's likely based on just the transcript. Even if it describes visual things, it's probably guessing them from the transcript text alone.
Maybe it's better now, but that was how it worked recently. To be convinced that it "watches" the video, I would need to see it refer to facts that can only be known from the video and can't be guessed from the audio.