Regarding the video capability: I haven't tested it myself, but here's a benchmark/comparison from Google [0].
Sure, maybe it's still frame-by-frame, but so fast and so frequent that the model retains a rolling context of what's going on and can cleanly answer temporal questions:
"how many packages were delivered over the last hour", etc.
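
To make the "rolling context" idea concrete, here's a toy sketch. It's purely illustrative: I don't know how the model handles this internally, and `RollingContext` and the `package_delivered` label are names I made up. It assumes some per-frame detector already emits timestamped labels, and just shows how a sliding window over those could answer a "last hour" question.

```python
from collections import deque


class RollingContext:
    """Keeps timestamped per-frame events for a fixed time window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp_seconds, label), oldest first

    def observe(self, timestamp, label):
        # Record what an assumed per-frame detector reported, then drop stale entries.
        self.events.append((timestamp, label))
        self._evict(timestamp)

    def _evict(self, now):
        # Remove everything that has fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def count(self, label, now):
        # Answer "how many <label> events in the last window?"
        self._evict(now)
        return sum(1 for _, seen in self.events if seen == label)


ctx = RollingContext(window_seconds=3600)
ctx.observe(0, "package_delivered")
ctx.observe(1800, "package_delivered")
ctx.observe(4000, "package_delivered")  # the event at t=0 is now older than an hour
recent = ctx.count("package_delivered", now=4000)
print(recent)
```

The point is only that nothing here needs "true" video understanding: per-frame observations plus a window that forgets old entries is enough for this kind of counting question.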