I guess there's a benefit to running that step without subsampling to the initial 256 tokens per image/frame ( https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_... ). Going on from that, https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which makes this 60 Hz frame digestion rate produce just under 31.5 million floats-of-your-chosen-precision per second (256 × 2048 × 60). At least at the high (768x768) input resolution, this is a bit less than one float per pixel.
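That arithmetic can be sketched as follows, assuming 256 tokens per frame, 2048 values per token, and 60 frames per second as stated above (the constant names are only for illustration):

```python
# Back-of-the-envelope numbers from the comment above; names are illustrative.
TOKENS_PER_FRAME = 256     # per the Gemma 3n model card
DIMS_PER_TOKEN = 2048      # per the reverse-engineering repo
FRAMES_PER_SECOND = 60
WIDTH = HEIGHT = 768       # highest input resolution

floats_per_frame = TOKENS_PER_FRAME * DIMS_PER_TOKEN
floats_per_second = floats_per_frame * FRAMES_PER_SECOND
floats_per_pixel = floats_per_frame / (WIDTH * HEIGHT)

print(floats_per_second)           # 31457280
print(round(floats_per_pixel, 3))  # 0.889
```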
I guess maybe with very heavy quantization, to something like 4 bits, that could beat sufficiently artifact-free video coding for streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 tokens/s (256 tokens × 60 Hz) at the (streaming) prefill stage?
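To put a number on that, here is a quick sketch of the raw stream rate at a few precisions, assuming no further compression on top of the quantization (the helper name is made up for illustration):

```python
# Raw bitrate of the token stream at a few precisions (no entropy coding assumed).
FLOATS_PER_SECOND = 256 * 2048 * 60  # tokens/frame * dims/token * frames/s


def stream_mbit_per_s(bits_per_value: int) -> float:
    """Uncompressed stream rate in megabits per second."""
    return FLOATS_PER_SECOND * bits_per_value / 1e6


for bits in (16, 8, 4):
    print(f"{bits:>2} bit: {stream_mbit_per_s(bits):.1f} Mbit/s")
```

At 4 bits this comes to about 126 Mbit/s, which is the figure a video codec at comparable quality would have to be measured against.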
Or I could imagine just local on-device visual semantic search: expand the search query into a bunch of tokens that each carry a signed desire/want-ness score, attend the search tokens to the frame's encoded tokens, apply an activation function, scale the result (positive or negative) by the search token's desire score, and then sum everything over each frame to get a frame score that can be used for ranking and other such search-related tasks.
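A minimal sketch of that scoring idea, under loud assumptions: "attending" is simplified to a plain dot product between a search token and a frame token (no softmax, no learned projections), the activation is ReLU because none is named above, and all function names and toy vectors are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relu(x):
    # Assumed activation; the comment above does not specify one.
    return max(x, 0.0)


def frame_score(frame_tokens, query_tokens, desire, activation=relu):
    """Sum of desire * activation(query . frame_token) over all token pairs."""
    score = 0.0
    for q, want in zip(query_tokens, desire):
        for t in frame_tokens:
            score += want * activation(dot(q, t))
    return score


# Toy 2-dimensional "tokens"; real ones would be 2048-dimensional.
frames = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[-1.0, 0.0], [0.0, 1.0]],
}
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
desire = [1.0, -0.5]  # want the first direction, mildly avoid the second

ranked = sorted(
    frames,
    key=lambda name: frame_score(frames[name], query_tokens, desire),
    reverse=True,
)
print(ranked)  # ['a', 'b']
```

Per frame and search token this costs one dot product against each of the frame's tokens, which is what makes the per-frame cost easy to estimate.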
(For that last thought, I asked Gemini 2.5 Pro to calculate the FLOP load, and it came out to 1.05 MFLOPs per frame per search token; Reddit suggests the current Pixel's TPU does around 50 TOPS, so if these reasonably match each other terminology-wise, and assuming we're spending about 20% of its compute on the search/match aspect, it comes out to an unreasonable(-seeming) roughly 160k tokens (at 60 frames per second) that the search query could get expanded to. I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before encountering throughput issues in this.)
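The budget arithmetic can be sketched like this. It assumes the 1.05 MFLOPs figure corresponds to one multiply-add per dimension against each of the 256 frame tokens (my reading, not something confirmed above), treats TOPS and FLOPs as interchangeable as the comment does, and uses the 60 Hz frame rate from earlier:

```python
# All constants are rough assumptions taken from the discussion above.
FLOPS_PER_FRAME_PER_QUERY_TOKEN = 2 * 256 * 2048  # multiply + add, ~1.05e6
ACCELERATOR_OPS_PER_SECOND = 50e12                # the ~50 TOPS figure from Reddit
SEARCH_SHARE = 0.20                               # fraction of compute spent on search
FRAMES_PER_SECOND = 60

ops_per_frame = ACCELERATOR_OPS_PER_SECOND * SEARCH_SHARE / FRAMES_PER_SECOND
max_query_tokens = ops_per_frame / FLOPS_PER_FRAME_PER_QUERY_TOKEN

print(f"{max_query_tokens:,.0f} search tokens per frame")
```

Under these assumptions the budget lands on the order of 10^5 search tokens, so the conclusion that throughput is not the bottleneck holds even if the individual figures are off by a fair margin.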