> MobileNet-V5-300M
Which makes sense, as it's 300M parameters and probably far less complex; not a multi-billion-parameter transformer.
I guess there's benefit to running that step without subsampling beyond the initial 256 tokens per image/frame ( https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_... ) and working from those. https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which means this 60 Hz frame digestion rate produces just under 31.5 million floats-of-your-chosen-precision per second. At least at the high (768x768) input resolution, that's a bit less than one float per pixel.
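For the curious, a quick sanity check of that arithmetic (all figures are assumptions taken from the links above, not measurements):

    # Assumptions: 256 tokens/frame, 2048 dims/token, 60 fps, 768x768 input
    # (all taken from the model card and reverse-engineering links above).
    tokens_per_frame = 256
    dims_per_token = 2048
    fps = 60
    pixels_per_frame = 768 * 768

    floats_per_second = tokens_per_frame * dims_per_token * fps
    floats_per_pixel = (tokens_per_frame * dims_per_token) / pixels_per_frame

    print(f"{floats_per_second:,} floats/s")       # 31,457,280 -> just under 31.5M
    print(f"{floats_per_pixel:.3f} floats/pixel")  # ~0.889 -> a bit less than 1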
I guess that, with very heavy quantization to something like 4 bits, this could maybe beat sufficiently-artifact-free video coding for streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15,360 tokens/s at the (streaming) prefill stage?
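Rough bandwidth math for that idea, under the same assumed figures plus 4 bits per dimension and no entropy coding on top:

    tokens_per_second = 256 * 60          # 15,360 tokens/s at 60 fps
    bits_per_token = 2048 * 4             # 2048 dims at 4 bits each
    bits_per_second = tokens_per_second * bits_per_token

    print(f"{bits_per_second / 1e6:.0f} Mbit/s")    # ~126 Mbit/s
    print(f"{bits_per_second / 8 / 1e6:.1f} MB/s")  # ~15.7 MB/s of tokenized vision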
Or I could imagine purely local, on-device visual semantic search: expand the search query into a bunch of tokens, each carrying a signed desire/want-ness score; attend those search tokens to the frame's encoded tokens; apply an activation function; scale the result (positive/negative) by each search token's desire score; and then sum over each frame to get a frame score that can be used for ranking and other search-related tasks.
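A minimal sketch of what that per-frame scoring could look like (NumPy; all shapes and names are hypothetical, and the activation here is just ReLU as a placeholder):

    import numpy as np

    def frame_score(frame_tokens, query_tokens, desire):
        """Hypothetical relevance score for one frame.

        frame_tokens: (256, 2048) encoded vision tokens for the frame
        query_tokens: (Q, 2048) tokens the search query was expanded into
        desire:       (Q,) signed desire/want-ness score per query token
        """
        sim = query_tokens @ frame_tokens.T   # attend query tokens to frame tokens, (Q, 256)
        act = np.maximum(sim, 0.0)            # placeholder activation function (ReLU)
        weighted = act * desire[:, None]      # scale by signed desire score
        return float(weighted.sum())          # sum over the frame -> one scalar for ranking

    # Toy usage: rank 5 random "frames" against a 3-token expanded query.
    rng = np.random.default_rng(0)
    frames = [rng.standard_normal((256, 2048)) for _ in range(5)]
    query = rng.standard_normal((3, 2048))
    desire = np.array([+1.0, +0.5, -1.0])     # last token expresses "don't want"
    ranked = sorted(range(len(frames)), key=lambda i: -frame_score(frames[i], query, desire))
    print(ranked)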
(For that last thought, I asked Gemini 2.5 Pro to estimate the FLOPS load, and it came out to 1.05 MFLOPS per frame per search token; Reddit suggests the current Pixel's TPU does around 50 TOPS, so if those terminologies reasonably match up, and assuming we spend about 20% of its compute on the search/match aspect, it comes out to an unreasonable-seeming ~190k tokens that the search query could get expanded to. I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before throughput issues would.)
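For transparency, my reconstruction of that budget (the 1.05 MFLOP/frame/token and 50 TOPS figures are the assumptions quoted above; the result is sensitive to how many frames per second you assume get scored, and ~190k corresponds to roughly 50):

    flops_per_frame_per_token = 1.05e6   # Gemini 2.5 Pro's estimate, per above
    npu_ops_per_second = 50e12           # "around 50 TOPS", per Reddit
    budget_fraction = 0.20               # ~20% of the NPU spent on search/match
    frames_scored_per_second = 50        # assumption; at 60 this drops to ~160k

    ops_budget = npu_ops_per_second * budget_fraction
    max_query_tokens = ops_budget / (flops_per_frame_per_token * frames_scored_per_second)
    print(f"~{max_query_tokens:,.0f} query-expansion tokens")   # ~190,000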
> I guess that, with very heavy quantization to something like 4 bits, this could maybe beat sufficiently-artifact-free video coding for streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15,360 tokens/s at the (streaming) prefill stage?
The 6-7s I am seeing is what it costs to run the image model, even running on GPU on an M4 Max with 64GB of GPU RAM. This repros with my llama.cpp wrapper and with the llama.cpp demo of it.
It is simply getting tokens that is taking that long.
Given that reality, we can ignore it, of course. We could assume the image model does run on Pixel at 60 fps and there's just no demo APK available, or just say it's all not noteworthy because, as the Google employee points out, they can do it inside Google and external use hasn't been prioritized.
The problem is that the blog post announces this runs on device at up to 60 fps today, and announces $150K in prizes for work built on that premise. We have 0 evidence of this externally, the most plausible demo of it released externally by Google runs at 1/500th of this speed, and 1 likely Google employee is saying "yup, it doesn't; we haven't prioritized external users!" The best steelman we can come up with is "well, if the image model eventually runs at 60 fps, we could stream it to an LLM in the cloud with about 4 seconds of initiate + prefill latency!"
That's bad.
- Are there APK(s) that run on Tensor?
- Is it possible to run on Tensor if you're not Google?
- Is there anything at all from anyone I can download that'll run it on Tensor?
- If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
No. The AICore service internally uses the inference on Tensor (http://go/android-dev/ai/gemini-nano)
> Is there anything at all from anyone I can download that'll run it on Tensor?
No.
> If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
Mostly because 3P support has not been an engineering priority.
Got it: assuming you're at Google, in eng. parlance, it's okay if it's not Prioritized™, but then product/marketing/whoever shouldn't be publishing posts built around the premise that it's running 60 fps multimodal experiences on device.
They're very, very lucky that the ratio of people vaguely interested in this to people who follow through on using it is high, so comments like mine end up at -1.
https://ai.google.dev/edge/litert/android/npu/overview has been identical for a year+ now.
In practice, Qualcomm and MediaTek ship working NPU SDKs for third-party developers; NNAPI doesn't count and is deprecated anyway.
(n.b. to readers: if you click through, the Google Pixel Tensor API is listed as coming soon. So why in the world has Google been selling Tensor chips in Pixel as some big AI play since... idk, at least 2019?)
On third-party model workloads, this is what you will get:
https://ai-benchmark.com/ranking.html
https://browser.geekbench.com/ai-benchmarks (NPU tab, sort w/ quantisation and/or half precision)
Google is clearly not serious about this on Pixels in practice, and the GPU performance is also behind flagships by quite a lot, which really doesn't help. The CPUs are behind by quite a lot too...
The only one we have works as described; TL;DR: 0.1 fps.