Same, I really like the ONNX format. I only wish ONNX models weren't so frustratingly difficult to use on Apple iOS. Apple's browser engine, WebKit, has become annoyingly restrictive over the years in terms of the per-page working memory cap.

I ran into quite a few out-of-memory issues in iOS Safari when I was building continuous voice recognition for my blind-accessible chess game, so people could play while on the go.

reply
Interesting, what use cases are you using ONNX for, btw?
reply
So I use a VAD ONNX model (Silero [1]) to automatically detect when someone is talking, and then send the audio into one of the voice recognition libraries.
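
In case it's useful, here's a minimal sketch of the gating idea in Python. The `speech_prob` callback is a hypothetical stand-in for the Silero model's per-frame output, and the pre-roll/hangover counts are made-up defaults; the point is just that a little padding around detected speech keeps the recognizer from getting clipped word onsets and endings.

```python
from collections import deque

def gate_speech(frames, speech_prob, threshold=0.5, preroll=2, hangover=3):
    """Yield only the frames in and around detected speech.

    frames:      iterable of audio chunks.
    speech_prob: callback returning a 0..1 speech probability per frame
                 (stand-in for the Silero VAD model here).
    preroll:     recent non-speech frames replayed before speech starts,
                 so word onsets aren't clipped.
    hangover:    trailing non-speech frames kept after speech ends,
                 so word endings aren't clipped.
    """
    buffer = deque(maxlen=preroll)  # most recent non-speech frames
    trailing = 0
    for frame in frames:
        if speech_prob(frame) >= threshold:
            yield from buffer       # emit pre-roll first
            buffer.clear()
            trailing = hangover
            yield frame
        elif trailing > 0:          # still inside the hangover window
            trailing -= 1
            yield frame
        else:
            buffer.append(frame)    # silence: remember it as pre-roll
```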

I originally tried to get away with just Whisper Tiny in the chess game [2], but it performs worse on the kinds of short phrases (knight E4, c takes d5, etc.) used to dictate chess notation. Even with hotword-based phrasing and corrections, I found its accuracy on brief inputs noticeably poorer. So I switched over to Sherpa [3] trained on GigaSpeech. It's significantly more accurate, but it also comes with a correspondingly larger memory footprint.

Ideally, I would have used just one engine, but I needed a fallback for iOS devices (especially older ones), which can easily OOM.
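
The fallback choice boils down to a simple budget check. A sketch with purely illustrative numbers and engine names (WebKit's real per-page cap varies by device and OS version, and these aren't the thresholds I actually ship):

```python
def pick_engine(device_memory_gb, is_ios):
    """Choose an ASR engine under a rough memory budget.

    Hypothetical thresholds: the Sherpa/GigaSpeech model needs the
    larger footprint; Whisper Tiny is the fallback for constrained
    devices.
    """
    # WebKit caps a page's working memory well below physical RAM,
    # so apply a stricter budget on iOS (illustrative 1 GB cap).
    budget_gb = min(device_memory_gb, 1.0) if is_ios else device_memory_gb
    return "sherpa-gigaspeech" if budget_gb >= 1.0 else "whisper-tiny"
```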

[1] - https://github.com/snakers4/silero-vad

[2] - https://shahkur.specr.net

[3] - https://github.com/k2-fsa/sherpa-onnx

reply
RNNoise has a built-in VAD that works much better than Silero's.

https://github.com/xiph/rnnoise

reply
Most ONNX files are fp32, but the ONNX format actually allows fp16, int8, etc. as well (see onnx.proto for the full list of dtypes [1] - they even have fp8/fp4 these days!). I ended up switching over to fp16 ONNX models for my own web-based inference project since the quality is ~identical and page loads get 2x faster.

[1] https://github.com/onnx/onnx/blob/main/onnx/onnx.proto#L605
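
The ~2x load-time win falls straight out of the element sizes: fp32 is 4 bytes per parameter, fp16 is 2. A back-of-the-envelope sketch (ignores graph/proto overhead, which is small for large models):

```python
# Bytes per element for a few ONNX tensor dtypes.
BYTES = {"fp32": 4, "fp16": 2, "int8": 1}

def download_mb(param_count, dtype):
    """Rough model download size in MB: parameters x bytes per element."""
    return param_count * BYTES[dtype] / 1e6

# e.g. a 40M-parameter model: ~160 MB in fp32 vs ~80 MB in fp16,
# which is where the ~2x page-load speedup comes from.
```

The actual conversion is done offline; onnxconverter-common ships float16 conversion utilities for this, if I remember correctly.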

reply
Thanks for the pointer. I need to take a look at that version of the spec.
reply
Yeah, it's pretty cool what a 2 GB NN can do from a single image.
reply