undefined

points

[-]

real time audio stem separation is already possible, some specific models can even get around 20ms latency (HS-TasNet) https://github.com/lucidrains/HS-TasNet

There was a nice paper with an overview last year too https://arxiv.org/html/2511.13146v1 that introduced RT-STT which is still being tweaked and built upon in the MSS scene

The high quality ones like MDXNet and Demucs usually have at least several seconds of latency though, but for something like displaying visuals high quality is not really needed and the real time approaches should be fine.

by omneity4 hours ago|

parent|

[-]

I'm pretty sure it should be possible to distill HS-TasNet into a version approximate and fast enough for the purpose of animating LEDs.

At the end it's "just" chunking streamed audio into windows and predicting which LEDs a window should activate. One can build a complex non-realtime pipeline, generate high-quality training data with it, and then train a much smaller model (maybe even an MLP) with it to predict just this task.