what's your use case and what specific LLMs are you using?
I'm using STT > post-trained models > TTS for the education tool I'm building, but full STS would be the endgame. Email and Discord username are in my profile if you want to connect!
The fact that qwen3-asr-swift bundles ASR, TTS, and PersonaPlex in one Swift package means you already have all the pieces. PersonaPlex handles the "mouth" — low-latency backchanneling, natural turn-taking, filler responses at RTF 0.87. Meanwhile a separate LLM with tool calling operates as the "brain", and when it returns a result you can fall back to the ASR+LLM+TTS path for the factual answer. taf2's fork (running a parallel LLM to infer when to call tools) already demonstrates this pattern. It's basically how humans work — we say "hmm, let me think about that" while our brain is actually retrieving the answer. We don't go silent for 2 seconds.
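Rough sketch of that split in Swift. The Mouth/Brain protocols and method names here are made up for illustration, not the actual qwen3-asr-swift or PersonaPlex API:

```swift
import Foundation

// Hypothetical interfaces for illustration -- the real qwen3-asr-swift /
// PersonaPlex APIs will look different.
protocol Mouth: Sendable {
    // Fast conversational model: backchannels, fillers, turn-taking.
    func filler(for userUtterance: String) async -> String
    func speak(_ text: String) async
}

protocol Brain: Sendable {
    // Slower tool-calling LLM: returns a verified, factual answer.
    func answer(for userUtterance: String) async throws -> String
}

struct VoiceAgent {
    let mouth: any Mouth
    let brain: any Brain

    func handle(_ utterance: String) async {
        // Start the slow path immediately so it runs while the mouth is talking.
        let brainTask = Task { try await brain.answer(for: utterance) }

        // Fast path: keep the turn alive ("hmm, let me think about that...").
        let holdLine = await mouth.filler(for: utterance)
        await mouth.speak(holdLine)

        // When the brain comes back, hand the factual answer to TTS.
        if let answer = try? await brainTask.value {
            await mouth.speak(answer)
        } else {
            await mouth.speak("Sorry, I couldn't find that.")
        }
    }
}
```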
The hard unsolved part is the orchestration between the two. When does the brain override the mouth? How do you prevent PersonaPlex from confidently answering something the reasoning model hasn't verified? How do you handle the moment a tool result contradicts what the fast model already started saying?
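One possible shape for that arbitration layer, purely illustrative, with naive keyword/overlap heuristics standing in where you'd really want a classifier or NLI-style check:

```swift
import Foundation

// Purely illustrative arbitration between mouth and brain.
enum Arbitration {
    case letMouthFinish                 // filler is still harmless, don't interrupt
    case appendAnswer(String)           // nothing factual said yet, just continue with the answer
    case interruptAndCorrect(String)    // the brain's result contradicts what's already being said
}

struct Arbiter {
    /// Called when the verified answer arrives while the mouth is mid-utterance.
    func resolve(spokenSoFar: String, verifiedAnswer: String) -> Arbitration {
        // If the mouth has only produced backchannel so far, there's nothing to retract.
        if isFillerOnly(spokenSoFar) {
            return .appendAnswer(verifiedAnswer)
        }
        // If the mouth has started asserting facts, check whether they still hold up.
        if contradicts(spokenSoFar, verifiedAnswer) {
            return .interruptAndCorrect("Actually, let me correct that: \(verifiedAnswer)")
        }
        return .letMouthFinish
    }

    // Placeholder: a keyword check standing in for a real filler-vs-content classifier.
    private func isFillerOnly(_ text: String) -> Bool {
        if text.isEmpty { return true }
        let fillers = ["hmm", "let me think", "let me check", "good question"]
        return text.count < 60 && fillers.contains { text.lowercased().contains($0) }
    }

    // Placeholder: word overlap standing in for a real contradiction check.
    private func contradicts(_ spoken: String, _ verified: String) -> Bool {
        let spokenWords = Set(spoken.lowercased().split(separator: " "))
        let verifiedWords = Set(verified.lowercased().split(separator: " "))
        return spokenWords.intersection(verifiedWords).count < 2
    }
}
```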
And you can always swap the LLM out for GPT-5 or Claude, for example.
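E.g. anything that can answer a prompt could conform to the Brain protocol from the sketch above. The endpoint and payload here are just the generic chat-completions shape, not any specific provider's SDK:

```swift
import Foundation

// Swapping the "brain": any backend behind the same Brain protocol can slot in.
struct RemoteChatBrain: Brain {
    let endpoint: URL      // your provider's chat endpoint
    let apiKey: String
    let model: String      // e.g. a GPT-5 or Claude model id, or a local model

    func answer(for userUtterance: String) async throws -> String {
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONSerialization.data(withJSONObject: [
            "model": model,
            "messages": [["role": "user", "content": userUtterance]]
        ] as [String: Any])

        let (data, _) = try await URLSession.shared.data(for: request)
        // Response parsing is provider-specific; returning raw text keeps the sketch short.
        return String(decoding: data, as: UTF8.self)
    }
}
```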
The uncanny thing is that it reacts to speech faster than a person would. It doesn't say anything useful, and there's no clear path to plugging it into smarter models, but it's worth experiencing.