Intelligence is the big one here, especially since the context keeps growing (conversation memory accumulates) and model quality degrades as the context gets longer.
Another major issue is TTS voice quality, but this seems to be improving a lot for small local models.
EDIT: You're right, latency is also a big deal. You need each stage of the pipeline (ASR, LLM, TTS) under a second for the exchange to feel live, and the LLM step in particular would be slow on mobile devices.
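For anyone profiling this, here's a minimal sketch of how I'd time each stage to see where the latency budget actually goes. The `transcribe`, `generate`, and `synthesize` functions are hypothetical placeholders for whatever local ASR, LLM, and TTS models you run:

```python
import time

def timed(label, fn, *args):
    # Wraps a pipeline stage and prints its wall-clock time in ms.
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.0f} ms")
    return result

def handle_utterance(audio, transcribe, generate, synthesize):
    # Hypothetical stage functions passed in by the caller.
    # Each stage should stay well under ~1 s for the exchange to
    # feel live; the LLM step usually dominates, especially on
    # mobile hardware.
    text = timed("ASR", transcribe, audio)
    reply = timed("LLM", generate, text)
    speech = timed("TTS", synthesize, reply)
    return speech
```

Even rough numbers from something like this make it obvious which stage to swap out or quantize first.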