Minimax's new model is quite good. We use their voices for some of our Japanese tutors. The pitch accent is almost perfect.

There are occasional incorrect readings or Chinese readings, but you can tell when that happens because the furigana come out different.

reply
If you have the correct furigana, you could even detect when the TTS model picked the wrong reading and regenerate.

But how do you know the furigana are correct? Unless you start out with fully human-annotated text, you need some automated procedure to add furigana, which pushes the problem from "TTS AI picked the wrong reading" to "furigana AI picked the wrong reading."
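
For what it's worth, the detect-and-regenerate loop could look roughly like the sketch below. It assumes you already have a trusted kana reading for the text, and that your TTS call can somehow report the reading it actually spoke (engine metadata, or ASR on the generated audio). The `synthesize` callable and its return shape are hypothetical and not any vendor's real API.

```python
from typing import Callable, Optional, Tuple

KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def to_hiragana(text: str) -> str:
    """Fold katakana into hiragana so readings compare regardless of script."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def normalize_reading(reading: str) -> str:
    """Drop whitespace and fold katakana to hiragana."""
    return to_hiragana("".join(reading.split()))


# Hypothetical shape: text in, (audio bytes, kana reading the engine used) out.
Synthesize = Callable[[str], Tuple[bytes, str]]


def synthesize_checked(
    text: str,
    expected_reading: str,
    synthesize: Synthesize,
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Regenerate until the spoken reading matches the furigana, or give up."""
    target = normalize_reading(expected_reading)
    for _ in range(max_attempts):
        audio, spoken = synthesize(text)
        if normalize_reading(spoken) == target:
            return audio
    # Every attempt used a different reading: flag for human review.
    return None
```

Returning `None` instead of the last attempt is a deliberate choice here, so a clip with a known-bad reading never reaches a learner silently.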

reply
Yes, it pushes the problem, but it's a much easier problem, and models like Gemini 2.5 Flash do very well.
reply
Yeah, Japanese TTS is a lot harder than it looks. I’m also building a language learning application, and constantly ran into incorrect readings. ElevenLabs, ElevenLabs v3, OpenAI, play.ht, Azure, Google, Polly: I’ve tried them all. They are all really bad (more than a third of the expressions had an error in them somewhere).

It _is_ fixable though. It took me about a week, but I have yet to find a mistaken reading now. This also seems to be the case only with Japanese: for most tonal languages, the output seems to have the correct tones. (I’m not qualified to comment on how natural the tones sound, but I have yet to find a mismatch like in Japanese.)

reply
Yes. AI transcription is great, AI translation is OK (depending on language pair), but TTS is still pretty awful for most languages.
reply
Also a Japanese learner here, albeit a beginner. As I understand it, pitch accent is about stress: languages can stress a syllable with length, volume, pitch, etc. Spanish uses vowel length, Icelandic uses volume, English uses a combination of length and volume, and Swedish (just like Japanese) uses pitch. Just like in English, if you put the wrong stress on a word, the result can range anywhere from sounding foreign to being incomprehensible. (Aside: I always remember trying to say the name of the band Duran Duran to an English speaker while putting the stress on the first syllable, as is normal in Icelandic. My listener had no idea what I was saying; it took probably 30 attempts before they corrected me with the right stress.)

I think Japanese is somewhat special, though, in having a large number of homonyms (i.e. words that are spelled the same), so speaking with the correct pitch becomes somewhat more important.

reply
Somewhat more important, but as someone with decent Japanese who knows about pitch accent but can barely hear the difference in real time, and never actively learned it except for the few well-known examples like bridge/chopsticks, I don't think it matters all that much. Yes, you'll sound foreign. But you'll be understood nevertheless, in the vast majority of cases.
reply
Speaking of bridge/chopsticks, I created a video a couple of months ago to try to spot the difference myself:

https://imgur.com/KJXanqc

reply
Here's the problem: pitch accent is easy to hear in isolation and/or in comparison. Under real life conditions, in the middle of a sentence, it's a completely different experience. But then you're saved by context. Because candy is most likely not falling from the sky. Homophones that are still ambiguous in context are possible, but a rare occurrence in my experience.