High Quality TTS | Synthesis Time Is Not a Constraint | Pipeline

Hi @erogol and Mozilla TTS members,

We already have a decent-quality, fast TTS for conversational applications using a Tacotron2 + MelGAN architecture.

We now want a very high-quality TTS where synthesis time is not a constraint at all. Please let us know which pipeline would be most suitable.

Also, which parameters can we tune to improve quality? For example, should we increase the sampling rate above 22 kHz, or increase the number of mel bins from 80 to 160?

Does multi-speaker data help in achieving this? If so, with which pipeline?
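To make the parameter question concrete, here is a rough sketch of how the sampling rate, hop length, and window length relate to each other. The function name, the config keys, and the hop/window values are assumptions for illustration (they are not taken from the Mozilla TTS config), and 22050 Hz is assumed as the exact value behind "22kHz". The point is only that doubling the sampling rate doubles the highest representable frequency, and that hop and window sizes usually need to scale with it to keep the same frame timing.

```python
def derived_audio_params(sample_rate, num_mels, hop_length, win_length):
    """Compute a few quantities that follow from an audio config.

    All argument names are hypothetical; adapt them to your own config.
    """
    return {
        # Highest frequency the signal can represent (Nyquist limit).
        "nyquist_hz": sample_rate / 2,
        # Number of mel frames the acoustic model predicts per second.
        "frames_per_second": sample_rate / hop_length,
        # Duration of one analysis window in milliseconds.
        "window_ms": 1000 * win_length / sample_rate,
        # Mel bins are passed through unchanged; they set spectral resolution.
        "num_mels": num_mels,
    }


# Assumed baseline: 22.05 kHz audio, 80 mel bins (hop/window are example values).
baseline = derived_audio_params(22050, 80, hop_length=256, win_length=1024)

# Assumed high-rate variant: hop and window doubled to keep frame timing equal.
high_rate = derived_audio_params(44100, 160, hop_length=512, win_length=2048)

if __name__ == "__main__":
    print(baseline)
    print(high_rate)
```

With these example values, both setups produce the same number of frames per second and the same window duration; only the frequency range and the number of mel bins change. Whether that actually improves perceived quality depends on the training data and on the vocoder.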

Hi, how many steps has Taco2 been trained for? It is highly probable that the Taco2 part is good to go anyway, unless you want to remove unused layers and make the model a bit smaller (if that is possible). The main constraint is vocoding, and if you are after speed with okay quality, MelGAN is your only choice right now (but you can try all the MelGAN variants and see which one works better). ParallelWaveGAN is a bit better, but slower. After that, I gather things get too slow for your objective. WaveRNN is high quality but slow. We also recently ran some experiments with WaveGrad, but that is also slower.
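One way to put numbers on "fast" versus "slow" when comparing vocoders is the real-time factor (RTF): synthesis time divided by the duration of the generated audio. The sketch below is a generic helper, not part of the TTS repo; `real_time_factor`, `time_synthesis`, and the `synthesize` callable are hypothetical names, and `synthesize` is assumed to return a sequence of audio samples.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """Return synthesis time divided by audio duration.

    RTF < 1 means faster than real time; RTF > 1 means slower.
    """
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def time_synthesis(synthesize, text, sample_rate):
    """Time one call to a hypothetical `synthesize(text)` and return its RTF."""
    start = time.perf_counter()
    waveform = synthesize(text)  # assumed to return a sequence of samples
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)


if __name__ == "__main__":
    # Stand-in for a real vocoder: returns one second of silence at 22050 Hz.
    def dummy_synthesize(text):
        return [0.0] * 22050

    print(time_synthesis(dummy_synthesize, "hello", 22050))
```

If synthesis time really is no constraint, a high RTF is acceptable and the comparison can be made on listening quality alone.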

I think they’re now after the highest quality, irrespective of processing time. In that case, WaveRNN might actually suit their needs.

@Chak_Mish - I guess you didn’t mean it, but like many new posters your request comes across a bit like “here are my problems, please do my work for me”. I would highlight that this is a free community and the best thing is to look over the forum, look at the repo code, look through issues and ask for clarification about what you’ve found, as then it won’t seem like you’re bringing nothing to the discussion whilst expecting a free lunch.


Hi @georroussos, actually we don’t want a fast TTS. Slower is fine, provided the TTS quality is excellent. As I said, Taco2 + MelGAN is good at balancing quality and time. But if there is no time limitation, which pipeline should we follow? From your suggestion, I guess you would recommend Taco2 + WaveRNN for high-quality speech synthesis (albeit slower).

Hi @nmstoker, yes it’s my first post. Before posting, I did look at the discourse forum quite thoroughly. In fact, I went through several recent research papers as well before posting my query.

What I found in general, even in recent research papers, is that the focus is primarily on synthesis speed. Everyone claims to get close to the ground truth, but the main focus of the recent papers is reducing synthesis time, so they mostly propose non-autoregressive architectures.

As I said, Taco2 + MelGAN (including its variants) is good. But what if voice quality is the only criterion, irrespective of synthesis time? Going by your suggestions, Taco2 + WaveRNN is possibly the right choice. Is that correct? Which other parameters in Taco2/WaveRNN can be tuned for higher quality?

Like everyone said, the best pipeline is the one using WaveRNN. WaveRNN is not merged into the project yet, but if you are willing to work on it, there is an issue where you can help integrate it into the TTS repo. I can also help along the way if you have any questions.