What are the TTS models you know to be faster than Tacotron?

I believe we’ve done almost everything practically possible on Tacotron. Mozilla TTS has the most robust public Tacotron implementation so far. However, it is still slightly slow for low-end devices.

It is time for us to go for a new model. I just want to ask your opinion about what model we should use for this next iteration. You can also share some papers if you like.


I think these two would be worth a look, as their non-autoregressive approach makes them parallelizable:
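To make the non-autoregressive point concrete, here is a toy numpy sketch (purely illustrative shapes and weights, not any real model): an autoregressive decoder must loop because frame t depends on frame t-1, while a parallel decoder can emit all frames in one batched op.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 80  # hypothetical: 100 decoder steps, 80-dim mel frames
W = rng.standard_normal((D, D)) * 0.01

def autoregressive_decode(T):
    """Each frame depends on the previous one -> strictly sequential."""
    frames = [np.zeros(D)]
    for _ in range(T):
        frames.append(np.tanh(W @ frames[-1]))  # step t needs step t-1
    return np.stack(frames[1:])

def parallel_decode(T):
    """Frames computed independently from encoder outputs -> one matmul."""
    encoder_out = rng.standard_normal((T, D))
    return np.tanh(encoder_out @ W.T)  # all T frames at once

print(autoregressive_decode(T).shape)  # (100, 80)
print(parallel_decode(T).shape)        # (100, 80)
```

The sequential loop is what makes Tacotron-style decoding slow on weak hardware; the second function is the shape of the speedup these papers are after.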


I guess the first model needs text-to-voice alignment information to be extracted beforehand. And, as with any Google TTS paper, they do not explain the real core of the model, which is the part that extracts linguistic features from the text. I'd guess it is relatively harder to implement and train for different languages.

But thanks for the second link. I didn't know about that one.


What about FastSpeech 1 & 2 and DurIAN? Though they need duration info for training.
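The "duration info" these models need feeds a length regulator: each phoneme encoding is repeated for as many mel frames as its (predicted or extracted) duration. A minimal sketch of that expansion step, with made-up toy values:

```python
import numpy as np

def length_regulator(phoneme_feats, durations):
    """Repeat each phoneme encoding by its duration in frames."""
    return np.repeat(phoneme_feats, durations, axis=0)

feats = np.arange(6).reshape(3, 2)   # 3 phonemes, 2-dim toy encodings
durs = np.array([2, 1, 3])           # frames per phoneme
out = length_regulator(feats, durs)
print(out.shape)  # (6, 2) -> total frames = sum(durations)
```

Once durations are known, decoding needs no autoregressive loop, which is exactly why these models are fast; the cost is extracting those durations (e.g. from a teacher model or forced alignment) at training time.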


The Springer guys claim to have a fast model, but since you know them, you probably know the repo as well :slight_smile:


Definitely don't forget about the LPCNet vocoder; there's a paper on how to make it faster:
[FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction](https://arxiv.org/pdf/2005.05551.pdf)
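The "LP" in LPCNet is plain linear prediction: a cheap filter estimates each sample from the previous ones, so the network only has to model the small residual. A toy numpy sketch of that prediction step on a sinusoid (hypothetical 2nd-order coefficients, not LPCNet's actual analysis):

```python
import numpy as np

def lpc_predict(signal, coeffs):
    """Estimate s[t] from the previous len(coeffs) samples."""
    order = len(coeffs)
    pred = np.zeros_like(signal)
    for t in range(order, len(signal)):
        # newest sample first: coeffs[0]*s[t-1] + coeffs[1]*s[t-2] + ...
        pred[t] = coeffs @ signal[t - order:t][::-1]
    return pred

t = np.arange(1000) / 1000
s = np.sin(2 * np.pi * 5 * t)
# a sinusoid satisfies s[t] = 2cos(w)s[t-1] - s[t-2] exactly
w = 2 * np.pi * 5 / 1000
pred = lpc_predict(s, np.array([2 * np.cos(w), -1.0]))
residual = s[2:] - pred[2:]
print(np.max(np.abs(residual)))  # near zero: LPC captures most of the signal
```

Because the residual is much simpler than the raw waveform, the neural part can stay tiny, which is where LPCNet's speed comes from.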

@erogol Thanks for all your awesome work!


Multi-band MelGAN already provides very fast inference. I think the bottleneck is the TTS model, which is what we need to update. But these are also good suggestions.
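One way to back up the "TTS model is the bottleneck" claim is to time the two stages separately. A rough sketch with stand-in functions (the shapes and models here are hypothetical placeholders, not Mozilla TTS calls):

```python
import time
import numpy as np

def profile(fn, *args, runs=10):
    """Rough wall-clock average over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

# stand-ins for the real pipeline stages
def tts_model(text_ids):   # text -> mel spectrogram
    return np.random.standard_normal((len(text_ids) * 5, 80))

def vocoder(mel):          # mel -> waveform
    return np.random.standard_normal(mel.shape[0] * 256)

mel = tts_model(np.arange(50))
print(f"TTS:     {profile(tts_model, np.arange(50)) * 1e3:.3f} ms/run")
print(f"Vocoder: {profile(vocoder, mel) * 1e3:.3f} ms/run")
```

Swapping the real models into the stand-ins would show per-stage latency on the target low-end device and make the comparison concrete.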