Tacotron2 can achieve impressive results, and the benchmarks with LJSpeech do not really show this. With my dataset, which is far from TTS-oriented but has no background noise and completely matching transcriptions, I am able to synthesise speech of up to 5000 characters with minimal to no errors. My goal here is to make my TTS sound as natural as I can.
the secret to not being overwhelmed is to take it slow and try everything
I am from Ethiopia, working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.
I have a problem getting a GPU to train the recent TTS models, and I see that TransformerTTS is the fastest TTS. So can I train and use TransformerTTS using a CPU only, so that I can use it for my research work?
I am from Ethiopia, working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.
I have a problem getting a GPU to train the recent TTS models, and I see that VocGAN is the fastest TTS. So can I train and use VocGAN using a CPU only, so that I can use it for my research work?
Your question is probably best directed to the developers of the repo you link to, which, as far as I know, is not associated with the TTS repo here. Likewise for the similar message you posted directly after.
As a more general point regarding your GPU comment: if you don’t have access to a GPU directly then you may want to look into Google Colab - it’s free, but there are some additional challenges you’d need to work around, as they only let the kernel run for 12 hours (so you’d need to save checkpoints before the session expires, so you can continue progress when you restart). Best to Google for the details, as it’s somewhat off topic here as well.
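For what it’s worth, the checkpoint-saving idea can be sketched like this. This is a minimal stdlib-only sketch - `save_checkpoint` and `load_latest_checkpoint` are hypothetical names; in a real TTS training run you’d save the model and optimizer state (e.g. with `torch.save`) and ideally sync the files to Google Drive so they survive the Colab session:

```python
import os
import pickle

def save_checkpoint(state, step, checkpoint_dir="checkpoints"):
    """Persist training state so a fresh session can resume from it."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint_{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def load_latest_checkpoint(checkpoint_dir="checkpoints"):
    """Return the most recent checkpoint's state, or None on a cold start."""
    if not os.path.isdir(checkpoint_dir):
        return None
    files = sorted(
        os.listdir(checkpoint_dir),
        key=lambda name: int(name.split("_")[1].split(".")[0]),
    )
    if not files:
        return None
    with open(os.path.join(checkpoint_dir, files[-1]), "rb") as f:
        return pickle.load(f)
```

Calling `save_checkpoint` every N steps and `load_latest_checkpoint` at startup is enough to ride out the 12-hour limit.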
Thanks for this nice information about inference speed.
@erogol and others, we’d like to train (and release) models for many languages that can run about 100x realtime on a fast GPU. We use these models in our software to help language learners (language learning with Netflix). After reading this thread, the promising options seem to be FastSpeech2/GlowTTS/Speedyspeech + MB-Melgan/Hifi-GAN. Do you have any more specific advice for us?
Btw, the SpeedySpeech repo looks promising. @erogol, when you say you implemented it, does that mean it’s somewhere in your mozilla TTS repo? If not, would you like some help to do so?
@erogol, you said that any model would run at 100 RTF on a modern GPU. However, after some benchmarks, I can’t get above 3 to 4 RTF (for the available pre-trained models of mozilla-TTS).
Hardware: 1x RTX 2080 GPU
Tried configs:
tacotron2 + MB-MelGAN: ~3 RTF
GlowTTS + MB-MelGAN: ~3 RTF
speedy-speech + MB-MelGAN: ~4 RTF
What is the combination of models you used that reaches this 100 RTF?
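For reference, the RTF numbers in this thread can be reproduced with a simple timer. This is a sketch: here RTF means seconds of audio produced per second of compute (higher is faster), and `synthesize` is a stand-in for whatever model call you’re benchmarking:

```python
import time

def realtime_factor(synthesize, text, sample_rate=22050):
    """Measure how many seconds of audio are produced per second of
    compute. `synthesize` is assumed to return a 1-D sequence of
    audio samples at `sample_rate`."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    return audio_seconds / elapsed
```

Averaging over several sentences (and skipping the first call, which often includes warm-up costs) gives more stable numbers.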
I’m using one GPU (RTX 2080). The model takes around 1 GB of GPU RAM, and inference uses around 60 W of power.
If we split the process into TTS / vocoder, the TTS takes 98% of the time (around 0.3 RTF) and the vocoder only 2% (0.006 RTF). MB-MelGAN is really fast!
If we look closer at the TTS part, tacotron2 (or speedy-speech or GlowTTS) takes around 30% of the time, and the phonemize() method takes 70%. I thought that was weird, so I am doing more profiling:
The espeak processing is quite slow, and I don’t get why exactly at the moment. I used the Python script given in this thread to profile the time spent by espeak alone on sentences. The result is that it takes between 1 and 2 milliseconds (for sentences of 20 to 600 characters respectively). So espeak itself should only be responsible for 0.001 to 0.005 RTF on my machine.
I’ll let you know when I find the reason for this slow espeak processing!
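A quick way to do this kind of profiling is the stdlib cProfile module. This is a generic sketch - `profile_call` is a made-up helper, and in the thread’s case you’d pass in the actual synthesis call (the path that ends up in phonemize()):

```python
import cProfile
import io
import pstats

def profile_call(fn, *args):
    """Run fn under cProfile and return the stats report as a string,
    sorted by cumulative time, so the hot spots are easy to find."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()
```

The cumulative-time column makes it obvious whether the time goes to espeak itself or to the surrounding setup code.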
This initialization occurs for each sentence to be predicted and was taking around 0.2 RTF on my machine!
I tried initializing it when loading the model and managed to go from ~0.35 RTF to ~0.15. It’s more than two times faster with just this trick.
The phonemize processing is now only taking 0.05 RTF, whereas tacotron2 is taking ~0.1 RTF. Tacotron2 is then the bottleneck in this case. But if we use speedy_speech, the phonemize processing is once more the bottleneck.
I will continue to dig into this phonemize stuff and optimize it.
BTW, has no one else had this heavy initialization-time problem with the phonemizer, like me?
Hi @Kirian - I haven’t noticed particular issues on my main PCs although I haven’t dug into it as you have.
I wonder if it’s worth having a play with some of the options? Looking here:
I see that there’s the option for you to set an environment variable for the espeak location. Maybe see if that helps.
One idea that occurs to me is that maybe it’s taking a while to find espeak on your system (e.g. if you’ve got a lot of locations in your path, or for some other reason). If so, setting the variable might give you a speedup. This is just a guess - I haven’t tried it, but it would be something to rule out. The environment variable lets it skip reaching the call to distutils.spawn.find_executable.
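If you want to try the environment-variable route, something like this should work. Hedged: the variable name below is what I believe phonemizer reads for the espeak executable (check your installed version), and the path is just an example - verify it with `which espeak` on your machine:

```python
import os

# Point phonemizer straight at the espeak binary so it can skip the
# distutils.spawn.find_executable lookup. The variable name matches
# phonemizer 2.x as far as I know; the path is illustrative.
os.environ["PHONEMIZER_ESPEAK_PATH"] = "/usr/bin/espeak"

# Import phonemizer only after the variable is set, since the backend
# reads it when it initializes.
# from phonemizer import phonemize
```

Setting it in the shell before launching Python would have the same effect.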
I agree it does seem a little inefficient that phonemizer is going through these checks for the binary to call each time, but I suppose that’s the cost of getting what it offers.
Maybe it’s feasible to strip it out and go direct to espeak/espeak-ng for greater efficiency, but that then puts more complexity within TTS directly, and it probably makes sense to see further analysis before considering that (e.g. if others actually do see the same effect you found and it just hasn’t been noticed so far).
I’ve just tried hard-coding the path to espeak on the machine and nothing changed; the instance initialization still takes around 0.3 seconds. I’ve made sure it’s not a problem with mozilla-tts by running some timeit benchmarks in a separate script that only calls the phonemize method from phonemizer.
I think an easy way to fix it is to propose a modification to the phonemizer library: instead of exposing just a phonemize function, the library should give direct access to a class (Phonemizer, for instance) that we can initialize first and then call a phonemize method on.
In the case of mozilla-TTS, it would imply that, when loading and setting up the TTS model, an instance of Phonemizer is created and made available globally. Then, text2phone could call Phonemizer.phonemize.
I don’t know if it would be useful, because maybe other people don’t have this long instance initialization occurring for each sentence. Maybe you guys could try adding a simple timing function around the instance initialization in phonemize (lines 154 to 161) and see how long it takes for each inference.
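The initialize-once idea above can be sketched as a small cache. This is just a sketch of the pattern: `get_backend` and `_BACKENDS` are made-up names, and `factory` stands for whatever actually builds the expensive phonemizer backend for a given language:

```python
# Cache of already-built backends, keyed by language code, so the
# expensive initialization happens once instead of once per sentence.
_BACKENDS = {}

def get_backend(language, factory):
    """Build the phonemizer backend for `language` on first use and
    reuse it for every later call."""
    if language not in _BACKENDS:
        _BACKENDS[language] = factory(language)
    return _BACKENDS[language]
```

A text2phone-style function would then call `get_backend("en-us", factory).phonemize(...)` (or similar, depending on the backend’s API) and pay the 0.3 s setup cost only on the first sentence.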