I checked to make sure:
- I'm using one GPU (RTX 2080). The model takes around 1 GB of GPU RAM, and inference draws around 60 W of power.
- If we split the process into TTS / vocoder, the TTS takes 98% of the time (around 0.3 RTF) and the vocoder only 2% (0.006 RTF). MB-MelGAN is really fast!
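For reference, the RTF (real-time factor) numbers above are just processing time divided by audio duration. A minimal sketch of that computation, with hypothetical timings (not the exact measurements from my run):

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: time spent processing divided by the duration of the audio produced."""
    return processing_seconds / audio_seconds

# hypothetical example: 1.5 s of compute to synthesize 5 s of audio
print(rtf(1.5, 5.0))  # 0.3
```

An RTF below 1.0 means the pipeline runs faster than real time.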
Looking closer at the TTS part, Tacotron2 (or SpeedySpeech, or Glow-TTS) takes around 30% of the time, and the phonemize() method takes 70%. That seemed weird to me, so I did some more profiling:
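The per-function breakdown came from the standard cProfile module. Here is a sketch of how to get it; the `phonemize` function below is just a placeholder standing in for the real phonemize() in the TTS pipeline:

```python
import cProfile
import io
import pstats

def phonemize(text):
    # placeholder standing in for the real phonemize() call in the pipeline
    return text

pr = cProfile.Profile()
pr.enable()
phonemize("hello world")
pr.disable()

# print the 10 most expensive calls by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
print(s.getvalue())
```

Sorting by cumulative time is what surfaces the phonemize() call as the dominant cost, since it includes time spent in everything it calls (e.g. the espeak invocation).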
The espeak processing is quite slow, and I don't understand exactly why at the moment. I used the Python script given in this thread to profile the time spent by espeak alone on sentences. The result is that it takes between 1 and 2 milliseconds (for sentences of 20 to 600 characters respectively). So espeak itself should only account for 0.001 to 0.005 RTF on my machine.
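For anyone who wants to reproduce the espeak-only measurement, here is a rough sketch of the kind of timing I did (not the exact script from the thread). It assumes `espeak` is on the PATH and times the whole subprocess call, so it includes process-spawn overhead, which may itself be part of the slowness:

```python
import shutil
import subprocess
import time

def time_espeak(text, runs=10):
    """Average wall-clock seconds for one espeak phonemization of `text`."""
    t0 = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            ["espeak", "-q", "-x", text],  # -q: no audio output, -x: print phonemes
            capture_output=True,
            check=True,
        )
    return (time.perf_counter() - t0) / runs

if shutil.which("espeak"):
    print(f"{time_espeak('hello world') * 1000:.2f} ms per call")
```

If the per-call number here is much larger than the 1-2 ms measured in-process, that would point at subprocess-launch overhead rather than espeak's own phonemization work.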
I will let you know when I find the reason for this slowness in the espeak processing!