What are the TTS models you know to be faster than Tacotron?

georroussos · November 7, 2020, 7:56pm

Tacotron2 can achieve impressive results and the benchmarking with LJSpeech does not really show this. With my dataset, which is far from TTS oriented, but has no background noise and completely matching transcriptions, I am able to synthesise speech of up to 5000 characters with minimal to no errors. My goal here is to make my TTS sound as natural as I can.

The secret to not being overwhelmed is to take it slow and try everything.

gmtsehayneh · January 22, 2021, 10:48am

Dear All,

I am from Ethiopia and am doing my MSc research on TTS for one of the ancient Ethiopian languages, Geez.

I have had trouble getting a GPU to train the recent TTS models, and I see that TransformerTTS is the fastest TTS. Can I train and use TransformerTTS on a CPU only, so that I can use it for my research work?

gmtsehayneh · January 22, 2021, 10:50am

Hi All,

I am from Ethiopia and am doing my MSc research on TTS for one of the ancient Ethiopian languages, Geez.

I have had trouble getting a GPU to train the recent TTS models, and I see that VocGAN is the fastest TTS. Can I train and use VocGAN on a CPU only, so that I can use it for my research work?

nmstoker · January 22, 2021, 12:35pm

Hi @gmtsehayneh

Welcome to the forum

Your question is probably best directed to the developers of the repo you link to, which as far as I know are not associated with the TTS repo here. Likewise for the similar message you posted directly after too.

As a more general point regarding your GPU comment: if you don’t have direct access to a GPU, you may want to look into Google Colab. It’s free, but there are some additional challenges to work around, as they only let the kernel run for 12 hours (so you’d need to save checkpoints before it expires, so that you can continue when you restart). It’s best to Google for details, as that is somewhat off topic here as well.
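A minimal sketch of the checkpoint-and-resume idea mentioned above, using only the Python standard library. The `state` dictionary layout, the `train_one_step` stand-in and the 10-minute save interval are hypothetical; a real TTS trainer would save model and optimizer state with its framework's own tools.

```python
import os
import pickle
import tempfile
import time


def save_checkpoint(state, path):
    """Write the state atomically so an interrupted save cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}


def train(path, total_steps, save_every_s=600.0, train_one_step=lambda state: None):
    """Resume from `path` if possible and save again every `save_every_s` seconds."""
    state = load_checkpoint(path)
    last_save = time.monotonic()
    while state["step"] < total_steps:
        train_one_step(state)  # hypothetical stand-in for the real training step
        state["step"] += 1
        if time.monotonic() - last_save >= save_every_s:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # always save once more at the end
    return state
```

On a hosted notebook the checkpoint path would need to point at storage that survives the session, otherwise the file disappears together with the kernel.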

All the best with the MSc!

erogol · January 26, 2021, 12:48pm

Just a note under this thread: I implemented SpeedySpeech, and together with Multiband-MelGAN it provides the fastest TTS inference to my knowledge.

TheDayAfter · January 26, 2021, 3:07pm

Once available, I think you can add SpeedySpeech + HiFi-GAN to the list of fast TTS inference options.

Although it covers TFLite, this is a good comparison: https://github.com/tulasiram58827/TTS_TFLite

erogol · January 26, 2021, 4:15pm

I didn’t know that repo. Thx for linking it… looks interesting.

Kirian · January 28, 2021, 11:22pm

Thanks for this useful information about inference speed.

@erogol and others, we’d like to train (and release) models for many languages that can run about 100x realtime on a fast GPU. We use these models in our software to help language learners (language learning with Netflix). After reading this thread, the promising options seem to be FastSpeech2/GlowTTS/Speedyspeech + MB-Melgan/Hifi-GAN. Do you have any more specific advice for us?

Btw, the SpeedySpeech repo looks promising. @erogol, when you say you implemented it, does that mean it’s somewhere in your Mozilla TTS repo? If not, would you like some help to do so?

erogol · January 29, 2021, 1:11am

Any model would run that fast on GPU. Even the largest one except WaveRNN.

Yes, SpeedySpeech is implemented in the TTS repo apart from the original repo. So you can give it a shot.

Kirian · January 29, 2021, 3:02pm

Thanks, I found it.

Did you pretrain any models for it (SpeedySpeech)?

Kirian · February 1, 2021, 6:06pm

Hi everyone, me again.

@erogol, you said that any model would run at 100 RTF on a modern GPU (RTF here meaning times faster than real time). However, after some benchmarks, I can’t get above 3 to 4 RTF with the available pre-trained mozilla-TTS models.

Hardware: 1 GPU, RTX 2080

Configurations tried:

tacotron2 + MB melgan : ~ 3 RTF
glowTTS + MB melgan : ~ 3 RTF
speedy-speech + MB melgan : ~ 4 RTF

Which combination of models reached this 100 RTF for you?

Thanks

erogol · February 1, 2021, 6:12pm

I don’t know how you benchmark, but the RTFs you shared are what I get on a CPU.

Kirian · February 2, 2021, 10:47am

I checked to make sure:

I’m using one GPU (RTX 2080). The model takes around 1 GB of GPU memory, and inference draws around 60 W of power.
If we split the process into TTS and vocoder, the TTS takes 98% of the time (around 0.3 RTF) and the vocoder only 2% (0.006 RTF). MB melgan is really fast!

Looking closer at the TTS, tacotron2 (or speedy speech or glow TTS) takes around 30% of the time, and the phonemize() method takes 70%. I thought that was weird, so I did more profiling:
The espeak processing is quite slow, and I don’t understand exactly why at the moment. I used the Python script given in this thread to profile the time spent by espeak alone on sentences. It takes between 1 and 2 milliseconds (for sentences of 20 to 600 characters, respectively). So espeak itself should only be responsible for 0.001 to 0.005 RTF on my machine.
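A per-stage breakdown like the one above can be collected with a small timer. This is only a sketch, not the script used in the post: the `StageTimer` class and the stage names in the usage note are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add the elapsed time even if the stage raised an exception.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Each step would be wrapped as `with timer.stage("phonemize"): ...`, `with timer.stage("acoustic_model"): ...` and so on, and `timer.shares()` then gives the percentages. When timing GPU work, the device has to be synchronized before the clock is read (in PyTorch, `torch.cuda.synchronize()`), because kernels are launched asynchronously and would otherwise look faster than they are.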

I will let you know when I find the reason for this slow espeak processing!

Kirian · February 2, 2021, 12:20pm

It looks like the phonemizer initialization is taking most of the time. This script comes from the phonemizer package.

This initialization occurs for each sentence to be predicted and was taking around 0.2 RTF on my machine!

I tried initializing it when the model is loaded and succeeded in going from ~0.35 RTF to ~0.15. That is more than twice as fast with just this trick.

The phonemize processing is now only taking 0.05 RTF, whereas tacotron2 is taking ~0.1 RTF, so tacotron2 is the bottleneck in this case. With speedy_speech, however, the phonemize processing is once again the bottleneck.

I will continue to dig into this phonemize code and optimize it.

BTW, has no one else had this heavy phonemizer initialization time problem?

nmstoker · February 3, 2021, 2:22am

Hi @Kirian - I haven’t noticed particular issues on my main PCs although I haven’t dug into it as you have.

I wonder if it’s worth having a play with some of the options? Looking here:

https://github.com/bootphon/phonemizer/blob/ae99a5432fa4261a50c26a1fa28c1795bd308017/phonemizer/backend/espeak.py

I see that there’s the option for you to set an environment variable for the espeak location. Maybe see if that helps.

One idea that occurs to me is that maybe it’s taking a while to find espeak on your system (e.g. if you’ve got a lot of locations in your path, or for some other reason). If so, setting the variable might give you a speed-up. This is just a guess; I haven’t tried it, but it would be something to rule out. The environment variable lets it skip the call to distutils.spawn.find_executable.
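A sketch of how that could be tried from Python. It assumes the variable is named `PHONEMIZER_ESPEAK_PATH`, which is what I believe the phonemizer 2.x backend reads; please check the linked file for the exact name. The `pin_espeak_path` helper and the order of the candidate binaries are hypothetical.

```python
import os
import shutil


def pin_espeak_path(candidates=("espeak-ng", "espeak")):
    """Resolve the espeak binary once and export its absolute path.

    Returns the path that was exported, or None if no candidate was found.
    """
    for name in candidates:
        path = shutil.which(name)  # searches PATH once, here
        if path:
            # Assumed variable name, see the note above.
            os.environ["PHONEMIZER_ESPEAK_PATH"] = path
            return path
    return None
```

This would need to run before the phonemizer is first used, so that the backend sees the variable and skips its own search.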

I agree it does seem a little inefficient that phonemizer is going through these checks for the binary to call each time, but I suppose that’s the cost of getting what it offers.

Maybe it’s feasible to strip it out and go direct to espeak/espeak-ng for greater efficiencies but that then puts more complexity within TTS directly and it probably makes sense to see further analysis before considering that (eg if others actually do have the same effect as you found and it’s just not noticed by them so far).

Kirian · February 3, 2021, 2:17pm

Thanks @nmstoker for the advice.

I’ve just tried hard-coding the path to espeak on the machine and nothing changed: the instance initialization still takes around 0.3 seconds. I made sure it’s not a problem in mozilla-tts by running some timeit measurements in a separate script that only calls the phonemize method from the phonemizer.

I think an easy fix is to propose a modification to the phonemizer library: instead of exposing just a phonemize function, the library should give direct access to a class (Phonemizer, for instance) that we can initialize once and then call a phonemize method on.

For mozilla-TTS, this would mean that when the TTS model is loaded and set up, an instance of Phonemizer is created and made available globally. Then text2phone could call Phonemizer.phonemize.
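The proposal could look roughly like the sketch below. Everything here is hypothetical: the `Phonemizer` class is a stand-in with a simulated start-up cost, not the phonemizer library's API, and `load_model` / `text2phone` only mimic the roles described above.

```python
import time


class Phonemizer:
    """Hypothetical wrapper: pay the backend start-up cost once, then reuse it."""

    init_count = 0  # counts how often the expensive set-up ran

    def __init__(self, language, startup_cost_s=0.3):
        time.sleep(startup_cost_s)  # stand-in for locating and checking espeak
        Phonemizer.init_count += 1
        self.language = language

    def phonemize(self, text):
        # Stand-in for the real grapheme-to-phoneme call.
        return f"[{self.language}] {text.lower()}"


_PHONEMIZER = None  # shared instance, created when the model is loaded


def load_model(language):
    """Create the shared Phonemizer as part of model loading."""
    global _PHONEMIZER
    _PHONEMIZER = Phonemizer(language)


def text2phone(text):
    """Per-sentence call that reuses the instance created at load time."""
    if _PHONEMIZER is None:
        raise RuntimeError("call load_model() first")
    return _PHONEMIZER.phonemize(text)
```

With this layout the start-up cost is paid once in `load_model`, and every later `text2phone` call only pays for the phonemization itself.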

I don’t know whether it would be useful, because other people may not have this long instance initialization occurring for each sentence. Maybe you could add a simple timing function around this instance initialization in phonemize (lines 154 to 161) and see how long it takes for each inference.

Pak · February 10, 2021, 1:15pm

Has anyone successfully used Mozilla Tacotron output with HiFi-GAN?

mrthorstenm · February 10, 2021, 1:20pm

If I’m not mistaken, @sanjaesc succeeded with that, based on a HiFi-GAN model from another repo but with compatible Taco2 mel spectrograms.

snakers41 · April 2, 2021, 5:52am

Our TTS models can run on one CPU thread / core decently

Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article https://habr.com/ru/post/549482/)

Just let me repost some of the benchmarks here:

RTF (Real Time Factor) - time the synthesis takes divided by audio duration;
RTS = 1 / RTF (Real Time Speed) - how much the synthesis is “faster” than realtime;
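As a small worked example of these two definitions (the function names are just for illustration):

```python
def audio_seconds(num_samples, sample_rate):
    """Duration of the generated audio in seconds."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, audio_duration_seconds):
    """RTF: synthesis time divided by audio duration (lower is faster)."""
    if audio_duration_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_duration_seconds


def real_time_speed(rtf):
    """RTS: 1 / RTF, i.e. how many times faster than real time."""
    return 1.0 / rtf


# 0.7 s of compute for 16000 samples at 16 kHz (1 s of audio):
rtf = real_time_factor(0.7, audio_seconds(16000, 16000))
rts = real_time_speed(rtf)
```

Here `rtf` is 0.7 and `rts` is about 1.4, so an RTF below 1 means the synthesis is faster than real time.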

We benchmarked the models on two devices using Pytorch 1.8 utils:

CPU - Intel i7-6800K CPU @ 3.40GHz;
GPU - 1080 Ti;
When measuring CPU performance, we also limited the number of threads used;

For the 16 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |

For the 8 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |

TheDayAfter · April 2, 2021, 11:19am

With regards to the German Silero TTS model:

Pros:

easy to install
good overall quality
roughly real-time inference

Cons:

no handling of numbers; they are simply omitted
issues with longer sentences: inference just stops (might be related to the warning that a sentence has more than 140 chars) or gets worse at the end of longer