What are the TTS models you know to be faster than Tacotron?

Hi All,

I am from Ethiopia. I am working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.

I have a problem getting a GPU to train the recent TTS models, and I see VocGAN is the fastest TTS. So can I train and use VocGAN using a CPU only, so that I can use it for my research work?

Hi @gmtsehayneh

Welcome to the forum :slightly_smiling_face:

Your question is probably best directed to the developers of the repo you link to, which as far as I know is not associated with the TTS repo here. Likewise for the similar message you posted directly afterwards.

As a more general point regarding your GPU comment: if you don’t have direct access to a GPU, you may want to look into Google Colab. It’s free, but there are some additional challenges you’d need to work around, as they only let the kernel run for 12 hours (so you’d need to save checkpoints before it expires, so you can continue from where you left off when you restart). Best to Google for details about that, as it’s somewhat off topic here as well.
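For example, here is a minimal save-and-resume sketch in PyTorch (the model and optimizer here are placeholders, not tied to any particular TTS repo):

import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pth"):
    # call this periodically during training, before the Colab session expires
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pth"):
    # call this in a fresh session to continue from the saved step
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]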

All the best with the MSc!


Just a note under this thread: I implemented SpeedySpeech, and together with Multi-Band MelGAN they provide the fastest TTS inference to my knowledge.

Once available, I think you can add SpeedySpeech + HiFi-GAN to the list of fast TTS inference options.

Though it refers to TFLite, this is a good comparison: https://github.com/tulasiram58827/TTS_TFLite


I didn’t know that repo. Thanks for linking it… looks interesting.

Thanks for this useful information about inference speed.

@erogol and others, we’d like to train (and release) models for many languages that can run about 100x realtime on a fast GPU. We use these models in our software to help language learners (language learning with Netflix). After reading this thread, the promising options seem to be FastSpeech2/GlowTTS/Speedyspeech + MB-Melgan/Hifi-GAN. Do you have any more specific advice for us? :slight_smile:

Btw, the SpeedySpeech repo looks promising. @erogol, when you say you implemented it, does that mean it’s somewhere in your Mozilla TTS repo? If not, would you like some help with that?

Any model would run that fast on a GPU, even the largest ones, except WaveRNN.

Yes, SpeedySpeech is implemented in the TTS repo, separately from the original repo. So you can give it a shot.

Thanks, I found it.

Did you pretrain any models for it (SpeedySpeech)?

Hi everyone, me again.

@erogol, you said that any model would run at 100x realtime on a modern GPU. However, after some benchmarks, I can’t get above 3 to 4x realtime (for the available pre-trained models of mozilla-TTS).

Hardware: 1× RTX 2080 GPU

Configurations tried (measured roughly as sketched below):

  • Tacotron2 + MB-MelGAN: ~3x realtime
  • GlowTTS + MB-MelGAN: ~3x realtime
  • SpeedySpeech + MB-MelGAN: ~4x realtime
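
For reference, this is roughly the kind of measurement behind these numbers (a minimal sketch; synthesize() is a placeholder for whichever TTS + vocoder pipeline is loaded, not an actual mozilla-TTS call):

import time

def benchmark_rtf(synthesize, text, sample_rate):
    # synthesize(text) is assumed to return a 1-D waveform (list/array of samples)
    start = time.perf_counter()
    wav = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / sample_rate
    return elapsed / audio_seconds, audio_seconds / elapsed  # (RTF, x realtime)

# dummy stand-in just to show the call shape; replace with the real pipeline
fake_synthesize = lambda text: [0.0] * 22050
print(benchmark_rtf(fake_synthesize, "Hello world.", 22050))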

What combination of models did you use to reach 100x realtime?

Thanks

I don’t know how you benchmark, but the numbers you shared are what I get on a CPU.

I checked to make sure:

  • I’m using one GPU (RTX 2080). The model takes around 1 GB of GPU memory, and inference draws around 60 W of power.
  • If we split the process into TTS / vocoder, the TTS model takes 98% of the time (around 0.3 RTF) and the vocoder only 2% (0.006 RTF). MB-MelGAN is really fast!

If we look closer at the TTS part, Tacotron2 (or SpeedySpeech or GlowTTS) takes around 30% of the time, and the phonemize() method takes 70% of the time. That seemed weird to me, so I am doing more profiling:
The espeak processing is quite slow, and I don’t get why exactly at the moment. I used the Python script given in this thread to profile the time spent by espeak alone on sentences. The result is that it takes between 1 and 2 milliseconds (for sentences of 20 to 600 characters respectively). So espeak itself should only be responsible for 0.001 to 0.005 RTF on my machine.
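
If anyone wants to reproduce the espeak-only timing, here is a minimal sketch (it assumes espeak or espeak-ng is on the PATH; -q suppresses audio output and -x prints phoneme mnemonics):

import subprocess
import time

def time_espeak(text, runs=10):
    # average wall-clock time of calling the espeak binary on one sentence
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(["espeak", "-q", "-x", text],
                       capture_output=True, check=True)
    return (time.perf_counter() - start) / runs

print(time_espeak("Hello world, this is a short test sentence."))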

I will let you know when I find the reason for this slow espeak processing!

Looks like the phonemizer initialization is taking most of the time. This script comes from the phonemizer package.

This initialization occurs for each sentence to be predicted and was taking, on my machine, around 0.2 RTF!

I tried initializing it when the model is loaded, and managed to go from ~0.35 RTF to ~0.15. That’s more than twice as fast with just this trick.

The phonemize processing now only takes 0.05 RTF, whereas Tacotron2 takes ~0.1 RTF. Tacotron2 is then the bottleneck in this case. But with SpeedySpeech, the phonemize processing is once again the bottleneck.

I will continue digging into this phonemize stuff and optimizing it.

BTW, is no one else seeing this heavy phonemizer initialization time like I am?

Hi @Kirian - I haven’t noticed particular issues on my main PCs although I haven’t dug into it as you have.

I wonder if it’s worth having a play with some of the options? Looking here:

I see that there’s the option for you to set an environment variable for the espeak location. Maybe see if that helps.

One idea that occurs to me is that maybe it’s taking a while to find espeak on your system (e.g. if you’ve got a lot of locations in your path or for some other reason). If so, setting the variable might give you a speedup. This is just a guess, I haven’t tried it, but it would be something to rule out. The environment variable lets it skip reaching the call to distutils.spawn.find_executable.
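
If you want to try it, something like this before the phonemizer is first used should work. I believe the variable is PHONEMIZER_ESPEAK_PATH in the phonemizer 2.x releases, but do check the version you have; the path below is just an example:

import os

# point phonemizer straight at the espeak binary so it can skip the executable lookup
os.environ["PHONEMIZER_ESPEAK_PATH"] = "/usr/bin/espeak"  # example path, adjust for your system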

I agree it does seem a little inefficient that phonemizer is going through these checks for the binary to call each time, but I suppose that’s the cost of getting what it offers.

Maybe it’s feasible to strip it out and go direct to espeak/espeak-ng for greater efficiency, but that would put more complexity within TTS directly, and it probably makes sense to see further analysis before considering that (e.g. whether others actually see the same effect you found and just haven’t noticed it so far).

Thanks @nmstoker for the advice.

I’ve just tried hard-coding the path to espeak on the machine and nothing changed; the instance initialization still takes around 0.3 seconds. I’ve made sure it’s not a problem with mozilla-tts by running some timeit measurements in a separate script that only calls the phonemize method from phonemizer.

I think an easy way to fix it is to propose a modification to the phonemizer library: instead of only exposing a phonemize function, the library should give direct access to a class (Phonemizer, for instance) that we can initialize once and then call a phonemize method on.

In the case of mozilla-TTS, it would mean that when loading and setting up the TTS model, an instance of Phonemizer is created and made available globally. Then text2phone could call Phonemizer.phonemize.
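
Just to illustrate the idea, a minimal sketch built on phonemizer’s EspeakBackend (the Phonemizer wrapper and its names are made up here, and the exact backend API differs a bit between phonemizer versions):

from phonemizer.backend import EspeakBackend

class Phonemizer:
    """Sketch of the proposed wrapper: the expensive espeak setup happens once."""

    def __init__(self, language="en-us"):
        self._backend = EspeakBackend(language)

    def phonemize(self, text):
        # recent phonemizer versions expect a list of utterances
        return self._backend.phonemize([text], strip=True)[0]

# created once when the TTS model is loaded, then reused for every sentence
phonemizer = Phonemizer("en-us")
print(phonemizer.phonemize("Hello world."))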

I don’t know if it would be useful, because maybe other people don’t have this long instance initialization occurring for each sentence. Maybe you could try adding a simple timing function around this instance initialization in phonemize (lines 154 to 161) and see how long it takes for each inference.
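
Something as simple as this gives a rough number (here timing the espeak backend instantiation directly rather than patching phonemize itself):

import time
from phonemizer.backend import EspeakBackend

start = time.perf_counter()
EspeakBackend("en-us")  # the per-sentence instantiation being measured
print(f"backend init took {time.perf_counter() - start:.3f} s")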


Was anyone successful in using Mozilla Tacotron output with HiFi-GAN?

If I’m not completely wrong, @sanjaesc succeeded at that, based on a HiFi-GAN model from another repo, but with compatible Taco2 mel spectrograms.

Our TTS models can run decently on one CPU thread / core.

Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article https://habr.com/ru/post/549482/)

Just let me repost some of the benchmarks here:

  • RTF (Real Time Factor) - time the synthesis takes divided by audio duration;

  • RTS = 1 / RTF (Real Time Speed) - how much the synthesis is “faster” than realtime;

We benchmarked the models on two devices using PyTorch 1.8 utils:

  • CPU - Intel i7-6800K CPU @ 3.40GHz;

  • GPU - 1080 Ti;

  • When measuring CPU performance, we also limited the number of threads used;

For the 16KHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |

For the 8KHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |

With regards to the German Silero TTS model:

Pros:

  • easy to install
  • good overall quality
  • about real-time inference

Cons:

  • no handling of numbers, those are just omitted
  • issues with longer sentences: inference just stops (might be related to the warning that the sentence has more than 140 chars) or gets worse towards the end of longer sentences

Please note the above benchmarks - you can actually get ~5 RTS on a CPU; most likely the model just warms up during the first run.

As for the problems listed with the model:

no handling of numbers, those are just omitted

  • there is no text normalization middleware packaged with the models
  • the model just produces audio from text
  • it was not included by design

issues with longer sentences: inference just stops (might be related to the warning that the sentence has more than 140 chars) or gets worse towards the end of longer sentences

  • this is also by design
  • the model accepts sentences and it can work with batches
  • see these examples:

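# Example 1: feed the text to apply_tts one sentence at a time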
import torch
import torchaudio

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                                                      model='silero_tts',language=language,speaker=speaker)
model = model.to(device)  # gpu or cpu

example_text="нав+ерное, существ+уют друг+ие рец+епты, но я их не зн+аю. +или он+и мне не помог+ают. х+очешь моег+о сов+ета - пож+алуйста: сад+ись раб+отать. сл+ава б+огу, так+им л+юдям, как мы с тоб+ой, для раб+оты ничег+о не н+ужно кр+оме бум+аги и карандаш+а."

for i, text in enumerate(example_text.split('. ')):
  audio = apply_tts(texts=[text],
                    model=model,
                    sample_rate=sample_rate,
                    symbols=symbols,
                    device=device)
  torchaudio.save(f'test_{str(i).zfill(2)}.wav',
                  audio[0].unsqueeze(0),
                  sample_rate=16000,
                  bits_per_sample=16)
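
# Example 2: split the text into sentences and pass them all to apply_tts as a single batch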
import torch
import torchaudio

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                                                      model='silero_tts',language=language,speaker=speaker,
                                                                      force_reload=True)
model = model.to(device)  # gpu or cpu

example_text="нав+ерное, существ+уют друг+ие рец+епты, но я их не зн+аю. +или он+и мне не помог+ают. х+очешь моег+о сов+ета - пож+алуйста: сад+ись раб+отать. сл+ава б+огу, так+им л+юдям, как мы с тоб+ой, для раб+оты ничег+о не н+ужно кр+оме бум+аги и карандаш+а."
example_text = example_text.split('. ')

print(example_text)
audio = apply_tts(texts=example_text,
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)

FastSpeech and FastSpeech2. Specifically, while FastSpeech2 requires durations, those can be acquired from a forced aligner like MFA. I’ve gotten good results with it, and the RTF (for the previous 44.1KHz model, which I’ve made public) is about 0.083 on my R5 3600. I’ve also found it to be way more resilient against bad datasets; even some that would refuse to align well with Tacotron2 could produce good results.
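
In case it helps, a minimal sketch of how forced-aligner phone intervals (e.g. read from an MFA TextGrid) can be turned into per-phone durations in mel frames for FastSpeech2-style training; the sample rate, hop length and intervals below are made-up example values:

SAMPLE_RATE = 22050  # assumed audio sample rate
HOP_LENGTH = 256     # assumed hop length of the mel-spectrogram pipeline

def seconds_to_frames(t):
    # convert a time in seconds to a mel-frame index
    return int(round(t * SAMPLE_RATE / HOP_LENGTH))

def interval_durations(intervals):
    # intervals: list of (phone, start_sec, end_sec) taken from the aligner output
    return [(phone, seconds_to_frames(end) - seconds_to_frames(start))
            for phone, start, end in intervals]

# hypothetical alignment for a short utterance
print(interval_durations([("HH", 0.00, 0.08), ("AH", 0.08, 0.21), ("L", 0.21, 0.30)]))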