ForwardTacotron experience

I’ve been testing ForwardTacotron from this repo: and it works very well even with very long sentences. You can basically feed it whole paragraphs without getting the typical attention problems, such as stuttering.

ForwardTacotron replaces the attention mechanism of Tacotron with duration prediction from the FastSpeech paper. I believe the transformer network used in the FastSpeech paper is slow and produces subpar speech, but with a Tacotron-style network the speech quality is better and it’s really fast.
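The core idea can be sketched in a few lines. This is a hedged illustration of FastSpeech-style length regulation, not the actual ForwardTacotron code: each phoneme encoding is simply repeated by its predicted duration, which replaces the attention alignment entirely (and is why long paragraphs can’t cause attention failures).

```python
import numpy as np

def length_regulate(encodings, durations):
    """Expand (num_phonemes, channels) encoder outputs into a
    (total_frames, channels) decoder input by repeating each phoneme
    encoding `durations[i]` times -- no attention involved."""
    return np.repeat(encodings, durations, axis=0)

# Three phonemes held for 2, 3, and 1 frames respectively:
enc = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
frames = length_regulate(enc, [2, 3, 1])
print(frames.shape)  # (6, 2)
```

Because the frame count is fixed up front by the duration predictor, the decoder can never loop or skip the way an attention decoder can.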
@erogol you may want to test this for TTS.

Thx for the link. We sometimes meet with those guys.

Regarding the model,

Their pronunciation quality is not on par with what we have in TTS. Also, even though you can set the speech speed with that architecture, it does not give the same level of control over prosody, since it only predicts durations and says nothing about the level of emphasis per step.
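The limitation described here can be made concrete. In this illustrative sketch (not the TTS or ForwardTacotron API), speed control amounts to rescaling the predicted per-phoneme frame counts; there is no comparable knob for per-step emphasis or pitch.

```python
import numpy as np

def scale_durations(durations, speed=1.0):
    """Globally rescale predicted per-phoneme durations.

    speed > 1 speeds speech up, speed < 1 slows it down; every phoneme
    keeps at least one frame. This is the only prosody control a pure
    duration predictor offers."""
    scaled = np.round(np.asarray(durations, dtype=float) / speed)
    return np.maximum(scaled, 1).astype(int)

print(scale_durations([4, 6, 2], speed=2.0).tolist())  # [2, 3, 1]
print(scale_durations([4, 6, 2], speed=0.5).tolist())  # [8, 12, 4]
```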

Also, the latest TTS models have no attention problems either, even for long sequences.

Do you have any numbers comparing inference times? With my implementation below, I don’t really see a huge boost from that model.
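For anyone wanting to produce comparable numbers: a common metric is the real-time factor (synthesis wall-clock time divided by the duration of the generated audio; RTF < 1 means faster than real time). A minimal sketch, where the `synthesize` callable is a placeholder for whatever model you are benchmarking, not a real TTS API:

```python
import time

def real_time_factor(synthesize, text, sample_rate=22050, runs=3):
    """Best-of-N real-time factor for a text -> waveform callable."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        wav = synthesize(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / (len(wav) / sample_rate))
    return best

# Dummy synthesizer emitting one second of silence:
rtf = real_time_factor(lambda text: [0.0] * 22050, "hello world")
print(f"RTF: {rtf:.4f}")
```

Running the same harness over two models with the same vocoder and the same input texts gives an apples-to-apples comparison.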

I have a very similar model, but the implementation is not merged into TTS. It is here if you are interested.

Hello. I’m also very interested in any strategy that makes it more feasible to run a decent TTS system on limited hardware (for instance, ARM devices).

MS’s FastSpeech paper seems to be very decent.

I’ve listened to the samples, and one could argue that Mozilla’s demo sounds better. However, it takes too long to generate audio, and I’d rather have a not-so-great voice than no voice at all. The usual local open-source systems (espeak, MaryTTS, flite) sound horrible; FastSpeech is much better than that.

@geneing 1) The WaveRNN vocoder version sounds much better than the Griffin-Lim vocoder version. How do the generation times compare? I want to use it on a low-power device with only the CPU. Is that possible?

2) Is it possible to train another language using your implementation? How could I do it? Is there a step-by-step instruction on how to do it somewhere?

Thanks for your great work!

@erogol There should be some way of choosing ForwardTacotron (or other FastSpeech-inspired implementations) when using Mozilla’s TTS. This could pave the way to using it on low-power devices and voice assistants, and toward the “Open Web” and more privacy. Are there other people at Mozilla who are concerned about the high requirements of the current Mozilla TTS implementations?

Hi, there!
I just recently stumbled across the pretty fresh AlignTTS paper; they claim to achieve a 4.05 MOS (vs. 3.88 for ForwardTacotron) at almost the same speed, so I thought it might be interesting to you. I’m curious, what kind of devices do you have in mind? :slight_smile:

TTS’s requirements are not high relative to the other alternatives.

You can train a TTS model + vocoder with roughly 9 million parameters in total (Tacotron1 and ParallelWaveGAN), and these two can work in real time even on CPU machines (I did not try a Raspi, and it is out of scope for now).

ForwardTacotron does not give you a huge boost since it is a very large model, and the model in MS’s paper uses transformer modules, which are quite expensive to run. Being that large, it also has a bigger memory footprint.

I’d suggest comparing the models one-to-one before saying anything further. If you have already done that, please share your numbers.

And if you are willing, please don’t hesitate to contribute these new models to Mozilla TTS. I can help along the way, but TTS is mostly a one-man project and I only have 2 hands :slight_smile:


AlignTTS looks promising

At this point, some praise for a change: thank you very much for all the work you have done, and will do, to make a TTS system available “to everyone”. I would like to contribute in the future, but the learning curve is still steep :slight_smile:

I’ve tried to read the paper. It seems interesting, but I lack the background in this area. Do you know if there is code available? I couldn’t find any.

The devices I have in mind are a Raspberry Pi 3 or 4, or another ARM SoC like those.

I’m trying to learn how to do that, so Tacotron1 + ParallelWaveGAN could be a good bet. Do you have any thoughts or pointers on the high-level differences among WaveGlow, ParallelWaveGAN, and other vocoders? The only thing I gathered is that most don’t do inference in parallel (and a parallel implementation would give a bit of a boost).
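The sequential-vs-parallel distinction mentioned above can be shown with a toy contrast (illustrative only, not real vocoder code): WaveRNN-style models emit one sample at a time, each conditioned on the previous one, so the loop is inherently sequential; flow/GAN vocoders such as WaveGlow and ParallelWaveGAN instead apply one feed-forward transform to a whole noise vector, producing all samples at once.

```python
import numpy as np

def autoregressive_generate(n_samples, step):
    """WaveRNN-style: each sample depends on the previous one."""
    wav = np.zeros(n_samples)
    for t in range(1, n_samples):  # cannot be vectorized: wav[t] needs wav[t-1]
        wav[t] = step(wav[t - 1])
    return wav

def parallel_generate(noise, transform):
    """Flow/GAN-style: one batched transform over all samples at once."""
    return transform(noise)

seq = autoregressive_generate(5, lambda prev: 0.5 * prev + 1.0)
par = parallel_generate(np.zeros(5), lambda z: np.tanh(z))
print(seq.tolist())  # [0.0, 1.0, 1.5, 1.75, 1.875]
```

On CPUs and GPUs alike, the parallel formulation lets the hardware batch the work, which is where most of the speedup of ParallelWaveGAN-type vocoders comes from.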

What do you mean by comparing models one-to-one? Do you mean Tacotron1 vs. Tacotron2 vs. ForwardTacotron with the same vocoders?

Hi @kms - me too (i.e. I read the paper, didn’t find code, wrote to the authors… no response). I would like to implement AlignTTS. The paper isn’t that “self-explanatory”, I would say :smiley: So why not work on this together? It would be more fun, too! I think I have a rough idea of the method; only the Viterbi part isn’t entirely clear to me.

I have to add, though, that training might be tiring… a transformer architecture from scratch might take A LOT of computing resources (which I don’t have; I only own an RTX 2070).

What do you think? Or anyone else?

What about SqueezeWave? They have code available on GitHub, but I couldn’t even install the requirements because of some conflicting libraries.

@kms I am curious: could you share details of the specific requirements you had difficulties with? And on which platform? Based on your earlier comments it sounds like it may be Raspbian, but I wasn’t sure.

SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis

Just by reading the title (and the abstract :wink:) you can distinguish it from e.g. AlignTTS - SqueezeWave is a WaveGlow-style vocoder, whereas AlignTTS is, as the name says, a text-to-speech method, which makes use of a vocoder but, most importantly, generates the input for any vocoder (mel spectrograms or the like).
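To make the “input for any vocoder” point concrete: a mel spectrogram is just the magnitude STFT projected through triangular mel filters. A minimal numpy sketch of the filterbank, using the standard 2595·log10(1 + f/700) mel scale (the parameters below are typical values, not tied to any specific repo):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters mapping |STFT| bins to mel bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:   # rising edge of the triangle
            fb[i, left:center] = np.linspace(0.0, 1.0, center - left, endpoint=False)
        if right > center:  # falling edge
            fb[i, center:right] = np.linspace(1.0, 0.0, right - center, endpoint=False)
    return fb

fb = mel_filterbank(n_mels=80, n_fft=1024, sr=22050)
print(fb.shape)  # (80, 513)
# A mel frame would then be: mel = fb @ np.abs(stft_frame)
```

The TTS model predicts these 80-band frames from text; a vocoder (SqueezeWave, WaveRNN, ParallelWaveGAN, …) then inverts them back to a waveform, which is why the two components can be mixed and matched.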

Thus, thanks for the vocoder link - probably useful. But I was rather thinking of TTS :slight_smile: