Tacotron 1 + LibriTTS multispeaker model

Just thought I’d share some simple observations; I’d enjoy hearing from others to see whether what I’m seeing makes sense.

These are some of the better samples from a TT1 model (no batch norm, no forward attention, using phonemes) trained on a random 500-speaker subset of LibriTTS:
soundcloud link

Some observations:
-TT1 > TT2 for any configuration I try: convergence is much quicker and the quality is clearly much better.

-Some speakers sound pretty good, others sound really “borgy”. More male speakers sound borgy than female ones (I set mel_fmin to 50.0; I’m not sure how to set it for a mixed-sex dataset). I’m still not totally sure what contributes the most to the “borginess”.

-I used to wonder whether Griffin-Lim was my problem, but some of the samples sound reasonably good. I’m sure they could be improved with better vocoding, but I don’t think better vocoding can ‘rescue’ models that sound worse than the samples I currently have (see the quick Griffin-Lim check right below this list).
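If you want to hear how much of the degradation is just Griffin-Lim itself, a quick librosa-only round trip on a ground-truth recording works as a sanity check. This is only a sketch, separate from the TTS repo’s own audio pipeline; the STFT settings and fmax are illustrative, and the fmin is just the 50.0 I mentioned above.

```python
# Rough Griffin-Lim sanity check with librosa (not the TTS repo's own pipeline).
# Round-trips a real recording through a mel spectrogram and back, so you can
# hear how much "borginess" comes from the Griffin-Lim inversion alone.
import librosa
import soundfile as sf

wav, sr = librosa.load("sample.wav", sr=22050)  # any LibriTTS utterance

# Mel spectrogram with the fmin I mentioned above (50 Hz); fmax here is a guess.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80,
    fmin=50.0, fmax=7600.0,
)

# Invert back to audio; mel_to_audio runs Griffin-Lim internally.
recon = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60,
    fmin=50.0, fmax=7600.0,
)

sf.write("sample_gl.wav", recon, sr)
```

If the round-tripped ground truth already sounds “borgy”, Griffin-Lim is at least part of the problem; if it sounds clean, the model output is the bigger factor.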

Have people been able to do much better with multispeaker? I believe the average amount of audio per speaker in LibriTTS is fairly short (~21 min IIRC), but I haven’t been able to get high quality across all speakers. I’ve tried with TT2 and with the universal WaveRNN vocoder without any luck so far (I’m currently trying to train my own WaveRNN model).

It is very interesting that you get better results with Taco and not Taco2; my experience is the exact opposite, both single- and multispeaker. That said, I use @edresson1’s fork, which is multispeaker-oriented and includes GST support for Taco2 as well. However, I can see why Taco might give you better results much quicker (it’s a smaller network).

I wonder why you disabled forward attention; I would bet enabling it would improve performance. I would also totally recommend checking out Edresson’s fork if you’re very serious about multispeaker.

The per-speaker duration shouldn’t be a problem, because the dataset as a whole is large and the alignments are learned across all speakers in the end; as long as the overall vocabulary coverage is large, it is fine.

Vocoding may help, but not always. In short, for multispeaker, if you want good results, what matters is a good embedding representation, a reasonably large dataset with good recording quality (at least on the order of VCTK, which has 109 speakers and roughly 44 hours of speech), maybe GST, and a vocoder. For the vocoder I would recommend a universal ParallelWaveGAN (which I am trying to train right now).

I swear I tried it with forward attention on and got a worse result… I’ll be sure to really check next time.

Taco with forward attention is also what got me bad results, so that checks out. I would suggest getting familiar with Edresson’s fork; it’s multispeaker-oriented :grinning:

Do all your speakers “sound good” with that fork? I.e. do any sound “borgy”?

They sound okay. I am using a private dataset though.

Cool. I’m training on LibriTTS, where each speaker only has a small amount of data (~20 min), and it ends up sounding pretty bad. I’m trying to understand whether it’s the data quality or the training params. If anyone else can share their multispeaker LibriTTS experience/results to compare against, that would be awesome.

My suggestion would be to try training a new model on Edresson’s fork, using Taco2 and GST. There is no reason why LibriTTS itself should be the culprit, because it is a big dataset. If you read this paper, it is clear that the speaker encoder plays a big role in the quality of the representations, and Edresson has changed the way the speaker embeddings are handled. I should also say that, using his fork, I am able to get much better alignments much sooner.

I will try to train the LibriTTS PWGAN soon, so you can try it too :grinning:

Just to be sure I’m interpreting this correctly: you believe that (in principle) we should not get “borgy” speakers when training on LibriTTS? And do you believe that some of the better speakers in the SoundCloud link in the OP are probably of reasonable quality (they do sound like it to me)?

My guess is that it should definitely be possible, but there could be a lot of reasons. It is a fact that some voices are harder to model than others, and maybe the way the embeddings are learnt doesn’t suit them. In terms of vocabulary coverage, LibriTTS should be very extensive, so even a voice that sounds worse should still produce an aligned output.

The paper I mentioned above highlights that they learn the embeddings by sampling; that is, they extract a speaker embedding from every utterance of a given speaker and then average them. I guess this might help in getting a wider representation of the voice. With the small dataset I have, I haven’t really noticed this problem, and for any voice in the dataset the model was able to synthesize speech.

But I should note that I used a vocoder, and neural vocoders are known to handle spectrogram representations well. Specifically, I tried synthesizing a particular voice with both Griffin-Lim and WaveRNN, and I do remember that the Griffin-Lim quality was much worse than I expected. So there are a lot of factors at play.
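To make the averaging step concrete, it amounts to something like this minimal numpy sketch; `compute_embedding` is a hypothetical placeholder for whatever speaker encoder you use, not a function from the TTS repo or the paper’s code.

```python
import numpy as np

def average_speaker_embedding(utterance_wavs, compute_embedding):
    """Average one speaker's per-utterance embeddings into a single vector.

    utterance_wavs: list of waveforms belonging to ONE speaker.
    compute_embedding: hypothetical callable mapping a waveform to a 1-D embedding.
    """
    # One embedding per utterance of this speaker.
    embeddings = np.stack([compute_embedding(wav) for wav in utterance_wavs])
    # Average them into a single, hopefully broader, representation of the voice.
    mean = embeddings.mean(axis=0)
    # Optional: re-normalize, since d-vector style embeddings are usually unit-length.
    return mean / np.linalg.norm(mean)
```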