My Success with Mozilla TTS

Dear All,
I wanted to show off my results with Mozilla TTS and ask if any of you have ideas for improving the following:

  • clarity of voice (this one is a bit dull)
  • noise removal (clapping, mic hum, etc.)
  • reverberation removal (the training data contained a lot of reverb)
Best
Peanut

Can you tell us more about the data and the training? Without knowing more details it is hard to help.

P.S. Would you mind referencing our project under the video?

Hi Erogol,

I’ve added the project link to the video’s description, and will do so on all further videos. Thanks for reminding me to give credit where it is due. I may even do a video where some celebrity introduces the Mozilla-TTS project.

As for training, I implemented Attentron and a speaker-encoder loss (using a pretrained model from https://github.com/mozilla/TTS/wiki/Released-Models ).
I trained a model on VCTK and fine-tuned it on data from individual speakers, mostly data crawled from public speeches and talks. I use the PWGAN from the TTS repo as the vocoder.
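
To make the speaker-encoder loss concrete, here is a rough sketch of the idea (not my exact code): a frozen, pretrained speaker encoder embeds both the ground-truth and the generated mel spectrograms, and a cosine-similarity term pulls the two embeddings together. All names below are placeholders.

    # Rough sketch of the speaker-encoder loss idea; all names are placeholders.
    import torch
    import torch.nn.functional as F

    def speaker_consistency_loss(speaker_encoder, mel_pred, mel_target):
        # speaker_encoder: frozen, pretrained speaker encoder
        # mel_pred / mel_target: generated and ground-truth mel spectrograms
        with torch.no_grad():
            target_emb = speaker_encoder(mel_target)   # reference speaker embedding
        pred_emb = speaker_encoder(mel_pred)           # embedding of the synthesized mel
        # 1 - cosine similarity: 0 when the embeddings match perfectly
        return 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()

    # In the training loop (alpha is a hypothetical weighting factor):
    # loss = decoder_loss + postnet_loss + alpha * speaker_consistency_loss(enc, mel_out, mel_gt)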

Compared to the Vocal Synthesis channel, my audio lacks quality: their model speaks more clearly, and both the audio quality and the pronunciation are better.
I simply do not understand how Vocal Synthesis is able to render so many speakers with such high quality. They cannot have gotten large, high-quality data for all of them.

Any guesses? I myself had no tremendous success with few-shot synthesis; the best results came from fine-tuning my model to individual speakers.


How large are your individual datasets?
Is Attentron something different from Tacotron? If so, can you provide a link?
Can you share the config for your Attentron model?
My guess is that they train Tacotron2 on each speaker individually with a better dataset.
Try the WaveGrad vocoder. It’s supposed to work better.

P.S. I really like the Trump poem :slight_smile:

My audio config is below. The rest of the config is heavily modified; let me know if you want to know specific parameters.
The datasets range in size. I find about 30 minutes is absolutely necessary, and results start to get good at >2 hours.

I’ll give WaveGrad a try sometime, thanks for the recommendation.
As for the Vocal Synthesis channel: I’m still baffled by how they got such a large set of data in such high quality. Do you know of techniques to (semi-)automate this task? Are there tools for data cleansing?

And here is the Attentron paper: https://hyperconnect.github.io/Attentron/

    "audio": {
    "num_mels": 80,
    "mel_fmin": 50.0,
    "mel_fmax": 7600.0,
    "spec_gain": 20.0,

    // stft
    "fft_size": 2048,
    "win_length": null, //1024,
    "hop_length": null, //256,
    "frame_length_ms": 64.0,
    "frame_shift_ms": 16.0,

    // audio
    "sample_rate": 16000,
    "preemphasis": 0.97, // first order filter to mitigate difference in power between frequencies
    "ref_level_db": 20,

    // silence trimming - DONT USE
    "do_trim_silence": false,
    "trim_db": 60,

    // griffin lim
    "power": 1.5,
    "griffin_lim_iters": 60,

    // Normalization parameters
    "do_sound_norm": true,   // normalize the wav values in range [0, 1]. Different from dividing by 2**15.
    "signal_norm": true,  // normalize the spec values in range [0, 1]
    "min_level_db": -100,
    "symmetric_norm": true,
    "max_norm": 4.0,
    "clip_norm": true
},
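
One note on the null values above: win_length and hop_length are left unset, so they get derived from frame_length_ms / frame_shift_ms and the sample rate, which gives back the commented-out 1024 / 256. Roughly:

    # Rough arithmetic only; matches the commented-out values in the config above.
    sample_rate = 16000
    frame_length_ms = 64.0
    frame_shift_ms = 16.0

    win_length = int(frame_length_ms / 1000.0 * sample_rate)   # 64 ms * 16 kHz = 1024 samples
    hop_length = int(frame_shift_ms / 1000.0 * sample_rate)    # 16 ms * 16 kHz = 256 samples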

You can get random voice clips and use STT APIs to transcribe them. It is a hacky way to create a TTS dataset.
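
Something like this rough sketch (untested, just to show the idea): split long recordings on silence, transcribe each chunk, and write an LJSpeech-style metadata file. DeepSpeech is only one example STT backend, and the file names are placeholders.

    # Rough, untested sketch: build a TTS dataset from a long recording with an STT model.
    import csv
    import numpy as np
    from pydub import AudioSegment, silence
    from deepspeech import Model

    stt = Model("deepspeech-0.9.3-models.pbmm")           # placeholder model path
    stt.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    audio = (AudioSegment.from_file("speech.wav")         # placeholder input file
             .set_frame_rate(16000).set_channels(1).set_sample_width(2))
    chunks = silence.split_on_silence(audio, min_silence_len=400, silence_thresh=-40)

    with open("metadata.csv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for i, chunk in enumerate(chunks):
            wav_name = f"clip_{i:04d}.wav"
            chunk.export(wav_name, format="wav")
            samples = np.array(chunk.get_array_of_samples(), dtype=np.int16)
            text = stt.stt(samples)
            if text.strip():                               # drop chunks the STT cannot read
                writer.writerow([wav_name, text])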

To try the released WaveGrad model you need to change your audio config to be compatible. So check out one of the latest released model configs to see what good values are.
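
As a sanity check you can also diff the two configs programmatically. The key list below is just my guess at the fields that usually need to agree between the TTS model and the vocoder, and the file names are placeholders.

    # Rough sketch: compare the audio sections of two Mozilla TTS configs.
    import json
    import re

    def load_audio_config(path):
        # Mozilla TTS configs allow // comments, so strip them before parsing.
        with open(path) as f:
            text = re.sub(r"//.*", "", f.read())
        return json.loads(text)["audio"]

    tts_audio = load_audio_config("config_tts.json")        # placeholder: your model's config
    voc_audio = load_audio_config("config_wavegrad.json")   # placeholder: released vocoder config

    # Fields that (in my experience) usually have to match between model and vocoder.
    shared_keys = ["sample_rate", "fft_size", "hop_length", "win_length",
                   "num_mels", "mel_fmin", "mel_fmax"]

    for key in shared_keys:
        if tts_audio.get(key) != voc_audio.get(key):
            print(f"mismatch in {key}: {tts_audio.get(key)} vs {voc_audio.get(key)}")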

It would also be better to see the whole config if it is not private. Or you can send it to me as a DM if you wish.

I remember the paper. Actually, I was planning to implement it but couldn’t find the time. If you like, we can work on it together to push it upstream.

I even created an issue, but nobody picked it up. (It is hard to get real contributions from people :))

I find that the PWGAN works well; however, some hissing creeps into the audio when it is trained too long (>700k steps). I believe some other people have also noted this.
Also, it is a bit metallic and requires fitting to the individual speaker for best results.
I might try WaveGrad some time…

If I find the time to integrate Attentron into Mozilla TTS, I’ll let you know. Thanks.

Thanks for the idea of using STT to create a dataset. I’m already doing this; however, it discards a significant portion of the data, which is not optimal for speakers without an abundance of recordings.

I’m also interested in few- and zero-shot TTS. Attentron is supposed to provide that, but for me it’s not working: find attached a recording where I pre-trained my system on VCTK and then did zero-shot synthesis (i.e. no gradient updates) of Joe Biden. While it has some vague similarity, the model can’t even speak straight.

Has anybody had real success with Google’s more advanced, autoencoder-based models? Such as this?