Any step by step how-to/documentation on synthesizing with a pre-trained model?

I’ve looked for the last couple of hours but haven’t had much success. I tried running synthesize.py with config_tacotron2.json and a downloaded release model, but it crashed and burned with “Error(s) in loading state_dict”.

I’m literally taking guesses at how to run this. What documentation have I missed?
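In the meantime, a quick way to see what that error is actually complaining about is to diff the parameter names in the checkpoint against a freshly constructed model. This is just a sketch, and it assumes the checkpoint nests its weights under a “model” key, as Mozilla TTS checkpoints typically do:

```python
import torch

def diff_state_dict(model, ckpt_path):
    """Print which parameter names differ between a model and a checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # assumption: weights nested under "model"
    model_keys = set(model.state_dict().keys())
    ckpt_keys = set(state.keys())
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys))
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys))
```

Names showing up in either list usually mean the release model was trained from a different revision of the code than the one you are running.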

I’ve only gotten it to work with a pretrained WaveRNN + Tacotron2, per the links here: https://github.com/erogol/WaveRNN
For Mozilla TTS I had to “git checkout 8a47b46”,
and for WaveRNN I used “git checkout 8a1c152”.
Then I followed one of the older notebooks on GitHub and came up with this:
https://gist.github.com/anotherdirtbag/a03a8826ddf04bed0f1433bbc84032b0

There are a few issues with the model I’ve got working, though: the first second of each sentence is cut off, there’s some popping noise, and there are unpredictable silences in some cases.

Edit: Frankly, this guy’s work sounds better than mine, and he didn’t even use WaveRNN: My progress on expressive speech synthesis

A few pointers for using pretrained models or training your own. I’ve probably made all of these mistakes a few times myself.

  1. Make sure that WaveRNN and TTS are configured with the same sampling rate.
  2. Make sure that the following parameters match between the two: symmetric_norm, signal_norm, max_norm, clip_norm. In general, always compare configuration files for WaveRNN and TTS to make sure they are compatible.
  3. For WaveRNN, make sure that the product of upsample_factors matches the hop length. For example, at a 16 kHz sampling rate with frame_shift_ms = 12.5, the hop length is 200 samples, and the product of upsample factors 5 × 5 × 8 = 200, which is correct (see the sketch after this list).
  4. Silence trimming should be on if the dataset has silence at the beginning of clips.
  5. Both WaveRNN and Tacotron pretrained models often have config files that are incompatible with the current source code (and I think some of the config files checked into the repos are too).
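To make points 1-3 concrete, here is a minimal sketch of a config compatibility check. The paths and key names ("audio", "sample_rate", "frame_shift_ms", "upsample_factors") are assumptions based on the usual layout of the TTS and WaveRNN JSON configs; adapt them to your own files:

```python
import json
import math

# Hypothetical paths: point these at your actual config files.
with open("tts_config.json") as f:
    tts = json.load(f)
with open("wavernn_config.json") as f:
    voc = json.load(f)

# Points 1-2: these audio parameters must match between the two models.
# Assumption: both configs keep them under a top-level "audio" section.
for key in ("sample_rate", "symmetric_norm", "signal_norm", "max_norm", "clip_norm"):
    a, b = tts["audio"].get(key), voc["audio"].get(key)
    if a != b:
        print(f"mismatch on {key}: TTS={a} WaveRNN={b}")

# Point 3: the product of the vocoder's upsample factors must equal the
# hop length, e.g. 16000 Hz * 12.5 ms = 200 samples = 5 * 5 * 8.
sample_rate = tts["audio"]["sample_rate"]
hop_length = int(sample_rate * tts["audio"]["frame_shift_ms"] / 1000)
upsample_factors = voc.get("upsample_factors", [5, 5, 8])  # assumed key name
assert math.prod(upsample_factors) == hop_length, (upsample_factors, hop_length)
```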

For WaveRNN training: subjective quality continues to improve even after the loss has stopped decreasing. I usually need about 100K steps with 10-bit output before the speech sounds “good”. Mixture-of-logistics output probably needs even more steps.
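For context on the bit depth: WaveRNN’s discretized output predicts one of 2^bits amplitude classes per sample, typically after mu-law companding (the mu-law step is an assumption about the particular setup here). A small illustration of what 10 bits corresponds to:

```python
import numpy as np
import librosa

bits = 10
mu = 2**bits - 1  # mu = 1023 gives 2**10 = 1024 output classes
x = librosa.tone(440, sr=16000, length=16000)  # one-second test tone
q = librosa.mu_compress(x, mu=mu, quantize=True)  # integers in [-512, 511]
print(q.min(), q.max(), len(np.unique(q)))
```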

Cut-off sentences and unpredictable silences are probably symptoms of attention not working well, especially if they happen in longer sentences. Attention is always tricky in Tacotron models. The current TTS dev branch has scheduled training, which works really well for Tacotron.
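A common workaround for attention falling apart on long inputs (general practice, not specific to this repo) is to split the text into sentences and synthesize them one at a time, e.g. with a naive splitter like:

```python
import re

def split_sentences(text):
    """Naively split on sentence-final punctuation so each chunk
    stays short enough for tacotron attention to track."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("First sentence. Second one! A third?"))
# -> ['First sentence.', 'Second one!', 'A third?']
```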


Thank you, that’s really helpful info. I’m just getting started with this and was unsuccessful in my first and only training attempt (hence looking for pretrained models). I’ll try training again with newer source code and see if I can figure it out.