A few pointers for using pretrained models or training your own. I have probably made all of these mistakes a few times myself.
- Make sure that WaveRNN and TTS are configured with the same sampling rate.
- Make sure that the following parameters match between the two: symmetric_norm, signal_norm, max_norm, clip_norm. In general, always compare the WaveRNN and TTS configuration files to make sure they are compatible (see the config-comparison sketch after this list).
- For WaveRNN, make sure that the product of upsample_factors matches the hop length. For example, at a 16 kHz sampling rate with frame_shift_ms = 12.5, the hop length is 16000 × 0.0125 = 200 samples, and upsample factors of (5, 5, 8) give 5 × 5 × 8 = 200, which is correct (see the hop-length check below).
- Silence trimming should be enabled if the dataset has silence at the beginning of clips.
- Both the WaveRNN and Tacotron pretrained models often ship with config files that are incompatible with the current source code (and I think the same is true of some config files checked into the repositories).
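As a quick sanity check for the points above, a small script can diff the audio-related fields of the two JSON configs. This is only a minimal sketch: the file paths and the exact set of keys are assumptions, so adjust them to match your own configs.

```python
import json

# Hypothetical paths -- point these at your actual TTS and WaveRNN configs.
TTS_CONFIG = "tts/config.json"
WAVERNN_CONFIG = "wavernn/config.json"

# Audio fields that must agree between the two models (assumed key names;
# check your config files for the exact spelling).
KEYS_TO_MATCH = [
    "sample_rate",
    "symmetric_norm",
    "signal_norm",
    "max_norm",
    "clip_norm",
]

def load_audio_section(path):
    """Load a config and return its audio section (or the whole dict if flat)."""
    with open(path) as f:
        config = json.load(f)
    return config.get("audio", config)

tts_audio = load_audio_section(TTS_CONFIG)
wavernn_audio = load_audio_section(WAVERNN_CONFIG)

for key in KEYS_TO_MATCH:
    tts_value = tts_audio.get(key)
    wavernn_value = wavernn_audio.get(key)
    status = "OK" if tts_value == wavernn_value else "MISMATCH"
    print(f"{status}: {key}: TTS={tts_value!r} WaveRNN={wavernn_value!r}")
```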
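Similarly, the upsample-factor constraint from the list above is easy to verify in a couple of lines. The numbers below reproduce the 16 kHz / 12.5 ms example from the list; the variable names are illustrative.

```python
import math

sample_rate = 16000           # Hz
frame_shift_ms = 12.5         # ms, from the TTS config
upsample_factors = (5, 5, 8)  # from the WaveRNN config

# Hop length is the frame shift expressed in samples.
hop_length = int(sample_rate * frame_shift_ms / 1000)  # 200

# The upsample network must expand one mel frame into exactly one hop of audio.
product = math.prod(upsample_factors)  # 5 * 5 * 8 = 200
assert product == hop_length, (
    f"upsample factors multiply to {product}, expected hop length {hop_length}"
)
print(f"OK: {upsample_factors} -> {product} samples per frame")
```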
For WaveRNN training, subjective quality continues to improve even after the loss function has stopped decreasing. I usually need about 100K steps with 10-bit output before the speech sounds "good". A mixture-of-logistics output probably needs even more steps.
Cut-off sentences and unpredictable silences are probably symptoms of the attention mechanism not working well, especially if it happens in longer sentences. Attention is always tricky in Tacotron models. The current TTS dev branch has scheduled training, which works really well for Tacotron (a sketch of such a schedule follows below).
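To make that last point concrete: scheduled (gradual) training typically lowers Tacotron's reduction factor r as training progresses, so the model first learns coarse alignment with several frames per decoder step and only then refines it. The schedule values below are purely illustrative assumptions, not the exact ones used in the TTS dev branch.

```python
# Illustrative schedule: (start_step, reduction_factor) pairs, coarse to fine.
# The actual values in the TTS dev branch config may differ.
GRADUAL_SCHEDULE = [
    (0, 7),        # start with 7 frames per decoder step: easy alignment
    (10_000, 5),
    (50_000, 3),
    (130_000, 2),
    (290_000, 1),  # end with 1 frame per step: best quality, hardest attention
]

def reduction_factor_for_step(global_step):
    """Return the reduction factor r to use at a given training step."""
    r = GRADUAL_SCHEDULE[0][1]
    for start_step, factor in GRADUAL_SCHEDULE:
        if global_step >= start_step:
            r = factor
    return r

assert reduction_factor_for_step(0) == 7
assert reduction_factor_for_step(60_000) == 3
assert reduction_factor_for_step(400_000) == 1
```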