Hi @erogol - is this suitable?
I can post more screenshots focusing on any particular charts of interest (or zooming in further). Here are the overall EvalStats and TrainEpochStats charts (for all four sets of runs together), along with the EvalFigures and TestFigures charts for the best run (in terms of audio quality for general usage).
All runs have:
- “use_forward_attn”: false - as per this, I train without it and then turn it on for inference; is that still a sensible approach?
- “location_attn”: true - left this untouched
- audio parameters had been tuned based on the CheckSpectrograms notebook
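The train-without / infer-with forward attention workflow mentioned above can be sketched as a small config tweak. This helper is purely illustrative (only the “use_forward_attn” key comes from the actual TTS config; the function itself is hypothetical):

```python
def make_inference_config(train_config: dict) -> dict:
    """Return a copy of a training config with forward attention enabled,
    intended for inference only (hypothetical helper, not part of TTS)."""
    cfg = dict(train_config)          # shallow copy so training config is untouched
    cfg["use_forward_attn"] = True    # off during training, on for synthesis
    return cfg

# Example: trained with forward attention off, flip it on before synthesis.
train_cfg = {"use_forward_attn": False, "location_attn": True}
infer_cfg = make_inference_config(train_cfg)
print(infer_cfg["use_forward_attn"])  # True
```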
1st run
Orange + Red (continuation/fine-tuning)
neil14_october_v1-October-04-2019_02+28AM-3abf3a4
- when fine-tuning (i.e. continuing training) with the red line, I had actually made a handful of minor corpus text corrections that were discovered after the initial run (orange)
- “max_seq_len”: 200
- “do_trim_silence”: true
- “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] - followed the gradual training values provided
- “memory_size”: -1 - had left this as the default based on TTS/config.json, but later adjusted it to 5 after seeing TTS/config_tacotron.json had it higher
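For anyone else reading the “gradual_training” values: my understanding is that each entry is [start_step, r, batch_size], and the last entry whose start_step is at or below the current step is the active stage. A minimal sketch of that lookup (the helper is hypothetical, not part of TTS; the schedule values are the ones from the 1st run):

```python
def schedule_at(schedule, step):
    """Return the active (r, batch_size) for a given global step,
    assuming entries are [start_step, r, batch_size] sorted by start_step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start, stage_r, stage_bs in schedule:
        if step >= start:               # later stages override earlier ones
            r, batch_size = stage_r, stage_bs
    return r, batch_size

gradual_training = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32],
                    [130000, 2, 16], [290000, 1, 8]]
print(schedule_at(gradual_training, 60000))   # (3, 32)
print(schedule_at(gradual_training, 300000))  # (1, 8)
```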
2nd run
Cyan
neil14_october_v3-October-06-2019_11+49PM-3abf3a4
- “max_seq_len”: 195
- “do_trim_silence”: true
- “gradual_training” values unchanged from above
3rd run
Pink
neil14_october_v4-October-10-2019_12+32AM-3abf3a4
- “max_seq_len”: 164
- “do_trim_silence”: true
- “gradual_training” values unchanged from above
- some phoneme corrections in ESpeak
4th run
Turquoise
neil14_october_v4-October-12-2019_12+16AM-3abf3a4
- “max_seq_len”: 164
- “do_trim_silence”: false
- some additional phoneme corrections in ESpeak
- tried a bigger batch size for the later gradual training stages (simply because it would be faster, right? It seems to have been fine):
  “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]
Observations: the best audio output is actually from the 2nd run (Cyan). The best model from the 4th run looked better on loss (BEST MODEL (0.03737) vs BEST MODEL (0.08910)), but it was unusable at inference: I never got any audio out of it, and it gives “Decoder stopped with 'max_decoder_steps'” even on short phrases.
Also, none of the runs could produce consistent output once training transitioned to r=1; the best results were all from the r=2 stage.