Hello, just sharing my results. I’m stopping at 47k steps for Tacotron 2:
The gaps seem normal for my data and are not affecting performance.
As a reference for others:
Final audios:
(feature-23 is a tongue twister)
47k.zip (1,0 MB)
Experiment with new LPCNet model:
- real speech.wav = audio from the training set
- old lpcnet model.wav = generated using the real features of real speech.wav with the old model (male)
- from features.wav = the old LPCNet model fine-tuned on the new female voice, audio generated from real speech features. 600k steps with 14h of voice.
test.zip (1,1 MB)
I was surprised to see that the male voice model generates a female voice.
Now about training speed:
My first model took 3h/epoch with 50h of data on a V100 (trained for 10 epochs).
Now the new female model with 14h of speech takes about 30 min per epoch.
Epoch 1
333333/333333 [==============================] - 1858s 6ms/step - loss: 3.2461
Epoch 2
9536/333333 [..............................] - ETA: 29:54 - loss: 3.2475
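For anyone who wants to check these numbers against their own runs, here is a minimal sketch that pulls the step count and wall time out of a finished-epoch progress line and compares the two runs by hours of audio per hour of training. The helper names (`epoch_stats`, `audio_hours_per_training_hour`) are made up for this post, and the comparison only uses the rough figures quoted above, so treat it as a back-of-the-envelope check and not a benchmark.

```python
import re

# Finished-epoch progress line copied from the log in this post.
LOG_EPOCH_1 = "333333/333333 [==============================] - 1858s 6ms/step - loss: 3.2461"


def epoch_stats(line):
    """Return (steps, seconds, ms_per_step) from a finished Keras progress line.

    Hypothetical helper: it only handles lines that report total time in
    whole seconds, like the one above.
    """
    m = re.search(r"(\d+)/(\d+) \[.*?\] - (\d+)s", line)
    if m is None:
        raise ValueError("not a finished-epoch progress line")
    steps = int(m.group(2))
    seconds = int(m.group(3))
    return steps, seconds, 1000.0 * seconds / steps


def audio_hours_per_training_hour(audio_hours, epoch_hours):
    """Hours of training audio covered per hour of wall-clock time in one epoch."""
    return audio_hours / epoch_hours


steps, seconds, ms = epoch_stats(LOG_EPOCH_1)
print(f"{steps} steps in {seconds / 60:.1f} min ({ms:.2f} ms/step)")

# Rough figures from the post: 50h of data at 3h/epoch vs 14h at ~0.5h/epoch.
old_rate = audio_hours_per_training_hour(50, 3.0)
new_rate = audio_hours_per_training_hour(14, 0.5)
print(f"old: {old_rate:.1f} h audio/h, new: {new_rate:.1f} h audio/h")
```

The 1858s reported for epoch 1 works out to about 31 minutes, which matches the 30 min/epoch figure and the ETA shown for epoch 2.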
It uses CuDNNGRU, so it is really fast to train. Yes, the V100 is pretty fast, but most of the speed comes from the optimized cuDNN implementation.
Of course I’ll share the models, as always.