I'm sharing a Speaker Encoder implementation that is still experimental. However, it works quite well so far. You can visit https://github.com/mozilla/TTS/tree/dev/speaker_encoder to play with it. Any feedback is welcome.
I’m currently training the Speaker_Encoder on the LibriTTS 360 dataset. Apart from the required path adjustments, I’m using speaker_encoder/config.json in its default settings. Training seems to proceed incredibly slowly, yet both my GPU (RTX 2080 Super) and CPU (i7-9700 @ 3GHz x 8) are hardly utilized (~4% and ~30% respectively). I noticed that LoaderTime and StepTime are rather high, but as this is my first try with the Speaker_Encoder, I can’t judge whether they are too high.
| > Step:1 Loss:3.26553 AvgLoss:3.26553 GradNorm:3.69772 StepTime:0.59 LoaderTime:127.44 LR:0.000100
BEST MODEL (3.26553) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:2 Loss:3.29812 AvgLoss:3.26585 GradNorm:3.76372 StepTime:0.59 LoaderTime:122.70 LR:0.000100
| > Step:3 Loss:3.22227 AvgLoss:3.26542 GradNorm:3.52534 StepTime:0.59 LoaderTime:124.80 LR:0.000100
BEST MODEL (3.26542) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:4 Loss:3.26529 AvgLoss:3.26542 GradNorm:3.28403 StepTime:0.59 LoaderTime:123.52 LR:0.000100
BEST MODEL (3.26542) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:5 Loss:3.31441 AvgLoss:3.26591 GradNorm:2.57975 StepTime:0.60 LoaderTime:130.64 LR:0.000100
| > Step:6 Loss:3.29085 AvgLoss:3.26616 GradNorm:4.37252 StepTime:0.60 LoaderTime:126.86 LR:0.000100
Tensorboard screenshot:
I guess my questions are:
- Is it normal for the speaker_encoder training to proceed this slowly?
- How could I improve the hardware utilization? What are the bottlenecks here?
- Do I need to do any preprocessing on the dataset or split it into train and eval sets?
- It is normal. The dataloader is not very efficient, and it loads a lot of data dynamically.
- You could look for a more efficient way to load the data.
- No, you don’t need an additional split.
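To put a number on the bottleneck, here is a rough sketch that parses the log lines posted above and estimates how much of each iteration is spent waiting on the dataloader. It is not part of the TTS repo; the `loader_share` helper is a hypothetical name, and it assumes that StepTime and LoaderTime are in the same unit and that they add up to the time of one iteration.

```python
import re

# Log lines copied verbatim from the training output posted above.
LOG = """\
| > Step:1 Loss:3.26553 AvgLoss:3.26553 GradNorm:3.69772 StepTime:0.59 LoaderTime:127.44 LR:0.000100
| > Step:2 Loss:3.29812 AvgLoss:3.26585 GradNorm:3.76372 StepTime:0.59 LoaderTime:122.70 LR:0.000100
| > Step:3 Loss:3.22227 AvgLoss:3.26542 GradNorm:3.52534 StepTime:0.59 LoaderTime:124.80 LR:0.000100
| > Step:4 Loss:3.26529 AvgLoss:3.26542 GradNorm:3.28403 StepTime:0.59 LoaderTime:123.52 LR:0.000100
| > Step:5 Loss:3.31441 AvgLoss:3.26591 GradNorm:2.57975 StepTime:0.60 LoaderTime:130.64 LR:0.000100
| > Step:6 Loss:3.29085 AvgLoss:3.26616 GradNorm:4.37252 StepTime:0.60 LoaderTime:126.86 LR:0.000100
"""

# Matches the step number, StepTime and LoaderTime fields of one log line.
PATTERN = re.compile(r"Step:(\d+).*?StepTime:([\d.]+)\s+LoaderTime:([\d.]+)")


def loader_share(log_text):
    """Return (step, step_time, loader_time, loader_fraction) per logged step.

    Assumption: one iteration takes step_time + loader_time, so
    loader_fraction is the share of the iteration spent loading data.
    """
    rows = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skip "BEST MODEL" lines and anything else
        step = int(match.group(1))
        step_time = float(match.group(2))
        loader_time = float(match.group(3))
        fraction = loader_time / (step_time + loader_time)
        rows.append((step, step_time, loader_time, fraction))
    return rows


rows = loader_share(LOG)
for step, step_time, loader_time, fraction in rows:
    print(f"step {step}: loader {loader_time:.2f}, "
          f"compute {step_time:.2f}, loader share {fraction:.1%}")
```

Under those assumptions, the loader share is above 99% for every step in the log, which fits the low GPU utilization: the model step itself is quick, and nearly all the time goes into fetching the batch.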