[UPDATE] Released Speaker Encoder

I am sharing a Speaker Encoder implementation. It is still experimental, but it works quite well so far. You can visit https://github.com/mozilla/TTS/tree/dev/speaker_encoder to try it out. Any feedback is welcome.


I’m currently training the Speaker_Encoder on the LibriTTS 360 dataset. Apart from the required path adjustments, I’m using speaker_encoder/config.json in its default settings. Training seems to proceed incredibly slowly, yet both my GPU (RTX 2080 Super) and CPU (i7-9700 @ 3GHz x 8) are hardly utilized (~4% and ~30% respectively). I noticed that LoaderTime and StepTime are rather high, but as this is my first attempt with the Speaker_Encoder, I can’t tell whether they are too high.

| > Step:1  Loss:3.26553  AvgLoss:3.26553  GradNorm:3.69772  StepTime:0.59 LoaderTime:127.44  LR:0.000100

BEST MODEL (3.26553) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:2 Loss:3.29812 AvgLoss:3.26585 GradNorm:3.76372 StepTime:0.59 LoaderTime:122.70 LR:0.000100
| > Step:3 Loss:3.22227 AvgLoss:3.26542 GradNorm:3.52534 StepTime:0.59 LoaderTime:124.80 LR:0.000100

BEST MODEL (3.26542) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:4 Loss:3.26529 AvgLoss:3.26542 GradNorm:3.28403 StepTime:0.59 LoaderTime:123.52 LR:0.000100

BEST MODEL (3.26542) : /home/**/Documents/outputs/libri_tts/speaker_encoder/libritts_360-half-May-13-2020_08+39AM-fff8a11/best_model.pth.tar
| > Step:5 Loss:3.31441 AvgLoss:3.26591 GradNorm:2.57975 StepTime:0.60 LoaderTime:130.64 LR:0.000100
| > Step:6 Loss:3.29085 AvgLoss:3.26616 GradNorm:4.37252 StepTime:0.60 LoaderTime:126.86 LR:0.000100
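
One way to judge whether the loader is the bottleneck is to compare the two timings in the log directly. The sketch below is only an illustration: it assumes StepTime and LoaderTime are reported separately and in the same unit, and `loader_share` is a hypothetical helper, not part of the TTS repository.

```python
import re

# Matches the "StepTime:<x> LoaderTime:<y>" pair found on each training log line.
LOG_LINE = re.compile(r"StepTime:(?P<step>[\d.]+)\s+LoaderTime:(?P<loader>[\d.]+)")


def loader_share(log_text):
    """Return the fraction of the logged time that was spent in data loading.

    Assumption: StepTime and LoaderTime are measured separately, in the same unit.
    """
    step_total = 0.0
    loader_total = 0.0
    for match in LOG_LINE.finditer(log_text):
        step_total += float(match.group("step"))
        loader_total += float(match.group("loader"))
    total = step_total + loader_total
    if total == 0:
        raise ValueError("no StepTime/LoaderTime entries found")
    return loader_total / total


# Two lines copied from the log above.
sample_log = """\
| > Step:1  Loss:3.26553  AvgLoss:3.26553  GradNorm:3.69772  StepTime:0.59 LoaderTime:127.44  LR:0.000100
| > Step:2 Loss:3.29812 AvgLoss:3.26585 GradNorm:3.76372 StepTime:0.59 LoaderTime:122.70 LR:0.000100
"""

share = loader_share(sample_log)
print(f"loader share of logged time: {share:.1%}")
```

Under that assumption, the loader accounts for more than 99% of the logged time in these two lines, which would be consistent with the low GPU utilization.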

Tensorboard screenshot:

I guess my questions are:

  1. Is it normal for speaker_encoder training to proceed this slowly?
  2. How could I improve the hardware utilization? What are the bottlenecks here?
  3. Do I need to do some preprocessing on the dataset or split it into train and eval sets?
  1. It is normal. The dataloader is not very efficient, and it loads a lot of data dynamically.
  2. You could look for a more efficient way to load the data.
  3. No, you don’t need an additional split.
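
As one possible direction for point 2, files can be loaded concurrently instead of one after another. This is a minimal stdlib sketch, not the repository's loader: `load_batch_parallel` and `fake_load` are hypothetical names, and `fake_load` only simulates a slow read with a short sleep. Threads mainly help when loading is I/O-bound; for CPU-bound decoding, worker processes (for example the `num_workers` argument of PyTorch's `DataLoader`) are usually the better fit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_batch_parallel(paths, load_fn, max_workers=4):
    """Apply load_fn to every path concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of its inputs.
        return list(pool.map(load_fn, paths))


def fake_load(path):
    # Hypothetical stand-in for reading and decoding one audio file.
    time.sleep(0.05)
    return len(path)


# Hypothetical file names, used only for illustration.
paths = [f"utt_{i}.wav" for i in range(8)]
batch = load_batch_parallel(paths, fake_load, max_workers=8)
print(len(batch), "items loaded")
```

Whether this actually helps depends on where the time goes, so it is worth measuring LoaderTime before and after any change.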