Slow Distributed Training


I’m attempting to train Tacotron2 (from the dev-tacotron2 branch) on multiple GPUs. With 4 V100s, the number of steps per second seems to be lower than when training on a single GPU. This is my config:

    "run_name": "moz",
    "run_description": "Train from scratch",

        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame.
        "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50,  // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-length in ms.
        "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": false, // move normalization to range [-1, 1]
        "max_norm": 1,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": 0.0,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,        // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true  // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)

        "backend": "nccl",
        "url": "tcp:\/\/localhost:54321"

    "reinit_layers": [],

    "model": "Tacotron2",   // one of the model in models/
    "grad_clip": 1,      // upper limit for gradients for clipping.
    "epochs": 1000,         // total number of epochs to train.
    "lr": 0.0001,            // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_decay": false,      // if true, Noam learning rate decaying is applied through training.
    "warmup_steps": 4000,   // Noam decay steps to increase the learning rate from 0 to "lr"
    "windowing": false,      // Enables attention windowing. Used only in eval mode.
    "memory_size": 5,       //  ONLY TACOTRON - memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5.
    "attention_norm": "softmax",   // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
    "prenet_type": "bn",    // ONLY TACOTRON2 - "original" or "bn".
    "use_forward_attn": true,    // ONLY TACOTRON2 - if it uses forward attention. In general, it aligns faster.
    "transition_agent": false,    // ONLY TACOTRON2 - enable/disable transition agent of forward attention.
    "loss_masking": false,       // enable / disable loss masking against the sequence padding.
    "enable_eos_bos_chars": true, // enable/disable beginning of sentence and end of sentence chars.

    "batch_size": 32,       // Batch size for training. Lower values than 32 might cause hard to learn attention.
    "r": 1,                 // Number of frames to predict for step.
    "wd": 0.000001,         // Weight decay weight.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "save_step": 1000,      // Number of training steps expected to save training stats and checkpoints.
    "print_step": 10,       // Number of steps to log training on console.
    "tb_model_param_stats": true,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
    "batch_group_size": 8,  //Number of batches to shuffle after bucketing.

    "run_eval": true,
    "test_delay_epochs": 100,  //Until attention is aligned, testing only wastes computation time.
    "data_path": "/home/TTS/LJSpeech-1.1",  // DATASET-RELATED: can be overwritten from command argument
    "meta_file_train": "metadata_train.csv",      // DATASET-RELATED: metafile for training dataloader.
    "meta_file_val": "metadata_val.csv",    // DATASET-RELATED: metafile for evaluation dataloader.
    "dataset": "ljspeech",      // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use "tts_cache" for pre-computed dataset by
    "min_seq_len": 0,       // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 150,     // DATASET-RELATED: maximum text length
    "output_path": "/home/TTS/ljspeech_models",      // DATASET-RELATED: output path for all training outputs.
    "num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "phoneme_cache_path": "ljspeech_phonemes",  // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true,           // use phonemes instead of raw characters. It is suggested for better pronunciation.
    "phoneme_language": "en-us",     // depending on your target language, pick one from
    "text_cleaner": "phoneme_cleaners"

| > Step:6/71 GlobalStep:50 TotalLoss:0.54487 PostnetLoss:0.43788 DecoderLoss:0.10699 StopLoss:0.66825 GradNorm:0.21360 GradNormST:0.77358 AvgTextLen:46.2 AvgSpecLen:226.4 StepTime:6.46 LR:0.000100
| > Step:16/71 GlobalStep:60 TotalLoss:0.53199 PostnetLoss:0.44408 DecoderLoss:0.08792 StopLoss:0.65621 GradNorm:0.20161 GradNormST:0.78307 AvgTextLen:63.0 AvgSpecLen:317.3 StepTime:8.86 LR:0.000100

Any reasons why this might be happening?

If this is the first epoch, which I see it is, it is caching the phonemes for the dataset. That might be the reason. How does it perform in the next epoch?

It doesn’t seem like it improves past the first epoch:

| > Step:6/71 GlobalStep:50 TotalLoss:0.54487 PostnetLoss:0.43788 DecoderLoss:0.10699 StopLoss:0.66825 GradNorm:0.21360 GradNormST:0.77358 AvgTextLen:46.2 AvgSpecLen:226.4 StepTime:6.46 LR:0.000100
| > Step:16/71 GlobalStep:60 TotalLoss:0.53199 PostnetLoss:0.44408 DecoderLoss:0.08792 StopLoss:0.65621 GradNorm:0.20161 GradNormST:0.78307 AvgTextLen:63.0 AvgSpecLen:317.3 StepTime:8.86 LR:0.000100
| > Step:26/71 GlobalStep:70 TotalLoss:0.53116 PostnetLoss:0.44882 DecoderLoss:0.08234 StopLoss:0.64465 GradNorm:0.20479 GradNormST:0.60509 AvgTextLen:77.3 AvgSpecLen:364.4 StepTime:10.82 LR:0.000100
| > Step:36/71 GlobalStep:80 TotalLoss:0.52513 PostnetLoss:0.43669 DecoderLoss:0.08845 StopLoss:0.63135 GradNorm:0.19909 GradNormST:0.62129 AvgTextLen:93.2 AvgSpecLen:452.8 StepTime:14.81 LR:0.000100
| > Step:46/71 GlobalStep:90 TotalLoss:0.52258 PostnetLoss:0.42938 DecoderLoss:0.09320 StopLoss:0.61990 GradNorm:0.20076 GradNormST:0.59656 AvgTextLen:106.4 AvgSpecLen:501.6 StepTime:14.51 LR:0.000100
| > Step:56/71 GlobalStep:100 TotalLoss:0.52352 PostnetLoss:0.43600 DecoderLoss:0.08752 StopLoss:0.62039 GradNorm:0.19930 GradNormST:0.65667 AvgTextLen:119.5 AvgSpecLen:572.5 StepTime:11.96 LR:0.000100
| > Step:66/71 GlobalStep:110 TotalLoss:0.51841 PostnetLoss:0.43380 DecoderLoss:0.08460 StopLoss:0.60948 GradNorm:0.19739 GradNormST:0.61151 AvgTextLen:132.9 AvgSpecLen:652.7 StepTime:5.30 LR:0.000100
| > EPOCH END – GlobalStep:115 AvgTotalLoss:1.16338 AvgPostnetLoss:0.43881 AvgDecoderLoss:0.08936 AvgStopLoss:0.63521 EpochTime:782.54 AvgStepTime:10.87

| > TotalLoss: 0.83932 PostnetLoss: 0.14724 DecoderLoss:0.11295 StopLoss: 0.57913
| > TotalLoss: 0.80395 PostnetLoss: 0.11978 DecoderLoss:0.09565 StopLoss: 0.58851
/usr/local/lib/python3.6/dist-packages/librosa/util/ FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
if np.issubdtype(x.dtype, float) or np.issubdtype(x.dtype, complex):
warning: audio amplitude out of range, auto clipped.
| > Training Loss: 0.43881 Validation Loss: 0.11574

BEST MODEL (0.11574) : /home/TTS/ljspeech_models/mozilla-nomask-fattn-bn-May-11-2019_09+02AM-3c8aef3/best_model.pth.tar

Epoch 1/1000
| > Step:4/71 GlobalStep:120 TotalLoss:0.51753 PostnetLoss:0.42890 DecoderLoss:0.08863 StopLoss:0.58340 GradNorm:0.20354 GradNormST:0.60581 AvgTextLen:39.8 AvgSpecLen:210.8 StepTime:10.05 LR:0.000100
| > Step:14/71 GlobalStep:130 TotalLoss:0.51455 PostnetLoss:0.43335 DecoderLoss:0.08120 StopLoss:0.58700 GradNorm:0.20094 GradNormST:0.51306 AvgTextLen:60.6 AvgSpecLen:292.2 StepTime:7.21 LR:0.000100
| > Step:24/71 GlobalStep:140 TotalLoss:0.51323 PostnetLoss:0.42749 DecoderLoss:0.08574 StopLoss:0.58115 GradNorm:0.19666 GradNormST:0.56741 AvgTextLen:74.9 AvgSpecLen:360.2 StepTime:10.74 LR:0.000100
| > Step:34/71 GlobalStep:150 TotalLoss:0.50972 PostnetLoss:0.41712 DecoderLoss:0.09261 StopLoss:0.56152 GradNorm:0.19392 GradNormST:0.51881 AvgTextLen:89.6 AvgSpecLen:430.6 StepTime:13.61 LR:0.000100
| > Step:44/71 GlobalStep:160 TotalLoss:0.51037 PostnetLoss:0.42371 DecoderLoss:0.08666 StopLoss:0.57462 GradNorm:0.19522 GradNormST:0.60050 AvgTextLen:101.9 AvgSpecLen:495.9 StepTime:12.77 LR:0.000100
| > Step:54/71 GlobalStep:170 TotalLoss:0.50645 PostnetLoss:0.42643 DecoderLoss:0.08003 StopLoss:0.55054 GradNorm:0.19485 GradNormST:0.52999 AvgTextLen:116.8 AvgSpecLen:573.9 StepTime:18.38 LR:0.000100
| > Step:64/71 GlobalStep:180 TotalLoss:0.50740 PostnetLoss:0.43016 DecoderLoss:0.07723 StopLoss:0.55376 GradNorm:0.19712 GradNormST:0.46557 AvgTextLen:130.2 AvgSpecLen:678.7 StepTime:5.48 LR:0.000100
| > EPOCH END – GlobalStep:187 AvgTotalLoss:1.08023 AvgPostnetLoss:0.42494 AvgDecoderLoss:0.08652 AvgStopLoss:0.56876 EpochTime:741.98 AvgStepTime:10.31

| > TotalLoss: 0.75608 PostnetLoss: 0.13276 DecoderLoss:0.10516 StopLoss: 0.51816
| > TotalLoss: 0.72915 PostnetLoss: 0.10830 DecoderLoss:0.09222 StopLoss: 0.52863
warning: audio amplitude out of range, auto clipped.
| > Training Loss: 0.42494 Validation Loss: 0.10412

BEST MODEL (0.10412) : /home/TTS/ljspeech_models/mozilla-nomask-fattn-bn-May-11-2019_09+02AM-3c8aef3/best_model.pth.tar

Epoch 2/1000
| > Step:2/71 GlobalStep:190 TotalLoss:0.50446 PostnetLoss:0.42553 DecoderLoss:0.07894 StopLoss:0.53321 GradNorm:0.19970 GradNormST:0.51181 AvgTextLen:32.0 AvgSpecLen:172.7 StepTime:10.11 LR:0.000100
| > Step:12/71 GlobalStep:200 TotalLoss:0.50144 PostnetLoss:0.42038 DecoderLoss:0.08106 StopLoss:0.53346 GradNorm:0.19628 GradNormST:0.58735 AvgTextLen:57.3 AvgSpecLen:287.2 StepTime:9.37 LR:0.000100
| > Step:22/71 GlobalStep:210 TotalLoss:0.49974 PostnetLoss:0.41562 DecoderLoss:0.08412 StopLoss:0.52313 GradNorm:0.19687 GradNormST:0.45988 AvgTextLen:72.0 AvgSpecLen:360.4 StepTime:12.89 LR:0.000100
| > Step:32/71 GlobalStep:220 TotalLoss:0.49886 PostnetLoss:0.40855 DecoderLoss:0.09032 StopLoss:0.52051 GradNorm:0.19232 GradNormST:0.46348 AvgTextLen:87.8 AvgSpecLen:438.6 StepTime:11.99 LR:0.000100
| > Step:42/71 GlobalStep:230 TotalLoss:0.49620 PostnetLoss:0.41395 DecoderLoss:0.08225 StopLoss:0.51213 GradNorm:0.19488 GradNormST:0.44572 AvgTextLen:100.7 AvgSpecLen:505.2 StepTime:14.61 LR:0.000100
| > Step:52/71 GlobalStep:240 TotalLoss:0.49629 PostnetLoss:0.40865 DecoderLoss:0.08764 StopLoss:0.54463 GradNorm:0.19355 GradNormST:0.55476 AvgTextLen:114.3 AvgSpecLen:578.0 StepTime:15.12 LR:0.000100
| > Step:62/71 GlobalStep:250 TotalLoss:0.49350 PostnetLoss:0.41181 DecoderLoss:0.08169 StopLoss:0.52830 GradNorm:0.19509 GradNormST:0.53981 AvgTextLen:128.3 AvgSpecLen:622.3 StepTime:5.03 LR:0.000100
| > EPOCH END – GlobalStep:259 AvgTotalLoss:1.02503 AvgPostnetLoss:0.41446 AvgDecoderLoss:0.08437 AvgStopLoss:0.52620 EpochTime:743.00 AvgStepTime:10.32

| > TotalLoss: 0.71101 PostnetLoss: 0.12067 DecoderLoss:0.10573 StopLoss: 0.48461
| > TotalLoss: 0.67964 PostnetLoss: 0.08870 DecoderLoss:0.08808 StopLoss: 0.50286
warning: audio amplitude out of range, auto clipped.
| > Training Loss: 0.41446 Validation Loss: 0.08377

BEST MODEL (0.08377) : /home/TTS/ljspeech_models/mozilla-nomask-fattn-bn-May-11-2019_09+02AM-3c8aef3/best_model.pth.tar

Epoch 3/1000
| > Step:0/71 GlobalStep:260 TotalLoss:0.49399 PostnetLoss:0.42217 DecoderLoss:0.07182 StopLoss:0.49627 GradNorm:0.22202 GradNormST:0.53246 AvgTextLen:20.4 AvgSpecLen:107.7 StepTime:4.45 LR:0.000100
| > Step:10/71 GlobalStep:270 TotalLoss:0.49212 PostnetLoss:0.40605 DecoderLoss:0.08607 StopLoss:0.50700 GradNorm:0.19492 GradNormST:0.40409 AvgTextLen:53.6 AvgSpecLen:250.5 StepTime:7.11 LR:0.000100
| > Step:20/71 GlobalStep:280 TotalLoss:0.49043 PostnetLoss:0.40192 DecoderLoss:0.08851 StopLoss:0.50858 GradNorm:0.19022 GradNormST:0.52474 AvgTextLen:68.3 AvgSpecLen:314.5 StepTime:11.48 LR:0.000100
| > Step:30/71 GlobalStep:290 TotalLoss:0.48764 PostnetLoss:0.39601 DecoderLoss:0.09163 StopLoss:0.49873 GradNorm:0.19081 GradNormST:0.47822 AvgTextLen:83.5 AvgSpecLen:404.7 StepTime:11.26 LR:0.000100
| > Step:40/71 GlobalStep:300 TotalLoss:0.48526 PostnetLoss:0.40055 DecoderLoss:0.08471 StopLoss:0.47733 GradNorm:0.18966 GradNormST:0.41977 AvgTextLen:98.9 AvgSpecLen:479.6 StepTime:13.23 LR:0.000100
| > Step:50/71 GlobalStep:310 TotalLoss:0.48338 PostnetLoss:0.40064 DecoderLoss:0.08274 StopLoss:0.48476 GradNorm:0.18995 GradNormST:0.39667 AvgTextLen:109.8 AvgSpecLen:550.8 StepTime:14.04 LR:0.000100
| > Step:60/71 GlobalStep:320 TotalLoss:0.48226 PostnetLoss:0.39908 DecoderLoss:0.08318 StopLoss:0.48741 GradNorm:0.19012 GradNormST:0.48689 AvgTextLen:124.0 AvgSpecLen:610.6 StepTime:4.60 LR:0.000100
| > Step:70/71 GlobalStep:330 TotalLoss:0.48125 PostnetLoss:0.39979 DecoderLoss:0.08146 StopLoss:0.49556 GradNorm:0.18949 GradNormST:0.49198 AvgTextLen:140.1 AvgSpecLen:684.8 StepTime:5.27 LR:0.000100
| > EPOCH END – GlobalStep:331 AvgTotalLoss:0.97941 AvgPostnetLoss:0.40495 AvgDecoderLoss:0.08210 AvgStopLoss:0.49236 EpochTime:758.00 AvgStepTime:10.53
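For anyone who wants to compare epochs without reading the raw lines, here is a rough sketch that pulls `AvgSpecLen` and `StepTime` out of the per-step log lines above. The names `STEP_RE` and `parse_step_lines` are made up for this example and are not part of TTS; the regex only assumes the log format shown in this thread.

```python
import re
from statistics import mean

# Matches per-step lines such as:
# | > Step:6/71 GlobalStep:50 ... AvgSpecLen:226.4 StepTime:6.46 LR:0.000100
# The leading \b keeps "GlobalStep:" and "AvgStepTime:" from matching.
STEP_RE = re.compile(
    r"\bStep:(\d+)/(\d+)\s.*?AvgSpecLen:([\d.]+)\s+StepTime:([\d.]+)"
)


def parse_step_lines(log_text):
    """Return a list of (step, avg_spec_len, step_time_s) tuples."""
    rows = []
    for line in log_text.splitlines():
        m = STEP_RE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(3)), float(m.group(4))))
    return rows


# Two step lines and one validation line copied from the log in this thread.
sample = """\
| > Step:6/71 GlobalStep:50 TotalLoss:0.54487 PostnetLoss:0.43788 DecoderLoss:0.10699 StopLoss:0.66825 GradNorm:0.21360 GradNormST:0.77358 AvgTextLen:46.2 AvgSpecLen:226.4 StepTime:6.46 LR:0.000100
| > Step:16/71 GlobalStep:60 TotalLoss:0.53199 PostnetLoss:0.44408 DecoderLoss:0.08792 StopLoss:0.65621 GradNorm:0.20161 GradNormST:0.78307 AvgTextLen:63.0 AvgSpecLen:317.3 StepTime:8.86 LR:0.000100
| > Training Loss: 0.43881 Validation Loss: 0.11574
"""

rows = parse_step_lines(sample)
mean_step_time = mean(t for _, _, t in rows)
for step, spec_len, step_time in rows:
    # Seconds per spectrogram frame gives a length-normalized view of step time.
    print(step, spec_len, step_time, round(step_time / spec_len, 4))
print("mean StepTime:", round(mean_step_time, 2))
```

Running this over the whole log from a single-GPU run and a 4-GPU run would give two lists that can be compared directly at similar `AvgSpecLen` values, since the batches here appear to be ordered by length within an epoch.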

Is it LJSpeech? And do you use ?

I tried with my own dataset, as well as one from M-AILABS (13372 files, ~24 hours of audio). I am using

Unfortunately, I have no idea. It might be something about the dataset, for example if the sequences are too long. Check the sampling rate of the dataset: if it is higher than 20 kHz, training would take too long to run. That’s my only guess.
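A quick way to check that guess is to scan the dataset for its sample rates and longest clips. The sketch below assumes the audio is stored as uncompressed PCM WAV, which Python's built-in `wave` module can read; `wav_stats` and `summarize_dataset` are hypothetical helper names for this example, not TTS functions.

```python
import wave
from pathlib import Path


def wav_stats(wav_path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / float(rate)


def summarize_dataset(wav_dir, limit=None):
    """Scan *.wav files under wav_dir and report rates and durations.

    limit: optional cap on the number of files to inspect.
    """
    rates = set()
    longest_s = 0.0
    total_s = 0.0
    files = 0
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        if limit is not None and files >= limit:
            break
        rate, duration = wav_stats(path)
        rates.add(rate)
        longest_s = max(longest_s, duration)
        total_s += duration
        files += 1
    return {
        "sample_rates": sorted(rates),
        "longest_s": longest_s,
        "total_s": total_s,
        "files": files,
    }


# Example call, using the data_path from the config above:
# print(summarize_dataset("/home/TTS/LJSpeech-1.1"))
```

If the reported rates differ from the `sample_rate` in the config, the config comment says the audio is resampled on load, which adds work to every batch. Very long clips would also show up here as a large `longest_s`.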