Tacotron training error: Mean-Var stats does not match the given feature dimensions

I’m trying to train a Tacotron model on a custom dataset. I’m on Windows 10 with Python 3.7.6, torch 1.7.1, and CUDA 10.2. Training fails with this error:

python .\TTS\bin\train_tacotron.py --config_path .\TTS\tts\configs\config_fin.json
 > Using CUDA:  True
 > Number of GPUs:  1
   >  Mixed precision mode is ON
 > Git Hash: db88c21
 > Experiment folder: data/keep\librivox-finnish-December-18-2020_06+52PM-db88c21
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
  | > griffin_lim_iters:60
  | > signal_norm:True
  | > symmetric_norm:True
  | > mel_fmin:50.0
  | > mel_fmax:7600.0
  | > spec_gain:1.0
  | > stft_pad_mode:reflect
  | > max_norm:4.0
  | > clip_norm:True
  | > do_trim_silence:False
  | > trim_db:60
  | > do_sound_norm:False
  | > stats_path:data/librivox-finnish/scale_stats.npy
  | > hop_length:256
  | > win_length:1024
  | > Found 14778 files in F:\Datasets\librivox_finnish2
  > Using model: Tacotron
 
  > Model has 8566629 parameters
 
  > EPOCH: 0/1000
 
  > Number of output frames: 7
 
  > DataLoader initialization
  | > Use phonemes: False
  | > Number of instances : 14778
  | > Max length sequence: 346
  | > Min length sequence: 3
  | > Avg length sequence: 67.62606577344701
  | > Num. instances discarded by max-min (max=153, min=6) seq limits: 152
  | > Batch group size: 256.
 
  > TRAINING (2020-12-18 18:52:07)
  > Using CUDA:  True
  > Number of GPUs:  1
  > Using CUDA:  True
  > Number of GPUs:  1
  > Using CUDA:  True
  > Number of GPUs:  1
  > Using CUDA:  True
  > Number of GPUs:  1
  ! Run is removed from data/keep\librivox-finnish-December-18-2020_06+52PM-db88c21
 Traceback (most recent call last):
   File ".\TTS\bin\train_tacotron.py", line 690, in <module>
     main(args)
   File ".\TTS\bin\train_tacotron.py", line 602, in main
     global_step, epoch, scaler, scaler_st, speaker_mapping)
   File ".\TTS\bin\train_tacotron.py", line 143, in train
     for num_iter, data in enumerate(data_loader):
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
     data = self._next_data()
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 1085, in _next_data
     return self._process_data(data)
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 1111, in _process_data
     data.reraise()
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\_utils.py", line 428, in reraise
     raise self.exc_type(msg)
 RuntimeError: Caught RuntimeError in DataLoader worker process 0.
 Original Traceback (most recent call last):
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\_utils\worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
   File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
     return self.collate_fn(data)
   File "d:\lasa\code\tts\TTS\tts\datasets\TTSDataset.py", line 265, in collate_fn
     self.ap.spectrogram(w).astype('float32') for w in wav
   File "d:\lasa\code\tts\TTS\tts\datasets\TTSDataset.py", line 265, in <listcomp>
     self.ap.spectrogram(w).astype('float32') for w in wav
   File "d:\lasa\code\tts\TTS\utils\audio.py", line 222, in spectrogram
     return self._normalize(S)
   File "d:\lasa\code\tts\TTS\utils\audio.py", line 120, in _normalize
     raise RuntimeError(' [!] Mean-Var stats does not match the given feature dimensions.')
 RuntimeError:  [!] Mean-Var stats does not match the given feature dimensions.
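One plausible reading of that traceback, sketched in plain Python (this is not the actual TTS code, just the shape logic): Tacotron (v1) also works with linear spectrograms, which have `fft_size // 2 + 1` bins, so if the loaded `scale_stats.npy` statistics only fit the 80 mel bins, normalizing the 513-bin linear spectrogram would trip the dimension check.

```python
# Rough sketch of the dimension check that raises (not the real _normalize):
def normalize(spec_bins, stats_bins):
    # spec_bins: feature dim of the spectrogram being normalized
    # stats_bins: feature dim of the mean/std loaded from scale_stats.npy
    if spec_bins != stats_bins:
        raise RuntimeError(" [!] Mean-Var stats does not match the given feature dimensions.")

fft_size, num_mels = 1024, 80        # values from the config above
linear_bins = fft_size // 2 + 1      # 513 bins for a linear spectrogram
normalize(num_mels, num_mels)        # mel branch: stats fit, no error
try:
    normalize(linear_bins, num_mels)  # linear branch: 513-bin spec vs 80-bin stats
except RuntimeError as e:
    print(e)
```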

I also tried training with Mean-Var normalization disabled, but that produces a different error:

    Traceback (most recent call last):
  File ".\TTS\bin\train_tacotron.py", line 690, in <module>
    main(args)
  File ".\TTS\bin\train_tacotron.py", line 602, in main
    global_step, epoch, scaler, scaler_st, speaker_mapping)
  File ".\TTS\bin\train_tacotron.py", line 182, in train
    text_lengths)
  File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 359, in forward
    postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
  File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 204, in forward
    return self.loss_func(x_diff, target_diff, length-1)
  File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 50, in forward
    loss = functional.l1_loss(x * mask, target * mask, reduction='sum')
RuntimeError: The size of tensor a (80) must match the size of tensor b (513) at non-singleton dimension 2
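The mismatched dimensions in that error line up with the same suspicion: the diff-spec loss seems to be comparing the postnet output (a 513-bin linear spectrogram in Tacotron v1) against the 80-bin mel target. A minimal stand-in for the masked L1 loss, in plain Python with no torch, shows why unequal trailing dims blow up:

```python
# Minimal sketch of a masked L1 loss over (frames x bins) lists; the real
# loss is torch's functional.l1_loss, but the dim requirement is the same.
def masked_l1(x, target):
    if len(x[0]) != len(target[0]):
        raise RuntimeError(
            f"The size of tensor a ({len(target[0])}) must match "
            f"the size of tensor b ({len(x[0])})")
    return sum(abs(a - b) for fx, ft in zip(x, target) for a, b in zip(fx, ft))

postnet_frame = [[0.0] * 513]  # Tacotron v1 postnet output: linear bins
mel_frame = [[0.0] * 80]       # mel target
try:
    masked_l1(postnet_frame, mel_frame)
except RuntimeError as e:
    print(e)
```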

My configuration is here: https://pastebin.com/zTLx0Tp4

Changing the model to Tacotron2 seems to fix the issue. I’m not sure whether this is a bug or some kind of misconfiguration.

Hi @lasa01 - welcome to the forum!

I don’t know for sure that this will fix it, but I would try disabling the “stats_path” value in your config by setting it to null rather than an empty string, just in case the empty string is being read in and confusing the code that decides whether or not to apply the stats. Using null also keeps you consistent with the config files where it isn’t set:

“stats_path”: null

From a quick skim over your config, the other values seem reasonable and nothing else leaps out.
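If that doesn’t help, you could also sanity-check the dimensions in the stats file against your `num_mels`. A rough sketch with a dummy stand-in dict (the real file is loaded with `np.load("scale_stats.npy", allow_pickle=True).item()`, and the key names below are assumptions; inspect your own file to confirm them):

```python
# Dummy stand-in for the dict loaded from scale_stats.npy; replace with the
# real load and the keys your file actually contains.
num_mels = 80  # from your config
stats = {"mel_mean": [0.0] * num_mels, "mel_std": [1.0] * num_mels}

for key, arr in stats.items():
    assert len(arr) == num_mels, f"{key}: {len(arr)} bins, expected {num_mels}"
print("stats match num_mels")
```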

Thanks for the suggestion, but unfortunately it still doesn’t work: I get the same error as with an empty string. Also, training works perfectly with Tacotron2 even with the stats path defined, so I don’t think the issue is there.

Yes, you are right, it seems to be a problem with the code. I just tried it myself and got the same error.

You could also disable the following loss functions by setting their weights to 0:

"postnet_diff_spec_alpha": 0.0, // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.0, // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.0, // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.0, // postnet ssim loss weight. If > 0, it is enabled

Thanks, that seems to fix the problem when not using Mean-Var normalization. I still get the same error when using Mean-Var, though.

Actually, it only trains for one epoch. On the second epoch I get:

Traceback (most recent call last):
  File ".\TTS\bin\train_tacotron.py", line 708, in <module>
    main(args)
  File ".\TTS\bin\train_tacotron.py", line 611, in main
    scaler_st)
  File ".\TTS\bin\train_tacotron.py", line 148, in train
    text_input, text_lengths, mel_input, mel_lengths, linear_input, stop_targets, speaker_ids, speaker_embeddings, max_text_length, max_spec_length = format_data(data)
  File ".\TTS\bin\train_tacotron.py", line 112, in format_data
    stop_targets.size(1) // c.r, -1)
RuntimeError: shape '[32, 47, -1]' is invalid for input of size 7616
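For what it’s worth, the arithmetic behind that reshape failure checks out directly from the numbers in the traceback: a view to `(32, 47, -1)` only works when the total element count is divisible by 32 × 47.

```python
# Numbers taken from the traceback; view(B, T_over_r, -1) needs numel to be
# an exact multiple of B * T_over_r, and here it is not.
B, T_over_r = 32, 47
numel = 7616
print(numel % (B * T_over_r))  # non-zero remainder -> the view is invalid
```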

Never mind, switching to the dev branch caused that error. It seems to work now on master.