I'm trying to train a Tacotron model on a custom dataset. I'm on Windows 10 with Python 3.7.6 and torch 1.7.1 (CUDA 10.2). However, training fails with the following error:
python .\TTS\bin\train_tacotron.py --config_path .\TTS\tts\configs\config_fin.json
> Using CUDA: True
> Number of GPUs: 1
> Mixed precision mode is ON
> Git Hash: db88c21
> Experiment folder: data/keep\librivox-finnish-December-18-2020_06+52PM-db88c21
> Setting up Audio Processor...
| > sample_rate:22050
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > stats_path:data/librivox-finnish/scale_stats.npy
| > hop_length:256
| > win_length:1024
| > Found 14778 files in F:\Datasets\librivox_finnish2
> Using model: Tacotron
> Model has 8566629 parameters
> EPOCH: 0/1000
> Number of output frames: 7
> DataLoader initialization
| > Use phonemes: False
| > Number of instances : 14778
| > Max length sequence: 346
| > Min length sequence: 3
| > Avg length sequence: 67.62606577344701
| > Num. instances discarded by max-min (max=153, min=6) seq limits: 152
| > Batch group size: 256.
> TRAINING (2020-12-18 18:52:07)
> Using CUDA: True
> Number of GPUs: 1
> Using CUDA: True
> Number of GPUs: 1
> Using CUDA: True
> Number of GPUs: 1
> Using CUDA: True
> Number of GPUs: 1
! Run is removed from data/keep\librivox-finnish-December-18-2020_06+52PM-db88c21
Traceback (most recent call last):
File ".\TTS\bin\train_tacotron.py", line 690, in <module>
main(args)
File ".\TTS\bin\train_tacotron.py", line 602, in main
global_step, epoch, scaler, scaler_st, speaker_mapping)
File ".\TTS\bin\train_tacotron.py", line 143, in train
for num_iter, data in enumerate(data_loader):
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
data = self._next_data()
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\dataloader.py", line 1111, in _process_data
data.reraise()
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\_utils.py", line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\_utils\worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
return self.collate_fn(data)
File "d:\lasa\code\tts\TTS\tts\datasets\TTSDataset.py", line 265, in collate_fn
self.ap.spectrogram(w).astype('float32') for w in wav
File "d:\lasa\code\tts\TTS\tts\datasets\TTSDataset.py", line 265, in <listcomp>
self.ap.spectrogram(w).astype('float32') for w in wav
File "d:\lasa\code\tts\TTS\utils\audio.py", line 222, in spectrogram
return self._normalize(S)
File "d:\lasa\code\tts\TTS\utils\audio.py", line 120, in _normalize
raise RuntimeError(' [!] Mean-Var stats does not match the given feature dimensions.')
RuntimeError: [!] Mean-Var stats does not match the given feature dimensions.
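The error suggests the precomputed `scale_stats.npy` was generated with different audio settings than the current config (e.g. a different `num_mels` or `fft_size`). A small diagnostic along these lines can confirm it; the key names (`mel_mean`, `linear_mean`, ...) are what I believe the stats-computation script writes, so verify them against your version:

```python
import numpy as np

def check_stats_compatibility(stats_path, num_mels, fft_size):
    """Compare dimensions stored in scale_stats.npy against the audio config.

    Assumes the stats file is a pickled dict with per-bin mean/std arrays
    under keys like 'mel_mean' and 'linear_mean' (check your stats script).
    Returns a list of human-readable mismatch descriptions (empty if OK).
    """
    stats = np.load(stats_path, allow_pickle=True).item()
    n_linear = fft_size // 2 + 1  # one-sided STFT bin count
    mismatches = []
    if stats["mel_mean"].shape[0] != num_mels:
        mismatches.append(
            f"mel stats have {stats['mel_mean'].shape[0]} bins, "
            f"config num_mels={num_mels}"
        )
    if stats["linear_mean"].shape[0] != n_linear:
        mismatches.append(
            f"linear stats have {stats['linear_mean'].shape[0]} bins, "
            f"config expects fft_size//2+1={n_linear}"
        )
    return mismatches
```

Running this against the config above (`num_mels=80`, `fft_size=1024`) would show whether the stats file needs to be regenerated for the current settings.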
I also tried training with mean-var normalization disabled, but that produces a different error:
Traceback (most recent call last):
File ".\TTS\bin\train_tacotron.py", line 690, in <module>
main(args)
File ".\TTS\bin\train_tacotron.py", line 602, in main
global_step, epoch, scaler, scaler_st, speaker_mapping)
File ".\TTS\bin\train_tacotron.py", line 182, in train
text_lengths)
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 359, in forward
postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 204, in forward
return self.loss_func(x_diff, target_diff, length-1)
File "D:\lasa\Code\tts_venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "d:\lasa\code\tts\TTS\tts\layers\losses.py", line 50, in forward
loss = functional.l1_loss(x * mask, target * mask, reduction='sum')
RuntimeError: The size of tensor a (80) must match the size of tensor b (513) at non-singleton dimension 2
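The 80-vs-513 pair in this error looks like a mel-vs-linear spectrogram mismatch rather than a coincidence: 80 is `num_mels` from the config, and 513 is the number of one-sided STFT bins for `fft_size=1024`. A quick sanity check of that (standard STFT) relationship:

```python
# Values from the config above.
fft_size = 1024
num_mels = 80

# A one-sided STFT of an N-point FFT has N // 2 + 1 frequency bins,
# which is where the 513 in the error message comes from.
linear_bins = fft_size // 2 + 1
print(linear_bins, num_mels)  # the 513-vs-80 pair from the traceback
```

My guess (unverified) is that Tacotron's postnet predicts a linear spectrogram (513 bins) while the diff-spec loss is comparing it against the 80-bin mel target, so the shapes can never match for this model.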
My configuration is here: https://pastebin.com/zTLx0Tp4
Switching the model to Tacotron2 seems to fix the issue. I'm not sure whether this is a bug or a misconfiguration on my part.