Hello,
I followed this and that to train a model on LJSpeech and synthesize a sound, but all I get are zeros and the resulting WAV is silent.
I am not sure what I did wrong, and I am quite new to this.
I have wrapped all the commands in a Makefile and a Python venv (see the end of the post); I hope that does not add confusion. Thank you very much for your help.
Here are my logs; see the hexdump of the WAV at the end:
**myhost$ make run**
source _venv/bin/activate; PYTHONPATH=run python3 config.py
source _venv/bin/activate; cd TTS && python3 TTS/bin/train_tacotron.py --config_path ../config.json | tee training.log
2021-04-18 04:44:38.213636: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-04-18 04:44:38.213657: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
> Using CUDA: False
> Number of GPUs: 0
> Mixed precision mode is ON
> Git Hash: e9e0784
> Experiment folder: ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:None
| > hop_length:256
| > win_length:1024
| > Found 13100 files in /home/redacted/tts/LJSpeech-1.1
> Using model: Tacotron2
> Model has 47914548 parameters
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 187
| > Min length sequence: 5
| > Avg length sequence: 98.3403500655409
| > Num. instances discarded by max-min (max=153, min=6) seq limits: 476
| > Batch group size: 128.
> EPOCH: 0/1000
> Number of output frames: 7
> TRAINING (2021-04-18 04:44:39)
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")
--> STEP: 24/195 -- GLOBAL_STEP: 25
| > decoder_loss: 4.55581 (4.81456)
| > postnet_loss: 2.73150 (5.05209)
| > stopnet_loss: 0.72497 (0.76779)
| > decoder_coarse_loss: 4.19881 (4.71119)
| > decoder_ddc_loss: 0.00184 (0.00324)
| > ga_loss: 0.00559 (0.00769)
| > decoder_diff_spec_loss: 0.01748 (0.01792)
| > postnet_diff_spec_loss: 2.11584 (3.49522)
| > decoder_ssim_loss: 0.53970 (0.55583)
| > postnet_ssim_loss: 0.53068 (0.55149)
| > loss: 5.88996 (7.22106)
| > align_error: 0.98973 (0.98480)
| > max_spec_length: 613.0
| > max_text_length: 103.0
| > step_time: 18.5761
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 49/195 -- GLOBAL_STEP: 50
| > decoder_loss: 0.57258 (3.66746)
| > postnet_loss: 0.96710 (3.34215)
| > stopnet_loss: 0.50418 (0.72119)
| > decoder_coarse_loss: 0.36679 (3.19028)
| > decoder_ddc_loss: 0.00296 (0.00259)
| > ga_loss: 0.00397 (0.00620)
| > decoder_diff_spec_loss: 0.16638 (0.08393)
| > postnet_diff_spec_loss: 1.61417 (2.50230)
| > decoder_ssim_loss: 0.54864 (0.58755)
| > postnet_ssim_loss: 0.56227 (0.58478)
| > loss: 1.45568 (5.23634)
| > align_error: 0.98672 (0.98627)
| > max_spec_length: 730.0
| > max_text_length: 103.0
| > step_time: 21.6303
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 74/195 -- GLOBAL_STEP: 75
| > decoder_loss: 0.16894 (2.50801)
| > postnet_loss: 0.87962 (2.49905)
| > stopnet_loss: 0.33417 (0.61048)
| > decoder_coarse_loss: 0.12851 (2.17494)
| > decoder_ddc_loss: 0.00166 (0.00257)
| > ga_loss: 0.00329 (0.00531)
| > decoder_diff_spec_loss: 0.09778 (0.09582)
| > postnet_diff_spec_loss: 1.26165 (2.06643)
| > decoder_ssim_loss: 0.56711 (0.59162)
| > postnet_ssim_loss: 0.60017 (0.59898)
| > loss: 1.01759 (3.83230)
| > align_error: 0.98966 (0.98666)
| > max_spec_length: 812.0
| > max_text_length: 128.0
| > step_time: 24.9676
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 99/195 -- GLOBAL_STEP: 100
| > decoder_loss: 0.13837 (1.90996)
| > postnet_loss: 0.53094 (2.03662)
| > stopnet_loss: 0.33364 (0.54072)
| > decoder_coarse_loss: 0.09041 (1.65261)
| > decoder_ddc_loss: 0.00197 (0.00238)
| > ga_loss: 0.00286 (0.00473)
| > decoder_diff_spec_loss: 0.07020 (0.09270)
| > postnet_diff_spec_loss: 0.83398 (1.78797)
| > decoder_ssim_loss: 0.64371 (0.59933)
| > postnet_ssim_loss: 0.68419 (0.61408)
| > loss: 0.82043 (3.08881)
| > align_error: 0.98795 (0.98731)
| > max_spec_length: 774.0
| > max_text_length: 115.0
| > step_time: 24.1785
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 124/195 -- GLOBAL_STEP: 125
| > decoder_loss: 0.07066 (1.54349)
| > postnet_loss: 0.39794 (1.72865)
| > stopnet_loss: 0.31882 (0.49629)
| > decoder_coarse_loss: 0.05755 (1.33377)
| > decoder_ddc_loss: 0.00146 (0.00222)
| > ga_loss: 0.00246 (0.00431)
| > decoder_diff_spec_loss: 0.04339 (0.08536)
| > postnet_diff_spec_loss: 0.58380 (1.57657)
| > decoder_ssim_loss: 0.68006 (0.60992)
| > postnet_ssim_loss: 0.73716 (0.63143)
| > loss: 0.68772 (2.61927)
| > align_error: 0.98990 (0.98774)
| > max_spec_length: 812.0
| > max_text_length: 136.0
| > step_time: 26.3236
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 149/195 -- GLOBAL_STEP: 150
| > decoder_loss: 0.03828 (1.29419)
| > postnet_loss: 0.33461 (1.50368)
| > stopnet_loss: 0.30752 (0.46461)
| > decoder_coarse_loss: 0.03762 (1.11788)
| > decoder_ddc_loss: 0.00112 (0.00207)
| > ga_loss: 0.00220 (0.00398)
| > decoder_diff_spec_loss: 0.02056 (0.07618)
| > postnet_diff_spec_loss: 0.46399 (1.40584)
| > decoder_ssim_loss: 0.66767 (0.61895)
| > postnet_ssim_loss: 0.76170 (0.64922)
| > loss: 0.61165 (2.29043)
| > align_error: 0.99073 (0.98812)
| > max_spec_length: 853.0
| > max_text_length: 147.0
| > step_time: 27.6148
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 174/195 -- GLOBAL_STEP: 175
| > decoder_loss: 0.03072 (1.11359)
| > postnet_loss: 0.24894 (1.32778)
| > stopnet_loss: 0.29823 (0.44101)
| > decoder_coarse_loss: 0.02360 (0.96161)
| > decoder_ddc_loss: 0.00088 (0.00193)
| > ga_loss: 0.00203 (0.00371)
| > decoder_diff_spec_loss: 0.00659 (0.06693)
| > postnet_diff_spec_loss: 0.35491 (1.26145)
| > decoder_ssim_loss: 0.59823 (0.62161)
| > postnet_ssim_loss: 0.79244 (0.66808)
| > loss: 0.53805 (2.04356)
| > align_error: 0.99087 (0.98846)
| > max_spec_length: 859.0
| > max_text_length: 147.0
| > step_time: 28.2039
| > loader_time: 0.00
| > current_lr: 0.0001
--> TRAIN PERFORMACE -- EPOCH TIME: 4536.91 sec -- GLOBAL_STEP: 196
| > avg_decoder_loss: 0.99635
| > avg_postnet_loss: 1.21014
| > avg_stopnet_loss: 0.42589
| > avg_decoder_coarse_loss: 0.86015
| > avg_decoder_ddc_loss: 0.00181
| > avg_ga_loss: 0.00352
| > avg_decoder_diff_spec_loss: 0.06030
| > avg_postnet_diff_spec_loss: 1.16087
| > avg_decoder_ssim_loss: 0.61834
| > avg_postnet_ssim_loss: 0.68446
| > avg_loss: 1.88026
| > avg_align_error: 0.98874
| > avg_loader_time: 0.00470
| > avg_step_time: 23.20372
> EVALUATION
--> EVAL PERFORMANCE
| > avg_decoder_loss: 0.04355 (+0.00000)
| > avg_postnet_loss: 3.91026 (+0.00000)
| > avg_stopnet_loss: 0.34370 (+0.00000)
| > avg_decoder_coarse_loss: 0.00959 (+0.00000)
| > avg_decoder_ddc_loss: 0.00102 (+0.00000)
| > avg_ga_loss: 0.00230 (+0.00000)
| > avg_decoder_diff_spec_loss: 0.00559 (+0.00000)
| > avg_postnet_diff_spec_loss: 0.00602 (+0.00000)
| > avg_decoder_ssim_loss: 0.54868 (+0.00000)
| > avg_postnet_ssim_loss: 0.80211 (+0.00000)
| > avg_loss: 1.35677 (+0.00000)
| > avg_align_error: 0.98519 (+0.00000)
>> BEST MODEL : ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar
> EPOCH: 1/1000
> Number of output frames: 5
> TRAINING (2021-04-18 06:01:22)
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")
--> STEP: 3/195 -- GLOBAL_STEP: 200
| > decoder_loss: 0.03129 (0.02660)
| > postnet_loss: 1.62624 (1.47613)
| > stopnet_loss: 0.50374 (0.53868)
| > decoder_coarse_loss: 0.01738 (0.01841)
| > decoder_ddc_loss: 0.00537 (0.00533)
| > ga_loss: 0.00790 (0.00815)
| > decoder_diff_spec_loss: 0.00566 (0.00677)
| > postnet_diff_spec_loss: 1.92546 (1.72932)
| > decoder_ssim_loss: 0.31401 (0.34279)
| > postnet_ssim_loss: 0.46388 (0.48827)
| > loss: 1.15032 (1.07673)
| > align_error: 0.97053 (0.96764)
| > max_spec_length: 319.0
| > max_text_length: 58.0
| > step_time: 9.9091
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 28/195 -- GLOBAL_STEP: 225
| > decoder_loss: 0.08435 (0.09471)
| > postnet_loss: 0.53375 (1.20755)
| > stopnet_loss: 0.34806 (0.37907)
| > decoder_coarse_loss: 0.01095 (0.01401)
| > decoder_ddc_loss: 0.00317 (0.00300)
| > ga_loss: 0.00315 (0.00475)
| > decoder_diff_spec_loss: 0.00244 (0.00390)
| > postnet_diff_spec_loss: 0.81396 (1.62014)
| > decoder_ssim_loss: 0.38636 (0.33294)
| > postnet_ssim_loss: 0.64465 (0.52186)
| > loss: 0.66029 (1.00122)
| > align_error: 0.97028 (0.97328)
| > max_spec_length: 486.0
| > max_text_length: 71.0
| > step_time: 22.5940
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 53/195 -- GLOBAL_STEP: 250
| > decoder_loss: 0.06341 (0.08091)
| > postnet_loss: 0.59418 (1.01696)
| > stopnet_loss: 0.28890 (0.33931)
| > decoder_coarse_loss: 0.00840 (0.01181)
| > decoder_ddc_loss: 0.00217 (0.00264)
| > ga_loss: 0.00242 (0.00380)
| > decoder_diff_spec_loss: 0.00181 (0.00307)
| > postnet_diff_spec_loss: 0.85478 (1.37726)
| > decoder_ssim_loss: 0.34847 (0.33493)
| > postnet_ssim_loss: 0.63196 (0.55394)
| > loss: 0.65689 (0.88821)
| > align_error: 0.97703 (0.97537)
| > max_spec_length: 642.0
| > max_text_length: 96.0
| > step_time: 26.6962
| > loader_time: 0.00
| > current_lr: 0.0001
**myhost$ make gen**
source _venv/bin/activate; cd TTS && tts --text "Hello my friends" --model_path ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar --config_path ../config.json --out_path ..
2021-04-18 10:07:39.581533: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-04-18 10:07:39.581556: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:None
| > hop_length:256
| > win_length:1024
> Using model: Tacotron2
> Text: Hello my friends
> Text splitted to sentences.
['Hello my friends']
| > Decoder stopped with 'max_decoder_steps
> Processing time: 20.097420930862427
> Real-time factor: 0.24595510323637348
> Saving output to ../Hello_my_friends.wav
**myhost$ hexdump -C Hello_my_friends.wav**
00000000 52 49 46 46 44 fc 36 00 57 41 56 45 66 6d 74 20 |RIFFD.6.WAVEfmt |
00000010 10 00 00 00 01 00 01 00 22 56 00 00 44 ac 00 00 |........"V..D...|
00000020 02 00 10 00 64 61 74 61 20 fc 36 00 00 00 00 00 |....data .6.....|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0036fc4c
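For what it is worth, the header above decodes to a mono, 16-bit, 22050 Hz PCM file whose data chunk is entirely zeros. Here is a quick sketch I could run to confirm that programmatically (only the standard wave module plus numpy; the filename is the one produced above):

# check_silence.py -- sanity check that the generated WAV really is all zeros.
import wave

import numpy as np

with wave.open("Hello_my_friends.wav", "rb") as wav:
    print("channels:", wav.getnchannels())
    print("sample rate:", wav.getframerate())
    print("sample width (bytes):", wav.getsampwidth())
    frames = wav.readframes(wav.getnframes())

# 16-bit PCM -> signed int16 samples
samples = np.frombuffer(frames, dtype=np.int16)
print("number of samples:", samples.size)
print("max |amplitude|:", int(np.abs(samples).max()))  # 0 means complete silence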
Here is my Makefile:
PYTHON = python3
GIT = git
TTS = TTS
TTS_GIT = https://github.com/mozilla/TTS
LJSPEECH = LJSpeech-1.1
LJSPEECH_BZ2 = $(LJSPEECH).tar.bz2
VENV = _venv
DO-VENV = source $(VENV)/bin/activate;
SHELL = /bin/bash

all: $(TTS)

$(VENV):
	sudo apt-get install python3 python3-dev git gcc g++ espeak-ng
	$(PYTHON) -m venv $@
	$(DO-VENV) pip3 install --upgrade pip
	$(DO-VENV) pip3 install numpy Cython

$(TTS): $(VENV)
	$(GIT) clone $(TTS_GIT) $@
	$(DO-VENV) pip install -e $@

$(LJSPEECH_BZ2):
	wget http://data.keithito.com/data/speech/$@

$(LJSPEECH): $(LJSPEECH_BZ2)
	tar -xjf $<
	shuf $@/metadata.csv > $@/metadata_shuf.csv
	head -n 12000 $@/metadata_shuf.csv > $@/metadata_train.csv
	tail -n 1100 $@/metadata_shuf.csv > $@/metadata_val.csv

run: $(TTS) $(LJSPEECH)
	$(DO-VENV) PYTHONPATH=$@ $(PYTHON) config.py
	$(DO-VENV) cd $(TTS) && $(PYTHON) TTS/bin/train_tacotron.py --config_path ../config.json | tee training.log

gen:
	$(DO-VENV) cd $(TTS) && tts --text "Hello my friends" --model_path ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar --config_path ../config.json --out_path ..

distclean:
	$(RM) -r $(VENV) $(TTS) $(LJSPEECH)
And here is config.py, the script that generates config.json:

# Load the default config file and update it with the local paths and settings.
import json
from TTS.utils.io import load_config

CONFIG = load_config('./TTS/TTS/tts/configs/config.json')
CONFIG['datasets'][0]['path'] = '../LJSpeech-1.1/'  # point the dataset at LJSpeech
CONFIG['audio']['stats_path'] = None
CONFIG['output_path'] = '../'
CONFIG['phoneme_cache_path'] = '../Models/phoneme_cache'

with open('config.json', 'w') as fp:
    json.dump(CONFIG, fp)
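And for completeness, a quick sketch (reading back the file written above) that I could use to confirm the overrides actually ended up in config.json:

# verify_config.py -- read back config.json and print the keys overridden above.
import json

with open('config.json') as fp:
    cfg = json.load(fp)

print('dataset path       =', cfg['datasets'][0]['path'])
print('audio stats_path   =', cfg['audio']['stats_path'])
print('output_path        =', cfg['output_path'])
print('phoneme_cache_path =', cfg['phoneme_cache_path'])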