Hello,
I followed this and that to train a model on LJSpeech and synthesize a sound, but all I get are zeros and the resulting WAV is silent.
I am not sure what I did wrong, and I am quite new to this.
I have wrapped all the commands in a Makefile and a Python venv (see the end of the post); I hope that does not add confusion. Thank you very much for your help.
Here are my logs; see the hexdump of the WAV at the end:
**myhost$ make run**
source _venv/bin/activate; PYTHONPATH=run python3 config.py
source _venv/bin/activate; cd TTS && python3 TTS/bin/train_tacotron.py --config_path ../config.json | tee training.log
2021-04-18 04:44:38.213636: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-04-18 04:44:38.213657: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/grad_scaler.py:116: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
> Using CUDA: False
> Number of GPUs: 0
> Mixed precision mode is ON
> Git Hash: e9e0784
> Experiment folder: ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:None
| > hop_length:256
| > win_length:1024
| > Found 13100 files in /home/redacted/tts/LJSpeech-1.1
> Using model: Tacotron2
> Model has 47914548 parameters
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 187
| > Min length sequence: 5
| > Avg length sequence: 98.3403500655409
| > Num. instances discarded by max-min (max=153, min=6) seq limits: 476
| > Batch group size: 128.
> EPOCH: 0/1000
> Number of output frames: 7
> TRAINING (2021-04-18 04:44:39)
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")
--> STEP: 24/195 -- GLOBAL_STEP: 25
| > decoder_loss: 4.55581 (4.81456)
| > postnet_loss: 2.73150 (5.05209)
| > stopnet_loss: 0.72497 (0.76779)
| > decoder_coarse_loss: 4.19881 (4.71119)
| > decoder_ddc_loss: 0.00184 (0.00324)
| > ga_loss: 0.00559 (0.00769)
| > decoder_diff_spec_loss: 0.01748 (0.01792)
| > postnet_diff_spec_loss: 2.11584 (3.49522)
| > decoder_ssim_loss: 0.53970 (0.55583)
| > postnet_ssim_loss: 0.53068 (0.55149)
| > loss: 5.88996 (7.22106)
| > align_error: 0.98973 (0.98480)
| > max_spec_length: 613.0
| > max_text_length: 103.0
| > step_time: 18.5761
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 49/195 -- GLOBAL_STEP: 50
| > decoder_loss: 0.57258 (3.66746)
| > postnet_loss: 0.96710 (3.34215)
| > stopnet_loss: 0.50418 (0.72119)
| > decoder_coarse_loss: 0.36679 (3.19028)
| > decoder_ddc_loss: 0.00296 (0.00259)
| > ga_loss: 0.00397 (0.00620)
| > decoder_diff_spec_loss: 0.16638 (0.08393)
| > postnet_diff_spec_loss: 1.61417 (2.50230)
| > decoder_ssim_loss: 0.54864 (0.58755)
| > postnet_ssim_loss: 0.56227 (0.58478)
| > loss: 1.45568 (5.23634)
| > align_error: 0.98672 (0.98627)
| > max_spec_length: 730.0
| > max_text_length: 103.0
| > step_time: 21.6303
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 74/195 -- GLOBAL_STEP: 75
| > decoder_loss: 0.16894 (2.50801)
| > postnet_loss: 0.87962 (2.49905)
| > stopnet_loss: 0.33417 (0.61048)
| > decoder_coarse_loss: 0.12851 (2.17494)
| > decoder_ddc_loss: 0.00166 (0.00257)
| > ga_loss: 0.00329 (0.00531)
| > decoder_diff_spec_loss: 0.09778 (0.09582)
| > postnet_diff_spec_loss: 1.26165 (2.06643)
| > decoder_ssim_loss: 0.56711 (0.59162)
| > postnet_ssim_loss: 0.60017 (0.59898)
| > loss: 1.01759 (3.83230)
| > align_error: 0.98966 (0.98666)
| > max_spec_length: 812.0
| > max_text_length: 128.0
| > step_time: 24.9676
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 99/195 -- GLOBAL_STEP: 100
| > decoder_loss: 0.13837 (1.90996)
| > postnet_loss: 0.53094 (2.03662)
| > stopnet_loss: 0.33364 (0.54072)
| > decoder_coarse_loss: 0.09041 (1.65261)
| > decoder_ddc_loss: 0.00197 (0.00238)
| > ga_loss: 0.00286 (0.00473)
| > decoder_diff_spec_loss: 0.07020 (0.09270)
| > postnet_diff_spec_loss: 0.83398 (1.78797)
| > decoder_ssim_loss: 0.64371 (0.59933)
| > postnet_ssim_loss: 0.68419 (0.61408)
| > loss: 0.82043 (3.08881)
| > align_error: 0.98795 (0.98731)
| > max_spec_length: 774.0
| > max_text_length: 115.0
| > step_time: 24.1785
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 124/195 -- GLOBAL_STEP: 125
| > decoder_loss: 0.07066 (1.54349)
| > postnet_loss: 0.39794 (1.72865)
| > stopnet_loss: 0.31882 (0.49629)
| > decoder_coarse_loss: 0.05755 (1.33377)
| > decoder_ddc_loss: 0.00146 (0.00222)
| > ga_loss: 0.00246 (0.00431)
| > decoder_diff_spec_loss: 0.04339 (0.08536)
| > postnet_diff_spec_loss: 0.58380 (1.57657)
| > decoder_ssim_loss: 0.68006 (0.60992)
| > postnet_ssim_loss: 0.73716 (0.63143)
| > loss: 0.68772 (2.61927)
| > align_error: 0.98990 (0.98774)
| > max_spec_length: 812.0
| > max_text_length: 136.0
| > step_time: 26.3236
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 149/195 -- GLOBAL_STEP: 150
| > decoder_loss: 0.03828 (1.29419)
| > postnet_loss: 0.33461 (1.50368)
| > stopnet_loss: 0.30752 (0.46461)
| > decoder_coarse_loss: 0.03762 (1.11788)
| > decoder_ddc_loss: 0.00112 (0.00207)
| > ga_loss: 0.00220 (0.00398)
| > decoder_diff_spec_loss: 0.02056 (0.07618)
| > postnet_diff_spec_loss: 0.46399 (1.40584)
| > decoder_ssim_loss: 0.66767 (0.61895)
| > postnet_ssim_loss: 0.76170 (0.64922)
| > loss: 0.61165 (2.29043)
| > align_error: 0.99073 (0.98812)
| > max_spec_length: 853.0
| > max_text_length: 147.0
| > step_time: 27.6148
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 174/195 -- GLOBAL_STEP: 175
| > decoder_loss: 0.03072 (1.11359)
| > postnet_loss: 0.24894 (1.32778)
| > stopnet_loss: 0.29823 (0.44101)
| > decoder_coarse_loss: 0.02360 (0.96161)
| > decoder_ddc_loss: 0.00088 (0.00193)
| > ga_loss: 0.00203 (0.00371)
| > decoder_diff_spec_loss: 0.00659 (0.06693)
| > postnet_diff_spec_loss: 0.35491 (1.26145)
| > decoder_ssim_loss: 0.59823 (0.62161)
| > postnet_ssim_loss: 0.79244 (0.66808)
| > loss: 0.53805 (2.04356)
| > align_error: 0.99087 (0.98846)
| > max_spec_length: 859.0
| > max_text_length: 147.0
| > step_time: 28.2039
| > loader_time: 0.00
| > current_lr: 0.0001
--> TRAIN PERFORMACE -- EPOCH TIME: 4536.91 sec -- GLOBAL_STEP: 196
| > avg_decoder_loss: 0.99635
| > avg_postnet_loss: 1.21014
| > avg_stopnet_loss: 0.42589
| > avg_decoder_coarse_loss: 0.86015
| > avg_decoder_ddc_loss: 0.00181
| > avg_ga_loss: 0.00352
| > avg_decoder_diff_spec_loss: 0.06030
| > avg_postnet_diff_spec_loss: 1.16087
| > avg_decoder_ssim_loss: 0.61834
| > avg_postnet_ssim_loss: 0.68446
| > avg_loss: 1.88026
| > avg_align_error: 0.98874
| > avg_loader_time: 0.00470
| > avg_step_time: 23.20372
> EVALUATION
--> EVAL PERFORMANCE
| > avg_decoder_loss: 0.04355 (+0.00000)
| > avg_postnet_loss: 3.91026 (+0.00000)
| > avg_stopnet_loss: 0.34370 (+0.00000)
| > avg_decoder_coarse_loss: 0.00959 (+0.00000)
| > avg_decoder_ddc_loss: 0.00102 (+0.00000)
| > avg_ga_loss: 0.00230 (+0.00000)
| > avg_decoder_diff_spec_loss: 0.00559 (+0.00000)
| > avg_postnet_diff_spec_loss: 0.00602 (+0.00000)
| > avg_decoder_ssim_loss: 0.54868 (+0.00000)
| > avg_postnet_ssim_loss: 0.80211 (+0.00000)
| > avg_loss: 1.35677 (+0.00000)
| > avg_align_error: 0.98519 (+0.00000)
>> BEST MODEL : ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar
> EPOCH: 1/1000
> Number of output frames: 5
> TRAINING (2021-04-18 06:01:22)
/home/redacted/tts/_venv/lib/python3.6/site-packages/torch/cuda/amp/autocast_mode.py:118: UserWarning: torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.
warnings.warn("torch.cuda.amp.autocast only affects CUDA ops, but CUDA is not available. Disabling.")
--> STEP: 3/195 -- GLOBAL_STEP: 200
| > decoder_loss: 0.03129 (0.02660)
| > postnet_loss: 1.62624 (1.47613)
| > stopnet_loss: 0.50374 (0.53868)
| > decoder_coarse_loss: 0.01738 (0.01841)
| > decoder_ddc_loss: 0.00537 (0.00533)
| > ga_loss: 0.00790 (0.00815)
| > decoder_diff_spec_loss: 0.00566 (0.00677)
| > postnet_diff_spec_loss: 1.92546 (1.72932)
| > decoder_ssim_loss: 0.31401 (0.34279)
| > postnet_ssim_loss: 0.46388 (0.48827)
| > loss: 1.15032 (1.07673)
| > align_error: 0.97053 (0.96764)
| > max_spec_length: 319.0
| > max_text_length: 58.0
| > step_time: 9.9091
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 28/195 -- GLOBAL_STEP: 225
| > decoder_loss: 0.08435 (0.09471)
| > postnet_loss: 0.53375 (1.20755)
| > stopnet_loss: 0.34806 (0.37907)
| > decoder_coarse_loss: 0.01095 (0.01401)
| > decoder_ddc_loss: 0.00317 (0.00300)
| > ga_loss: 0.00315 (0.00475)
| > decoder_diff_spec_loss: 0.00244 (0.00390)
| > postnet_diff_spec_loss: 0.81396 (1.62014)
| > decoder_ssim_loss: 0.38636 (0.33294)
| > postnet_ssim_loss: 0.64465 (0.52186)
| > loss: 0.66029 (1.00122)
| > align_error: 0.97028 (0.97328)
| > max_spec_length: 486.0
| > max_text_length: 71.0
| > step_time: 22.5940
| > loader_time: 0.00
| > current_lr: 0.0001
--> STEP: 53/195 -- GLOBAL_STEP: 250
| > decoder_loss: 0.06341 (0.08091)
| > postnet_loss: 0.59418 (1.01696)
| > stopnet_loss: 0.28890 (0.33931)
| > decoder_coarse_loss: 0.00840 (0.01181)
| > decoder_ddc_loss: 0.00217 (0.00264)
| > ga_loss: 0.00242 (0.00380)
| > decoder_diff_spec_loss: 0.00181 (0.00307)
| > postnet_diff_spec_loss: 0.85478 (1.37726)
| > decoder_ssim_loss: 0.34847 (0.33493)
| > postnet_ssim_loss: 0.63196 (0.55394)
| > loss: 0.65689 (0.88821)
| > align_error: 0.97703 (0.97537)
| > max_spec_length: 642.0
| > max_text_length: 96.0
| > step_time: 26.6962
| > loader_time: 0.00
| > current_lr: 0.0001
**myhost$ make gen**
source _venv/bin/activate; cd TTS && tts --text "Hello my friends" --model_path ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar --config_path ../config.json --out_path ..
2021-04-18 10:07:39.581533: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-04-18 10:07:39.581556: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:None
| > hop_length:256
| > win_length:1024
> Using model: Tacotron2
> Text: Hello my friends
> Text splitted to sentences.
['Hello my friends']
| > Decoder stopped with 'max_decoder_steps
> Processing time: 20.097420930862427
> Real-time factor: 0.24595510323637348
> Saving output to ../Hello_my_friends.wav
**myhost$ hexdump -C Hello_my_friends.wav**
00000000 52 49 46 46 44 fc 36 00 57 41 56 45 66 6d 74 20 |RIFFD.6.WAVEfmt |
00000010 10 00 00 00 01 00 01 00 22 56 00 00 44 ac 00 00 |........"V..D...|
00000020 02 00 10 00 64 61 74 61 20 fc 36 00 00 00 00 00 |....data .6.....|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0036fc4c
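For what it is worth, the header above decodes to a mono, 16-bit, 22050 Hz PCM file whose data chunk is entirely zeros. Here is a quick sketch I could run to confirm that programmatically (only the standard wave module plus numpy; the filename is the one produced above):

# check_silence.py -- sanity check that the generated WAV really is all zeros.
import wave

import numpy as np

with wave.open("Hello_my_friends.wav", "rb") as wav:
    print("channels:", wav.getnchannels())
    print("sample rate:", wav.getframerate())
    print("sample width (bytes):", wav.getsampwidth())
    frames = wav.readframes(wav.getnframes())

# 16-bit PCM -> signed int16 samples
samples = np.frombuffer(frames, dtype=np.int16)
print("number of samples:", samples.size)
print("max |amplitude|:", int(np.abs(samples).max()))  # 0 means complete silence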
Here is my Makefile:
PYTHON = python3
GIT = git
TTS = TTS
TTS_GIT = https://github.com/mozilla/TTS
LJSPEECH = LJSpeech-1.1
LJSPEECH_BZ2 = $(LJSPEECH).tar.bz2
VENV = _venv
DO-VENV = source $(VENV)/bin/activate;
SHELL = /bin/bash

all: $(TTS)

$(VENV):
	sudo apt-get install python3 python3-dev git gcc g++ espeak-ng
	$(PYTHON) -m venv $@
	$(DO-VENV) pip3 install --upgrade pip
	$(DO-VENV) pip3 install numpy Cython

$(TTS): $(VENV)
	$(GIT) clone $(TTS_GIT) $@
	$(DO-VENV) pip install -e $@

$(LJSPEECH_BZ2):
	wget http://data.keithito.com/data/speech/$@

$(LJSPEECH): $(LJSPEECH_BZ2)
	tar -xjf $<
	shuf $@/metadata.csv > $@/metadata_shuf.csv
	head -n 12000 $@/metadata_shuf.csv > $@/metadata_train.csv
	tail -n 1100 $@/metadata_shuf.csv > $@/metadata_val.csv

run: $(TTS) $(LJSPEECH)
	$(DO-VENV) PYTHONPATH=$@ $(PYTHON) config.py
	$(DO-VENV) cd $(TTS) && $(PYTHON) TTS/bin/train_tacotron.py --config_path ../config.json | tee training.log

gen:
	$(DO-VENV) cd $(TTS) && tts --text "Hello my friends" --model_path ../ljspeech-ddc-April-18-2021_04+44AM-e9e0784/best_model.pth.tar --config_path ../config.json --out_path ..

distclean:
	$(RM) -r $(VENV) $(TTS) $(LJSPEECH)
And here is config.py, the script that generates config.json:

# Load the default config file and update it with the local paths and settings.
import json
from TTS.utils.io import load_config

CONFIG = load_config('./TTS/TTS/tts/configs/config.json')
CONFIG['datasets'][0]['path'] = '../LJSpeech-1.1/'  # point the dataset at LJSpeech
CONFIG['audio']['stats_path'] = None
CONFIG['output_path'] = '../'
CONFIG['phoneme_cache_path'] = '../Models/phoneme_cache'

with open('config.json', 'w') as fp:
    json.dump(CONFIG, fp)
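And for completeness, a quick sketch (reading back the file written above) that I could use to confirm the overrides actually ended up in config.json:

# verify_config.py -- read back config.json and print the keys overridden above.
import json

with open('config.json') as fp:
    cfg = json.load(fp)

print('dataset path       =', cfg['datasets'][0]['path'])
print('audio stats_path   =', cfg['audio']['stats_path'])
print('output_path        =', cfg['output_path'])
print('phoneme_cache_path =', cfg['phoneme_cache_path'])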