Mozilla TTS output voice still sounds robotic after almost 400K steps

Hi All,
I trained Tacotron2 on my dataset, which consists of about 27 hours of audio in clips of about 10 seconds each, at a 16000 Hz sample rate. Training took about four and a half days. At this stage the test audio still sounds a little bit robotic. In some test clips words are missing, and in others there are repetitions. The graphs in TensorBoard look normal.



My config values mostly follow the defaults. Could anyone have a look at my config and let me know what could be wrong? Are there any parameters I can change to remove the robotic sound from the test audio and improve the quality of the output waveforms?
My config is as below:
{
"model": "Tacotron2",
"run_name": "stspeech-stft_params",
"run_description": "tacotron2 constant stf parameters",
"audio":{
    "num_mels": 80,
    "num_freq": 1025,
    "sample_rate": 16000,
    "win_length": 1024,
    "hop_length": 256,
    "frame_length_ms": null,
    "frame_shift_ms": null,
    "preemphasis": 0.98,
    "min_level_db": -100,
    "ref_level_db": 20,
    "power": 1.5,
    "griffin_lim_iters": 30,
    "signal_norm": true,
    "symmetric_norm": true,
    "max_norm": 4.0,
    "clip_norm": true,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "do_trim_silence": true,
    "trim_db": 60
},
"characters":{
    "pad": "_",
    "eos": "~",
    "bos": "^",
    "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZÇĞİÖŞÜabcdefghijklmnopqrstuvwxyzçğıöşü!'(),-.:;? ",
    "punctuations": "!'(),-.:;? ",
    "phonemes": "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
},

"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [],
"batch_size": 32,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
"loss_masking": true,
"run_eval": true,
"test_delay_epochs": 5,
"test_sentences_file": "tr_sentences.txt",

"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 1000,
"lr": 0.00001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,

"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,

"attention_type": "original",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": false,
"bidirectional_decoder": false,
"stopnet": true,
"separate_stopnet": true,

"print_step": 5,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,

"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 1,
"num_val_loader_workers": 1,
"batch_group_size": 0,
"min_seq_len": 6,
"max_seq_len": 150,

"output_path": "train_logs/",

"phoneme_cache_path": "mozilla_tr_phonemes_2_1",
"use_phonemes": true,
"phoneme_language": "tr",

"use_speaker_embedding": false,
"style_wav_for_test": null,
"use_gst": false,

"datasets":
    [
        {
            "name": "stspeech",
            "path": "STS-22K/",
            "meta_file_train": "metadata_train.csv",
            "meta_file_val": "metadata_test.csv"
        }
    ]

}
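
As a reading aid for the "gradual_training" entry in the config above: each item is understood here as [start_step, r, batch_size], where r is the number of frames the decoder predicts per step. The sketch below is a hypothetical helper ("schedule_at" is an illustrative name, not a Mozilla TTS function) that looks up which values would be active at a given step, assuming the last entry whose start step has been reached is the one that applies.

```python
# Schedule copied from the "gradual_training" entry of the config above.
# Assumption: each entry is [start_step, r, batch_size].
GRADUAL_TRAINING = [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32],
]


def schedule_at(step, schedule=GRADUAL_TRAINING):
    """Return (r, batch_size) of the last entry whose start step is <= step.

    Hypothetical helper for illustration only.
    """
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, entry_r, entry_batch_size in schedule:
        if step >= start_step:
            r, batch_size = entry_r, entry_batch_size
    return r, batch_size


if __name__ == "__main__":
    for step in (0, 1, 60000, 200000, 400000):
        print(step, schedule_at(step))
```

Under this reading, a run that is close to 400K steps would have been training with r = 1 and a batch size of 32 since step 290000.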

Four config parameters differ from the default values:
1: The sample rate.
2: “griffin_lim_iters”: I reduced it to 30 (the default was 60) to reduce the training time.
3: The number of loader workers: I reduced it to 1 (the default was 4). I thought it had something to do with the number of GPUs, and since I have just one GPU, I thought I needed to set it to 1.
4: The min and max seq length parameters. Actually, I forgot to change them to match the lengths in my data. How much effect does that have on the quality?
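
One way to get a feel for the seq length question is to count how many utterances fall outside the configured range. The sketch below assumes that "min_seq_len" and "max_seq_len" are compared against the character count of each transcript and that anything outside the range is left out of training; that is an assumption about the loader and worth checking against the code. The helper names, the "|" delimiter and the column index are hypothetical.

```python
MIN_SEQ_LEN = 6    # "min_seq_len" from the config
MAX_SEQ_LEN = 150  # "max_seq_len" from the config


def split_by_text_length(texts, min_len=MIN_SEQ_LEN, max_len=MAX_SEQ_LEN):
    """Split texts into (kept, dropped) by character count.

    Assumption: bounds are inclusive and measured on the raw transcript.
    """
    kept, dropped = [], []
    for text in texts:
        if min_len <= len(text) <= max_len:
            kept.append(text)
        else:
            dropped.append(text)
    return kept, dropped


def load_transcripts(metadata_path, delimiter="|", text_column=1):
    """Read the transcript column of a metadata file.

    Assumption: one utterance per line, fields separated by "|", transcript
    in the second field. Adjust to the real layout of metadata_train.csv.
    """
    with open(metadata_path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split(delimiter)[text_column]
                for line in handle if line.strip()]


if __name__ == "__main__":
    # Made-up example strings, not taken from the dataset.
    samples = ["abcde", "abcdef", "a" * 150, "a" * 151]
    kept, dropped = split_by_text_length(samples)
    print(f"kept {len(kept)} of {len(samples)}, dropped {len(dropped)}")
```

If a large share of the 10-second clips have transcripts longer than 150 characters, the model may be seeing noticeably less than the full 27 hours, so the count is worth running on the real metadata file.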

I appreciate any insights or comments or suggestions about what could be wrong with my training.
Many many thanks in advance.

2: “griffin_lim_iters”: I reduced it to 30 (the default was 60) to reduce the training time.

Griffin-Lim has nothing to do with training speed. It is the algorithm used to turn the predicted spectrogram into a waveform when synthesizing speech.

3: The number of loader workers: I reduced it to 1 (the default was 4). I thought it had something to do with the number of GPUs, and since I have just one GPU, I thought I needed to set it to 1.

The loader workers are used to load the data into batches during training, and they run on the CPU. So setting the value to 1 might actually slow training down. The default of 4 should be fine.
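
A minimal sketch of that point: the worker count can be chosen from the number of CPU cores rather than the number of GPUs. "pick_loader_workers" is a hypothetical helper, not part of Mozilla TTS; it simply caps the default of 4 at the number of available cores.

```python
import os


def pick_loader_workers(default=4, cpu_count=None):
    """Choose a data-loader worker count from CPU cores, not GPU count.

    Hypothetical helper: caps the default at the number of available cores
    and never returns less than 1.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(default, cpu_count))


if __name__ == "__main__":
    # The result could be used for "num_loader_workers" in the config.
    print("num_loader_workers:", pick_loader_workers())
```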

At this stage the test audio still sounds a little bit robotic.

The default TTS setup (using Griffin-Lim as the vocoder) will always sound robotic.
If you want natural-sounding speech, neural vocoders are what you are looking for.

Mozilla TTS has adapted https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN for this purpose.


Hi sanjaesc,
Thanks a lot. That was very helpful.