Mozilla TTS output voice still sounds robotic after almost 400K

Yilmaz_Ay · April 13, 2020, 6:45am

Hi All,
I trained Tacotron2 on my dataset, which consists of about 27 hours of audio in clips of about 10 seconds each, at a 16000 Hz sample rate. Training took about four and a half days. At this stage the test audio still sounds a little robotic. In some test samples words are missing, and in others there are repetitions. The graphs in TensorBoard look normal.

My config values mostly follow the defaults. Could anyone have a look at my config and let me know what could be wrong? Are there any parameters I can change to remove the robotic sound from the test samples and improve the quality of the output waveforms?
My config is below:
{
"model": "Tacotron2",
"run_name": "stspeech-stft_params",
"run_description": "tacotron2 constant stf parameters",
"audio":{
    "num_mels": 80,
    "num_freq": 1025,
    "sample_rate": 16000,
    "win_length": 1024,
    "hop_length": 256,
    "frame_length_ms": null,
    "frame_shift_ms": null,
    "preemphasis": 0.98,
    "min_level_db": -100,
    "ref_level_db": 20,
    "power": 1.5,
    "griffin_lim_iters": 30,
    "signal_norm": true,
    "symmetric_norm": true,
    "max_norm": 4.0,
    "clip_norm": true,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "do_trim_silence": true,
    "trim_db": 60
},
"characters":{
    "pad": "_",
    "eos": "~",
    "bos": "^",
    "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZÇĞİÖŞÜabcdefghijklmnopqrstuvwxyzçğıöşü!'(),-.:;? ",
    "punctuations":"!'(),-.:;? ",
    "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
},

"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [],
"batch_size": 32,
"eval_batch_size":16,
"r": 7,
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
"loss_masking": true,
"run_eval": true,
"test_delay_epochs": 5,
"test_sentences_file": "tr_sentences.txt",

"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 1000,
"lr": 0.00001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,

"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,

"attention_type": "original",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": false,
"bidirectional_decoder": false,
"stopnet": true,
"separate_stopnet": true,

"print_step": 5,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,

"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 1,
"num_val_loader_workers": 1,
"batch_group_size": 0,
"min_seq_len": 6,
"max_seq_len": 150,

"output_path": "train_logs/",

"phoneme_cache_path": "mozilla_tr_phonemes_2_1",
"use_phonemes": true,
"phoneme_language": "tr",

"use_speaker_embedding": false,
"style_wav_for_test": null,
"use_gst": false,

"datasets":
    [
        {
            "name": "stspeech",
            "path": "STS-22K/",
            "meta_file_train": "metadata_train.csv",
            "meta_file_val": "metadata_test.csv"
        }
    ]

}
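Two values in this config are easier to judge once worked out: the analysis frame timing at the new sample rate, and which `gradual_training` entry is active at a given step. The following is a minimal sketch, not code from Mozilla TTS. The helper names `frame_timing_ms` and `active_schedule` are hypothetical, each schedule entry is assumed to mean `[start_step, r, batch_size]`, and 22050 Hz is assumed to be the sample rate the default `win_length` and `hop_length` were chosen for.

```python
def frame_timing_ms(sample_rate, win_length, hop_length):
    """Return (window length, frame shift) in milliseconds."""
    return 1000.0 * win_length / sample_rate, 1000.0 * hop_length / sample_rate


def active_schedule(gradual_training, step):
    """Return (r, batch_size) from the last entry whose start step is <= step.

    Assumes each entry is [start_step, r, batch_size], sorted by start_step.
    """
    r, batch_size = None, None
    for start_step, entry_r, entry_batch_size in gradual_training:
        if step >= start_step:
            r, batch_size = entry_r, entry_batch_size
    return r, batch_size


# Values copied from the config above.
SCHEDULE = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]

print(frame_timing_ms(16000, 1024, 256))  # (64.0, 16.0)
# Same window and hop at the assumed default rate, for comparison.
print(frame_timing_ms(22050, 1024, 256))
print(active_schedule(SCHEDULE, 300000))  # (1, 32)
```

With `win_length` and `hop_length` left unchanged, lowering the sample rate to 16000 Hz makes each frame cover more time (64 ms windows, 16 ms shift), which may be worth reviewing together with the sample rate change.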

Four config parameters differ from the default values:
1: The sample rate.
2: "griffin_lim_iters". I reduced it to 30; the default was 60. I did this to reduce the training time.
3: I reduced the number of workers to 1; the defaults were 4. I thought it had something to do with the number of GPUs. Since I have just one GPU, I thought I needed to set them to 1.
4: The min and max seq length parameters. Actually, I forgot to change them to match my data's lengths. How much effect does this have on quality?
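On point 4, one way to gauge the effect is to count how many utterances the limits would exclude. The sketch below assumes that the trainer discards any item whose transcript length in characters falls outside `min_seq_len` and `max_seq_len`; that behaviour should be checked against the TTS version in use. The function name and the sample transcripts are hypothetical.

```python
def split_by_seq_len(transcripts, min_seq_len=6, max_seq_len=150):
    """Split transcripts into (kept, dropped) by character count.

    Assumption: an item is kept only if
    min_seq_len <= len(text) <= max_seq_len.
    """
    kept, dropped = [], []
    for text in transcripts:
        if min_seq_len <= len(text) <= max_seq_len:
            kept.append(text)
        else:
            dropped.append(text)
    return kept, dropped


# Hypothetical transcripts: one normal, one too short, one too long.
samples = ["Merhaba.", "Evet", "a" * 151]
kept, dropped = split_by_seq_len(samples)
print(f"kept {len(kept)} of {len(samples)} items, dropped {len(dropped)}")
```

Running the same count over the real metadata file would show how much of the 27 hours actually reaches training. If many 10-second clips have transcripts longer than 150 characters, a noticeable share of the data may be silently skipped.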

I appreciate any insights or comments or suggestions about what could be wrong with my training.
Many many thanks in advance.

sanjaesc · April 13, 2020, 3:43pm

2: "griffin_lim_iters". I reduced it to 30; the default was 60. I did this to reduce the training time.

Griffin-Lim has nothing to do with training speed. It is the algorithm that reconstructs a waveform from the predicted spectrogram when speech is synthesized.
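To make the role of `griffin_lim_iters` concrete, here is a minimal NumPy sketch of Griffin-Lim phase reconstruction. It is an illustration only, not the Mozilla TTS implementation: the `stft`, `istft`, `griffin_lim`, and `spectral_error` helpers, the small `n_fft` and `hop` values, and the synthetic test tone are all assumptions made for the example.

```python
import numpy as np


def stft(x, n_fft, hop):
    """Hann-windowed STFT; returns shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft, hop):
    """Inverse of stft() using windowed overlap-add with normalization."""
    win = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + (n_frames - 1) * hop
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(mag, n_fft, hop, n_iter, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    wav = istft(mag * phase, n_fft, hop)
    for _ in range(n_iter):
        # Keep the target magnitude, take the phase from the current estimate.
        phase = np.exp(1j * np.angle(stft(wav, n_fft, hop)))
        wav = istft(mag * phase, n_fft, hop)
    return wav


def spectral_error(wav, mag, n_fft, hop):
    """Relative distance between the STFT magnitude of `wav` and `mag`."""
    diff = np.abs(stft(wav, n_fft, hop)) - mag
    return float(np.linalg.norm(diff) / np.linalg.norm(mag))


sr, n_fft, hop = 16000, 256, 64
t = np.arange(n_fft + 60 * hop) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
mag = np.abs(stft(signal, n_fft, hop))

for iters in (1, 30, 60):
    wav = griffin_lim(mag, n_fft, hop, iters)
    print(iters, round(spectral_error(wav, mag, n_fft, hop), 4))
```

The loop runs only when a spectrogram is turned into audio, so the iteration count affects synthesis time and output quality, not the model's training steps.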

3: I reduced the number of workers to 1; the defaults were 4. I thought it had something to do with the number of GPUs. Since I have just one GPU, I thought I needed to set them to 1.

The loader workers (num_loader_workers) load the data into batches during training, and they run on the CPU. Setting the value to 1 might actually slow training down. The default of 4 should be fine.
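The effect can be illustrated with a small standard-library simulation. This is not the PyTorch data loader: `load_item` is a hypothetical stand-in for reading and preprocessing one utterance, and threads with a fixed sleep are used only to mimic loading work that runs in parallel on the CPU side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_item(index, delay=0.02):
    """Hypothetical stand-in for loading and preprocessing one utterance."""
    time.sleep(delay)
    return index


def load_batch(indices, num_workers):
    """Load a batch with `num_workers` parallel loaders.

    Returns (items in order, elapsed seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        items = list(pool.map(load_item, indices))
    return items, time.perf_counter() - start


batch = list(range(32))
for workers in (1, 4):
    _, seconds = load_batch(batch, workers)
    print(f"num_workers={workers}: {seconds:.2f}s per batch")
```

In this toy setup a single worker loads the batch serially, so the GPU would sit idle waiting for data. The worker count relates to available CPU cores, not to the number of GPUs.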

At this stage the test audio still sounds a little robotic.

The default TTS setup, which uses Griffin-Lim as the vocoder, will always sound robotic.
If you want natural-sounding speech, neural vocoders are what you are looking for.

Mozilla TTS has adapted two of them for this purpose: erogol/WaveRNN on GitHub (a PyTorch implementation of DeepMind's WaveRNN model) and erogol/ParallelWaveGAN on GitHub (a ParallelWaveGAN adaptation for Mozilla TTS).

Yilmaz_Ay · April 14, 2020, 6:42am

Hi sanjaesc,
Thanks a lot. That was very helpful.
