Cannot continue training model with Forward Attn+Batch Norm

georroussos · February 22, 2020, 12:06pm

Cheers everyone. After the release of the model trained with Forward Attention and Batch Normalization, I wanted to see how it would fare when retrained on a new voice. I have a clean dataset in LJSpeech format (directories and metadata), but it refuses to train and fails with an assertion error.

I tried with LJSpeech instead and got the same error. I am on the dev branch, which the model was also trained on, and I am continuing training using the model's config and checkpoint. I have decreased min_seq_len and have also tried removing any wavs shorter than 1 second, but it did not help. Would anyone have any tips? Thanks!

sanjaesc · February 22, 2020, 12:41pm

Did you also remove the files from the metadata.csv?

georroussos · February 22, 2020, 12:47pm

I did, yeah. I also tried a small subset of 10 sentences that contain no numbers and are longer than 1 second, and the result is still the same.
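For anyone repeating this kind of pruning, a minimal sketch of keeping the wavs and the metadata in sync follows. It assumes the LJSpeech layout (pipe-separated rows of `file_id|text|normalized text`, with audio under `wavs/`) and PCM wavs that Python's `wave` module can read. The function names, parameters, and the 1-second threshold are illustrative and not part of TTS.

```python
import re
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def filter_metadata(dataset_dir, meta_in, meta_out, min_seconds=1.0):
    """Write meta_out with only the rows whose wav exists, is at least
    min_seconds long, and whose text fields contain no digits.

    Returns (kept_rows, dropped_rows) as lists of raw metadata lines.
    """
    dataset_dir = Path(dataset_dir)
    kept, dropped = [], []
    with open(dataset_dir / meta_in, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Assumed row layout: file_id|text|normalized text
            fields = line.split("|")
            wav_path = dataset_dir / "wavs" / (fields[0] + ".wav")
            has_digits = any(re.search(r"\d", text) for text in fields[1:])
            ok = (
                wav_path.is_file()
                and wav_duration_seconds(wav_path) >= min_seconds
                and not has_digits
            )
            (kept if ok else dropped).append(line)
    with open(dataset_dir / meta_out, "w", encoding="utf-8") as f:
        f.writelines(row + "\n" for row in kept)
    return kept, dropped
```

Writing to a new metadata file rather than editing in place leaves the original untouched, so the dropped rows can still be inspected afterwards.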

nmstoker · February 23, 2020, 12:48pm

Sorry for the obvious question, but what size is sentence35.wav in that exact directory path? (Best to copy and paste the path to be 100% sure.) The problem might well be something else, given that you also tried it with LJSpeech, but that’s the first step. Then look at the code in TTSDataset.py and work back from line 119.
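One way to run that size check over every entry in the metadata, rather than a single file, is sketched here. It assumes LJSpeech-style rows whose first pipe-separated field is the file id and audio stored under `wavs/`. The function name `find_suspect_wavs` and the `min_bytes` threshold are hypothetical.

```python
import os


def find_suspect_wavs(dataset_dir, meta_file, min_bytes=1024):
    """Return (file_id, reason) pairs for wavs that are missing or very small."""
    problems = []
    with open(os.path.join(dataset_dir, meta_file), encoding="utf-8") as f:
        for line in f:
            # The first field of each row is assumed to be the wav file id.
            file_id = line.split("|", 1)[0].strip()
            if not file_id:
                continue
            wav_path = os.path.join(dataset_dir, "wavs", file_id + ".wav")
            if not os.path.isfile(wav_path):
                problems.append((file_id, "missing"))
            else:
                size = os.path.getsize(wav_path)
                if size < min_bytes:
                    problems.append((file_id, "only %d bytes" % size))
    return problems
```

An empty result only rules out missing or truncated files; it says nothing about whether the loader accepts them.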

georroussos · February 23, 2020, 1:10pm

No need to be sorry; the wav is 5 words long and a few KB in size. The problem isn’t the file itself; the whole dataset isn’t read (mine or LJSpeech). First, it complains about sentences that contain numbers. I then remove these sentences with a script (from both the csv file and the wavs). Then it complains about sentences with 1 word. I remove these and it complains about sentences with 2 words. I removed all sentences shorter than 5 words and it still complains, even though I have also set the minimum seq length to 0. The files are not read. The model was trained on the dev branch, which is also the one I have checked out. This is my config:

{
"github_branch":"* dev",
//"restore_path":"/data/rw/pit/keep/ljspeech-December-11-2019_04+32PM-ca49ae8/checkpoint_410000.pth.tar",
"github_branch":"* dev",
"model": "Tacotron2", // one of the model in models/
"run_name": "ljspeech-bn",
"run_description": "tacotron2 basline finetuned with BN prenet",

// AUDIO PARAMETERS
"audio":{
    // Audio processing parameters
    "trim_db": 0,
    "mulaw": true,
    "num_mels": 80,         // size of the mel spec frame. 
    "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectrogram frame.
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
    "frame_length_ms": 10.0,  // stft window length in ms.
    "frame_shift_ms": 2.0, // stft window hop-length in ms.
    "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "min_level_db": -100,   // normalization range
    "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
    "power": 1.5,           // value to sharpen wav signals after GL algorithm.
    "griffin_lim_iters": 60, // #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
    // Normalization parameters
    "signal_norm": true,    // normalize the spec values in range [0, 1]
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "do_trim_silence": true // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
},

// DISTRIBUTED TRAINING
"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

// TRAINING
"batch_size": 32,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
"eval_batch_size":16,   
"r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.  
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], // ONLY TACOTRON - set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled.  
"loss_masking": true,         // enable / disable loss masking against the sequence padding.

// VALIDATION
"run_eval": true,
"test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

// OPTIMIZER
"noam_schedule": false,
"grad_clip": 1.0,                // upper limit for gradients for clipping.
"epochs": 1000,                // total number of epochs to train.
"lr": 0.0001,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_decay": false,             // if true, Noam learning rate decaying is applied through training.
"wd": 0.000001,         // Weight decay weight.
"warmup_steps": 4000,          // Noam decay steps to increase the learning rate from 0 to "lr"
"seq_len_norm": false,

// TACOTRON PRENET
"memory_size": -1,              // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
"prenet_type": "bn",     // "original" or "bn".
"prenet_dropout": false,        // enable/disable dropout at prenet. 

// ATTENTION
"attention_type": "original",  // 'original' or 'graves'
"attention_heads": 5,          // number of attention heads (only for 'graves')
"attention_norm": "sigmoid",   // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
"windowing": false,            // Enables attention windowing. Used only in eval mode.
"use_forward_attn": false,      // if it uses forward attention. In general, it aligns faster.
"forward_attn_mask": false,    // Additional masking forcing monotonicity only in eval mode.
"transition_agent": false,     // enable/disable transition agent of forward attention.
"location_attn": true,        // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false,  // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.

// STOPNET
"stopnet": true,               // Train stopnet predicting the end of synthesis.
"separate_stopnet": true,     // Train stopnet separately if 'stopnet==true'. It prevents stopnet loss from influencing the rest of the model. It yields a better model, but it trains SLOWER.

// TENSORBOARD and LOGGING
"print_step": 25,       // Number of steps between training logs on the console.
"save_step": 10000,      // Number of training steps between saves of training stats and checkpoints.
"checkpoint": true,     // If true, it saves checkpoints per "save_step"
"tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging. 

// DATA LOADING
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
"num_loader_workers": 5,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4,    // number of evaluation data loader processes.
"batch_group_size": 0,  //Number of batches to shuffle after bucketing.
"min_seq_len": 1,       // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 150,     // DATASET-RELATED: maximum text length

// PATHS
"output_path": "/data/rw/pit/keep/",      // DATASET-RELATED: output path for all training outputs.
//"output_path": "/media/erogol/data_ssd/Models/runs/",

// PHONEMES
"phoneme_cache_path": "ljspeech_ph_cache",  // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": true,           // use phonemes instead of raw characters. It is suggested for better pronunciation.
"phoneme_language": "en-us",     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
"use_speaker_embedding": false,     // use speaker embedding to enable multi-speaker learning.
"style_wav_for_test": null,          // path to style wav file to be used in TacotronGST inference.
"use_gst": false,       // TACOTRON ONLY: use global style tokens

// DATASETS
"datasets":   // List of datasets. They all merged and they get different speaker_ids.
    [
        {
            "name": "ljspeech",
            "path": "/home/georgios_roussos/TTS/LJSpeech-1.1/",
            //"path": "/home/erogol/Data/LJSpeech-1.1",
            "meta_file_train": "metadata_train.csv",
            "meta_file_val": "metadata_val.csv"
        }
    ]

}
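As an aside on the `gradual_training` entry in this config, here is a rough sketch of how a `[first_step, r, batch_size]` schedule can be resolved for a given training step. It assumes the last stage whose `first_step` has been reached is the active one. That reading follows the config comment and is not copied from the TTS source. `active_stage` is a hypothetical helper and the step values in the usage lines are only illustrative.

```python
def active_stage(schedule, global_step):
    """Return (r, batch_size) for the last stage whose first_step <= global_step.

    Assumes `schedule` is sorted by first_step, as in the config above.
    """
    current = schedule[0]
    for stage in schedule:
        if global_step >= stage[0]:
            current = stage
    return current[1], current[2]


schedule = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]]

print(active_stage(schedule, 0))       # (7, 64)
print(active_stage(schedule, 60000))   # (3, 32)
print(active_stage(schedule, 410000))  # (1, 32)
```

Under this reading, the top-level `batch_size` and `r` values only matter when `gradual_training` is null, which is worth keeping in mind when resuming from a late checkpoint.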

I would really like to see how the model performs when re-trained on a new voice, because the performance it achieves on LJSpeech is very impressive and I am doing research on speaker adaptation.

georroussos · February 24, 2020, 12:09pm

Hi again, for anyone who might struggle with this: I found that config.json was the culprit; training works fine with the initial config from the branch. The settings I had changed in the checkpoint's config were do_trim_db to 60 (I don’t think it was there), adding do_trim_silence, and disabling prenet_dropout (I think). So it is worth experimenting with these if problems occur.
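To pin down which settings differ between two configs, a small diff helper along these lines can help. It is only a sketch. It assumes the configs are JSON with `//` line comments, as in the config posted earlier, so the comments are stripped before parsing. `strip_line_comments`, `flatten`, and `diff_configs` are made-up names, not TTS functions.

```python
import json


def strip_line_comments(text):
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"audio.trim_db": 0}."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def diff_configs(text_a, text_b):
    """Return {key: (value_in_a, value_in_b)} for every setting that differs."""
    a = flatten(json.loads(strip_line_comments(text_a)))
    b = flatten(json.loads(strip_line_comments(text_b)))
    missing = "<missing>"
    return {
        key: (a.get(key, missing), b.get(key, missing))
        for key in sorted(set(a) | set(b))
        if a.get(key, missing) != b.get(key, missing)
    }
```

Calling `diff_configs(open("config_a.json").read(), open("config_b.json").read())` (file names are placeholders) returns only the keys that differ, with `"<missing>"` marking a setting present in one file but not the other.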

Shall return with updates on performance with regard to tuning!