SVD Not Converging During Validation Phase

LegendBegins · October 3, 2020, 4:59pm

Hi,

I’m trying to fine tune this Tacotron 2 model using a voice from the libri_tts dataset. However, whenever the training gets to the validation phase, it raises the following error:
numpy.linalg.linalg.LinAlgError: SVD did not converge
I looked into it, and it appears that the basis for the Mel Spectrogram generated in audio.py’s _build_mel_basis function is not being inverted properly in _mel_to_linear. I tried isolating the variables and creating the matrix in a standalone Python file, where pinv was able to invert it without any issues. I’ve successfully fine-tuned one of the Tacotron 1 models with this dataset, but I can’t find a fix for this one. I’ve reproduced the functions below, along with my config file. The only major change I made was to the sample rate, to match the new data. I appreciate any and all advice.

def _mel_to_linear(self, mel_spec):
    inv_mel_basis = np.linalg.pinv(self._build_mel_basis())
    return np.maximum(1e-10, np.dot(inv_mel_basis, mel_spec))

def _build_mel_basis(self):
    n_fft = (self.num_freq - 1) * 2
    if self.mel_fmax is not None:
        assert self.mel_fmax <= self.sample_rate // 2
    return librosa.filters.mel(
        self.sample_rate,
        n_fft,
        n_mels=self.num_mels,
        fmin=self.mel_fmin,
        fmax=self.mel_fmax)

{
"github_branch":"* dev",
"restore_path":"A:\\Other\\Installations\\chatbot\\Speech_Synthesis\\TTS_Final_Load_Test\\TTS\\models\\best_model.pth.tar",
"github_branch":"* dev",
    "model": "Tacotron2",          // one of the model in models/  
    "run_name": "ljspeech-bn",
    "run_description": "tacotron2 basline finetuned with BN prenet",

    // AUDIO PARAMETERS
    "audio":{
        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame. 
        "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 24000,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50.0,  // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-length in ms.
        "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": 0.0,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,        // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true  // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
    },

    // DISTRIBUTED TRAINING
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54321"
    },

    "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 32,       // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
    "eval_batch_size":16,   
    "r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.  
    "gradual_training": null, //[[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], // ONLY TACOTRON - set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled.  
    "loss_masking": true,         // enable / disable loss masking against the sequence padding.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "noam_schedule": false,        // use noam warmup and lr schedule.
    "grad_clip": 1,                // upper limit for gradients for clipping.
    "epochs": 1000,                // total number of epochs to train.
    "lr": 0.0001,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_decay": false,             // if true, Noam learning rate decaying is applied through training.
    "wd": 0.000001,         // Weight decay weight.
    "warmup_steps": 4000,          // Noam decay steps to increase the learning rate from 0 to "lr"
    "seq_len_norm": false,         // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.

    // TACOTRON PRENET
    "memory_size": -1,              // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame. 
    "prenet_type": "bn",     // "original" or "bn".
    "prenet_dropout": false,        // enable/disable dropout at prenet. 

    // ATTENTION
    "attention_type": "original",  // 'original' or 'graves'
    "attention_heads": 5,          // number of attention heads (only for 'graves')
    "attention_norm": "sigmoid",   // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
    "windowing": false,            // Enables attention windowing. Used only in eval mode.
    "use_forward_attn": false,      // if it uses forward attention. In general, it aligns faster.
    "forward_attn_mask": false,    // Additional masking forcing monotonicity only in eval mode.
    "transition_agent": false,     // enable/disable transition agent of forward attention.
    "location_attn": true,        // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
    "bidirectional_decoder": false,  // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.

    // STOPNET
    "stopnet": true,               // Train stopnet predicting the end of synthesis. 
    "separate_stopnet": true,     // Train stopnet separately if 'stopnet==true'. It prevents stopnet loss from influencing the rest of the model. It gives a better model, but it trains SLOWER.

    // TENSORBOARD and LOGGING
    "print_step": 25,       // Number of steps to log training on console.
    "save_step": 10000,      // Number of training steps expected to save training stats and checkpoints.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging. 
    
    // DATA LOADING
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
    "num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "batch_group_size": 0,  //Number of batches to shuffle after bucketing.
    "min_seq_len": 6,       // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 150,     // DATASET-RELATED: maximum text length

    // PATHS
    "output_path": "A:\\Other\\Installations\\chatbot\\Speech_Synthesis\\TTS_Final_Load_Test\\TTS\\Outputs",      // DATASET-RELATED: output path for all training outputs.
    //"output_path": "/media/erogol/data_ssd/Models/runs/",
 
    // PHONEMES
    "phoneme_cache_path": "ljspeech_ph_cache",  // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true,           // use phonemes instead of raw characters. It is suggested for better pronunciation.
    "phoneme_language": "en-us",     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages
    // MULTI-SPEAKER and GST
    "use_speaker_embedding": false,     // use speaker embedding to enable multi-speaker learning.
    "style_wav_for_test": null,          // path to style wav file to be used in TacotronGST inference.
    "use_gst": false,       // TACOTRON ONLY: use global style tokens

    // DATASETS
    "datasets":   // List of datasets. They all merged and they get different speaker_ids.
        [
            {
                "name": "libri_tts",
                "path": "A:\\Other\\Installations\\chatbot\\Speech_Synthesis\\TTS_Old_2\\TTS\\RC_Voice_Source",
                //"path": "/home/erogol/Data/LJSpeech-1.1",
                "meta_file_train": null,
                "meta_file_val": null
            }
        ]

}

othiele · October 4, 2020, 1:23pm

What branch are you using? Most recent - and working - code is in the dev branch.

LegendBegins · October 4, 2020, 3:54pm

I was using the branch linked with the model (specifically this one). I ran a quick test with the latest version, and this model’s architecture appears to be incompatible with the most recent TTS. Is there a commit I missed that fixes this issue without breaking compatibility?

othiele · October 4, 2020, 6:17pm

As I said, don’t use anything but dev currently. The code you are referencing is 8 months old. As long as there is no release, stick to dev. I tried patching in between, it is a nightmare. I know it’s hard, throw everything away and start over.

LegendBegins · October 4, 2020, 7:12pm

It’s not a matter of throwing away work; switching between versions of TTS is only a few minutes of effort. I’m using a pretrained model and fine-tuning it, and those models are only compatible with the architecture that was present when they were released. In other words, if I switch to the most recent version of the codebase, I would have to scrap this model, which I didn’t generate and don’t have the means to generate, because each model is locked to the version of Mozilla TTS it was created in. The models in the wiki link to the branch they were created under because they usually stop working after a significant update to the architecture.

othiele · October 4, 2020, 7:24pm

I understand completely and I meant exactly that: throwing away your previous model.

As @erogol is already in another role at Mozilla, he doesn’t have much time to support here, and I don’t see why code that trained before wouldn’t train now. Did continuing a training run work back then? Maybe try the old data to see whether this still happens, or post some more info, and maybe erogol has a good answer. But it is old code and a strange error.

sanjaesc · October 4, 2020, 10:15pm

Did you try fine-tuning this model?

LegendBegins · October 5, 2020, 6:54am

I just loaded that model and its corresponding version, and it raises the same exception.

othiele · October 5, 2020, 8:51am

Please give some more info. I guess you haven’t used your own model? Then this would suggest that the error stems from one of your dependencies. Try to load everything in Google Colab, which would be shareable so we can help, or use @nmstoker’s GatherUp.

erogol · October 5, 2020, 12:21pm

I think SVD is run to estimate the inverse mel filters that convert mel-specs to linear-specs before running the GL algorithm. I’ve never encountered that personally. It is more likely to be about the numpy or scipy version. Please try to reinstall the environment using the requirements.txt file, and try different versions if that does not work.
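
When trying different dependency versions, it can help to record exactly what is installed in each trial. Below is a minimal sketch that uses only the standard library; the helper name report_versions and the package list are illustrative assumptions, not part of TTS.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages mentioned in this thread; adjust the list as needed.
    for name, version in report_versions(["numpy", "scipy", "librosa", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output alongside each failing run makes it easier to compare trials and to share the environment with others.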

LegendBegins · October 5, 2020, 5:35pm

@othiele
Correct, I’m not training a model from scratch, and I don’t have the data necessary to create one (which is why I’m trying to fine tune an already successful model). This dataset worked on one of the Tacotron 1 models as well. I’d be more than glad to run the tool, but keep in mind that I’ve tried well over a dozen dependency combinations, so it might not provide a comprehensive view of what I’ve loaded in previous trials.

@erogol
I reinstalled the environment from scratch and iterated through versions of scipy, librosa, and numpy (down to 1.15.0), all of which generate the same error. I tried modifying torch’s version for good measure as well.
Looking at the _mel_to_linear function, the error arises when it tries to pseudo-invert the mel_basis, which is the following:
librosa.filters.mel(self.sample_rate, n_fft, n_mels=self.num_mels, fmin=self.mel_fmin, fmax=self.mel_fmax)
Substituting out my own parameters results in this:
librosa.filters.mel(24000, 2048, n_mels=80, fmin=0, fmax=8000.0)
I’m not familiar with the internals of this function, but given that it’s provided constants (and that these constants generate a valid p-invertible matrix in a standalone Python file), I assume that the output of librosa.filters.mel is dependent on something else in the program that isn’t passed as an argument. If not, I’m at a loss.
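
One way to narrow this down is to inspect the matrix at the moment it is pseudo-inverted inside the training process, rather than in a standalone file. The sketch below assumes only numpy; the helper names checked_pinv and describe are hypothetical and are not part of TTS or librosa. Non-finite values (NaN or Inf) in the input are a common trigger for "SVD did not converge", but that is an assumption to test here, not a confirmed cause of this error.

```python
import numpy as np


def checked_pinv(matrix, name="matrix"):
    """Pseudo-invert `matrix`, failing early with a clear message on NaN/Inf."""
    matrix = np.asarray(matrix, dtype=np.float64)
    bad = ~np.isfinite(matrix)
    if bad.any():
        raise ValueError(
            f"{name} has {int(bad.sum())} non-finite value(s); shape={matrix.shape}"
        )
    return np.linalg.pinv(matrix)


def describe(array, name):
    """Print shape, dtype, and finiteness, for comparison with a standalone run."""
    array = np.asarray(array)
    print(name, array.shape, array.dtype, "finite:", bool(np.isfinite(array).all()))
```

For example, temporarily calling describe(self._build_mel_basis(), "mel_basis") and describe(mel_spec, "mel_spec") at the top of _mel_to_linear, and swapping np.linalg.pinv for checked_pinv, would show whether the values reaching the SVD during validation differ from the ones in the standalone test.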