[Private dataset - Portuguese] Expecting healthier results at 10k+ steps

Hi everyone!

First of all, thank you for the great TTS implementation. I was able to start training on my dataset without any problems.

Straight to the point: I’m using a private Portuguese dataset formatted like LJSpeech, and my results on TensorBoard are anything but healthy:

A bit about config.json (only the parameters I changed from the master branch):

  • model: Tacotron2
  • sample_rate: 44100
  • mel_fmin: 95.0
  • characters: “AÁÂÃÀBCÇDEÉÊFGHIÍJKLMNOÓÔÕPQRSTUÚVWXYZaáâãàbcçdeéêfghiíjklmnoóôpqrstuúvwxyz!’(),-.:;?”
  • phonemes: “ãẽĩõiyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ”
  • gradual_training: [[0, 7, 32], [1, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]]
  • phoneme_language: pt-br

I’m binge-reading posts from Discourse and I saw somewhere that, with gradual training, the alignment graph should start to look like a diagonal somewhere around 10k steps. Any ideas about what could be wrong? (Or maybe I just need more training.)

How many phrases/wavs are in your dataset and did you run the “dataset_analysis” (notebooks and scripts)?
You may want to convert your wav files to a lower sampling rate, e.g. 22050, and try toggling (set to true if false and vice versa) the parameters trim_silence and forward_attention.
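For the conversion, something along these lines is enough (a rough sketch, assuming a flat folder of wavs and that librosa/soundfile are available; folder names are placeholders):

```python
# Rough sketch: convert all wavs in a folder to 22050 Hz.
# Folder names are placeholders; librosa/soundfile are just one way to do it.
import glob
import os

import librosa
import soundfile as sf

SRC_DIR = "wavs_44100"   # original 44.1 kHz files (assumed location)
DST_DIR = "wavs_22050"   # resampled copies go here
TARGET_SR = 22050

os.makedirs(DST_DIR, exist_ok=True)
for path in glob.glob(os.path.join(SRC_DIR, "*.wav")):
    audio, _ = librosa.load(path, sr=TARGET_SR)  # load and resample in one step
    sf.write(os.path.join(DST_DIR, os.path.basename(path)), audio, TARGET_SR)
```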

1 Like

Thank you for your answer, @dkreutz!

The dataset consists of 8286 phrases/wavs and, no, I didn’t run “dataset_analysis”. I will do that now, in conjunction with the analysis of the spectrograms (results will be posted here). Could you explain why toggling both trim_silence and forward_attention between true/false can help?

Update (dataset_analysis execution): AnalyzeDatasetOut.zip (84.1 KB)
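For anyone who wants to reproduce the length check without the notebook, a quick sketch along these lines works (it assumes an LJSpeech-style metadata.csv with the transcription in the last pipe-separated column):

```python
# Quick check of phrase lengths in an LJSpeech-style metadata.csv.
# Assumes pipe-separated rows with the transcription in the last column.
lengths = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        text = line.rstrip("\n").split("|")[-1]
        lengths.append(len(text))

print("phrases:", len(lengths))
print("min / mean / max chars:", min(lengths), sum(lengths) // len(lengths), max(lengths))
print("phrases over 200 chars:", sum(1 for n in lengths if n > 200))
```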

Have you made any progress? If so, could you share the parameters that worked?

@kms I made the changes suggested by @dkreutz (converted the audio to a lower sample rate, set do_trim_silence to False and forward attention to True). I also changed attention_norm to “softmax”.

All this led to better graphs, but the model is not capable of generalizing yet, and the generated audio quality is not good (the audio “explodes” in the silences).

You are probably a little bit too impatient. In my experience, you will see proper alignment and the first meaningful audio at step 15-20k at the earliest.

1 Like

You are right, @dkreutz :sweat_smile:

I’ve trained other voice synthesis models before, and by this time (70K steps) I’d already have something to listen to - although I know this is not a good enough metric for comparison. I think my impatience comes from the fact that, if there’s something wrong with my training, I’d rather stop it “fast” and start testing another hypothesis.

The FAQ points out: “Check model spectrograms. Especially training outputs should converge to ground truth after 10K iterations.” At 70k iterations I still have this:

Some test audios to illustrate what I’m saying: test_audios.zip (1.6 MB)

I guess for now I’ll just wait a bit more while I try to find out what could be causing the problems (maybe it’s something DSP-related).

Thank you again! Your suggestions really helped!

I had another look at your dataset analysis. It seems like you have quite long phrases with up to 800 characters? That might cause the attention/stop problems you currently have.
Maybe @erogol can have a look and comment on it? (The dataset analysis is attached 3 posts above.)

1 Like

What do you mean by the model exploding at silences? The random noise at the end of the spectrogram?

Yes, @erogol, exactly that. I’ve set do_trim_silence to false, so I imagine it could be due to that. What really concerns me is the inability to generalize. Attention seems ok to me, so I have no idea why this is happening.

Generated audio samples (the voice is just making sounds that resemble words): test_audios.zip (2.1 MB)

The content of test_sentences is also converted to phonemes before being fed to the model, right? The “mumbling” makes no sense to me.

Hi guys! I just noticed that, since the first steps of training, the message Decoder stopped with max_decoder_steps has been popping up when synthesizing the test sentences. As @dkreutz pointed out, my dataset contains long phrases (since I’ll need to synthesize long sentences with the model). In that case, what would be the best approach to get a successful model? I mean, would a model trained with short sentences be capable of synthesizing longer ones? I also didn’t quite understand what would happen if I increased max_decoder_steps. Could long silences or noise be generated after the synthesized sentence? Thank you again for your time and help.

Can you give examples of „big sentences“?

Sure! Some examples from dataset:

  • Assim que chegara à casa daquela senhora fora informada de que aquele senhor estivera lá uma hora antes; ao ver que ela não estava e que não voltaria logo, deixou um pequeno pacote enviado por uma das irmãs dele e foi embora.
  • Não posso imaginar que o jovem que vi conversando com você no outro dia consiga expressar-se tão bem contando apenas com a própria capacidade e, no entanto, este não é o estilo de uma mulher.

Examples of phrases I’d like to synthesize:

  • Neste sentido, a complexidade dos estudos efetuados ainda não demonstrou convincentemente que vai participar na mudança de alternativas às soluções ortodoxas.
  • O cuidado em identificar pontos críticos na adoção de políticas descentralizadoras faz parte de um processo de gerenciamento do retorno esperado a longo prazo.

I imagine that batch synthesis could be a solution to this, but I fear that the delay in the synthesis time would be too long.
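For illustration, one way such chunked synthesis might look is sketched below. Here synthesize is a hypothetical stand-in for the actual TTS inference call and is assumed to return a float32 waveform; the sample rate and pause length are also assumptions.

```python
# Illustration only: split a long input on sentence boundaries, synthesize each
# piece separately and stitch the waveforms together with short pauses.
# `synthesize` is a hypothetical stand-in for the actual TTS inference call.
import re

import numpy as np

def synthesize_long(text, synthesize, sample_rate=22050, pause_sec=0.3):
    # split after sentence-ending punctuation followed by whitespace
    pieces = [p.strip() for p in re.split(r"(?<=[.;!?])\s+", text) if p.strip()]
    silence = np.zeros(int(pause_sec * sample_rate), dtype=np.float32)
    chunks = []
    for piece in pieces:
        chunks.append(synthesize(piece))  # assumed to return a float32 waveform
        chunks.append(silence)
    return np.concatenate(chunks) if chunks else silence
```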

According to this you could try to increase max_decoder_steps.
But again - the dataset analysis shows that you have phrases much longer than 200 characters. Try to keep the maximum below 200 during initial training. You can fine-tune your model with longer phrases once it has aligned…
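One simple way to do that is to write a reduced metadata file for the first training run and point the config at it. A minimal sketch, assuming the LJSpeech-style layout with the text in the last pipe-separated column (file names are placeholders):

```python
# Sketch: write a reduced metadata file that keeps only phrases up to 200
# characters for the initial training run. File names are placeholders; the
# original metadata.csv stays untouched for later fine-tuning on long phrases.
MAX_CHARS = 200

with open("metadata.csv", encoding="utf-8") as src, \
        open("metadata_short.csv", "w", encoding="utf-8") as dst:
    for line in src:
        text = line.rstrip("\n").split("|")[-1]
        if text and len(text) <= MAX_CHARS:
            dst.write(line)
```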

1 Like