Fine-Tuning Trained Model to New Dataset

Hey All!

Has anyone tried fine-tuning a pre-trained Tacotron2 model on a new, smaller dataset? What results are you getting? What do's and don'ts have you discovered?

This approach is also called adapting the model, speaker adaptation, or transfer learning.

I’m currently fine-tuning a pre-trained Tacotron2 on a 15-hour in-house dataset. After 10K iterations, I can hear the new voice (not the original LJSpeech one), but the model can only produce the first few words of a sentence. Nevertheless, the amount of progress in just 10K iterations is promising.

I’m thinking that the learning rate will be very important here… and possibly “freezing” certain model layers. Does anyone have experience with this?
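For concreteness, here's the kind of freezing I have in mind (a minimal PyTorch sketch on my side; the "embedding"/"encoder" name prefixes are assumptions and may not match the actual Tacotron2 module, so check model.named_parameters() first):

import torch

def freeze_text_side(model, lr=1e-5):
    """Freeze embedding/encoder weights and return an optimizer over the rest."""
    for name, param in model.named_parameters():
        # prefixes are assumptions; inspect your model's parameter names first
        if name.startswith("embedding") or name.startswith("encoder"):
            param.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)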

Best,
Josh

How did this experiment end up?

Hello, I’m training a Spanish model with 60h of (not yet cleaned) data, using the checkpoint from 824c091.

Here’s the alignment for Tacotron2 at 280k steps:

Looks like it’s having a hard time with the silences (from punctuation).

Test audio examples:
examples-tux.zip (331.8 KB)

I’ll upload the events log soon. I’m still cleaning the TTS dataset with a Spanish DeepSpeech model, which has already yielded 30h of clean speech; I’ll fine-tune it and see if I can get more reviewed transcriptions.
My plan is to publish the dataset under public domain. Any suggestions on where to release it?

The speaker: https://librivox.org/reader/3946?primary_key=3946&search_category=reader&search_page=1&search_form=get_results

I still need to test WaveRNN; I’ll keep posting updates.

To everyone involved, thanks for the amazing work!

I canceled the training. I see that the silence at the end is the issue with the punctuation. Silence trimming is enabled, so I isolated the trimming code to test it, and yes, it is working: the silence is being removed. Any ideas?
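For reference, this is roughly how I isolated the trimming step to test it (a minimal sketch with librosa; the top_db value here is just a guess, the repo's audio settings may differ):

import librosa

wav, sr = librosa.load("sample.wav", sr=22050)
trimmed, _ = librosa.effects.trim(wav, top_db=45)  # drop leading/trailing silence
print(len(wav) / sr, "->", len(trimmed) / sr, "seconds")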

It’s been slow going so far, @erogol.

I took the Tacotron2 model trained on LJSpeech from 824c091 and began fine-tuning on an in-house ~15-hour (male, professionally recorded) dataset.

At first, I fine-tuned all the weights on the new data with a learning rate of 0.00001 and "lr_decay": false. This led to no real learning and garbage output.

You can see in the graph below that both the training and eval loss curves flatline quickly:

LOSS:
   0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         +       +        +       +       +       +        +       +       +   
  0.0035 A-+                                 eval Loss == A              +-+   
         |                                   train Loss == B               |   
         |                                                                 |   
   0.003 +-+                                                             +-+   
         |                                                                 |   
  0.0025 +-+                                                             +-+   
         |                                                                 |   
   0.002 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0015 A-+                                                             +-+   
         AA                                                                |   
   0.001 +AA                                                             +-+   
         BBAAAAAAAAAAAAAA A                                                |   
         |BBBBBBBBBBBB  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      |   
  0.0005 +-+      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB    +-+   
         +       +        +       +       +       +        +       +       +   
       0 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         0       50      100     150     200     250      300     350     400
                                              ITERATION #

Then I tried fine-tuning only the postnet from the original pre-trained model, with not-great results:

LOSS:
   0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         +       +        +       +       +       +        +       +       +   
         |                                               eval Loss  A      |   
  0.0035 +-+                                             train Loss  B   +-+   
         |                                                                 |   
   0.003 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0025 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
   0.002 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0015 +-A                                                             +-+   
         |BAA                                                              |   
   0.001 +-+ AAAAAA                                                      +-+   
         | B       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   A                |   
         +  BBBBBBBBBBBBBBBBB     +       +       +    AAA AAAAAAAAAAAA    +   
  0.0005 +-+-----+--------B--BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB--+-+   
         0       10       20      30      40      50       60      70      80  
                                              ITERATION #

Then I fine-tuned just the postnet and decoder:

LOSS:
  0.012 +A+-----+--------+-------+--------+-------+-------+--------+-----+-+   
        +       +        +       +        +       +       +        +       +   
        |                                                  eval Loss  A    |   
   0.01 +-+                                             train Loss    B  +-+   
        |                                                                  |   
        |                                                                  |   
  0.008 +-+                                                              +-+   
        |                                                                  |   
        |                                                                  |   
  0.006 +-+                                                              +-+   
        | A                                                                |   
        |                                                                  |   
        |                                                                  |   
  0.004 +B+                                                              +-+   
        |                                                                  |   
        |  A                                                               |   
  0.002 +-BAAA                                                           +-+   
        |  BB AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  A                   |   
        +    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAABAAAAAAAAAAAAAAAAA  +   
      0 +-+-----+--------+-------+--------+-------+-------+--------+-----+-+   
        0       10       20      30       40      50      60       70      80  
                                              ITERATION #

Finally, I took the above model (i.e. LJSpeech with fine-tuned postnet and decoder) and fine-tuned everything from the attention layers up (attention included):

LOSS:
  0.00095 +-+--------+----------+----------+---------+----------+--------+-+   
          A          +          +          +         +          +          +   
   0.0009 AAA                                        eval Loss        A  +-+   
          |AAAA                                     train Loss        B    |   
  0.00085 +-+ AAAAA                                                      +-+   
          |     AAAAAAAA                                                   |   
   0.0008 +-+        AAAAAAA                                             +-+   
          |              AAAAAAAAA                                         |   
  0.00075 +-+                 AAAAAAAAAAAAAAA                            +-+   
          |                              AAAAAAAAAAAAA AAA                 |   
          |                                       A AAAAAAAAAAAAAAAAAAAA   |   
   0.0007 +-+                                                A  A AAAAAAA+-+   
          |                                                                |   
  0.00065 +-+                                                            +-+   
          BBBB                                                             |   
   0.0006 BBBBBBB B                                                      +-+   
          |B BBBBBBBBBBBBBBBB BBB   B  BB                                  |   
  0.00055 +-+     BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  BBB  BBBB BBBBB BB+-+   
          +          +   BBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  +   
   0.0005 +-+--------+----------+--------B-+----B-BB-BBB-BBB--BBBBBBBBBBB+-+   
          0         100        200        300       400        500        600  
                                              ITERATION #

Here’s the same graph as above, but in more detail on TensorBoard:

This last model was training pretty steadily, but slowly, and with bad attention alignment.

I’ve found some bad transcripts that I’m going to fix (fewer than 500 out of 20,000), and then I’m going to fine-tune this most recent checkpoint, updating the encoder as well.
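Concretely, the plan is something like this (just a sketch; the "model" key in the checkpoint dict and the "embedding" prefix are assumptions on my part, double-check against the actual checkpoint):

import torch

def restore_and_unfreeze(model, checkpoint_path, lr=1e-5):
    """Load the last fine-tuned checkpoint and unfreeze everything except the embedding."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])  # checkpoint layout is an assumption
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith("embedding")
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)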

My original thought was to not touch the embedding layer or encoder, since those hold the text-based representation and I just want to change the voice for the same language (English). However, I read in the Deep Voice 2 paper (funnily enough, they made a Tacotron2-like model before Google!) that they injected speaker-specific info into the decoder only, and that performed badly. They had to add that information to the encoder to get good multi-speaker results.
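For context, here's roughly what "adding speaker info to the encoder" could look like (a toy PyTorch sketch, not the Deep Voice 2 or this repo's implementation; module and dimension names are placeholders):

import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    def __init__(self, num_speakers, speaker_dim, encoder_dim):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)
        self.project = nn.Linear(encoder_dim + speaker_dim, encoder_dim)

    def forward(self, encoder_outputs, speaker_ids):
        # encoder_outputs: [batch, time, encoder_dim]; speaker_ids: [batch]
        spk = self.speaker_embedding(speaker_ids)  # [batch, speaker_dim]
        spk = spk.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        # concatenate the speaker vector to every encoder timestep, then project back
        return self.project(torch.cat([encoder_outputs, spk], dim=-1))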

How do you use DeepSpeech to clean the TTS dataset? Do you give it the audio and check the generated transcript?

Is the dataset multi-speaker?

I can tell that feeding the speaker embedding only to the encoder does not work perfectly either. The alignment gets broken for some speakers. I guess the better way is to add the embedding to both the encoder and the decoder. I am also working on enabling multi-speaker training.

PS: What library prints the loss plots in the terminal? That looks cool!

I’m using Windows Speech Recognition to find the audio locations of the sentences generated with spaCy; then, with the obtained audio/transcription pairs, I use DeepSpeech to mark as valid the ones where the transcription matches. I never use the related text for the LM, since that can bias the results.
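In pseudocode, the validation step is roughly this (a sketch; transcribe() is a placeholder for whatever STT call you use, e.g. a DeepSpeech model, and the exact-match check is stricter than a WER threshold):

import re

def normalize(text):
    # lowercase and strip punctuation so "Hola, mundo." matches "hola mundo"
    return re.sub(r"[^a-z0-9áéíóúñü ]", "", text.lower()).strip()

def mark_valid(pairs, transcribe):
    """pairs: list of (wav_path, expected_text); transcribe: wav_path -> recognized text."""
    valid = []
    for wav_path, expected in pairs:
        if normalize(transcribe(wav_path)) == normalize(expected):
            valid.append((wav_path, expected))
    return valid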

Now, using only the cleaned audio with the checkpoint (Tacotron2 at 260k), it looks better. The punctuation is starting to work properly; I still need to find the source of the silence at the end.
280710.zip (471.6 KB)

Hi @erogol – the dataset isn’t multi-speaker, no.

However, it seems there were some misaligned <utterance, transcript> pairs in the data; fixing those has improved things.

I’m interested in the multi-speaker approach, though. I think it could be a nice base model for adapting to new speakers when you have about 30 minutes of data. Thoughts?

Hey @carlfm01, can you explain this issue some more? I see similar padding, but I thought this was normal for batch training, no?

@erogol, am I missing something?

Hello, I don’t know if it is an issue; even with the cleaned set and silence trimmed, I still see the padding.

Same question here.

Sometimes the padding goes away, but as soon as it appears again the alignment cracks.

It is about batching.

@erogol – Is there anything in particular to optimize for padding? Any hyperparameters? I guess padding will be less with smaller batches, because the variation within a batch will be smaller.

Batching is done on similar-length audio files, correct? That is, the examples within a single batch are more similar in length than a random sample would be.

There is only batch_group_size to play with. As you increase it, you'd see more padding, since the shuffle is performed within groups of similar-length sentences and a larger group mixes more lengths into each batch.
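In pseudocode, the idea is roughly this (a simplified sketch, not the exact sampler code in the repo):

import random

def make_batches(items, batch_size, batch_group_size):
    items = sorted(items, key=len)  # sort by length first
    group = batch_group_size * batch_size
    if group > 0:
        for start in range(0, len(items), group):
            chunk = items[start:start + group]
            random.shuffle(chunk)  # shuffle only inside one group
            items[start:start + group] = chunk
    # a bigger group mixes more lengths into each batch, hence more padding
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]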
