It’s been slow going so far, @erogol.
I took the Tacotron2 model trained on LJSpeech from 824c091 and began fine-tuning it on an in-house ~15-hour dataset (male speaker, professionally recorded).
At first, I fine-tuned all the weights on the new data with a learning rate of 0.00001 and "lr_decay": false. This led to no real learning and garbage output.
You can see in the graph below that both the training and eval loss curves flatline quickly:
LOSS:
   0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         +       +        +       +       +       +        +       +       +   
  0.0035 A-+                                 eval Loss == A              +-+   
         |                                   train Loss == B               |   
         |                                                                 |   
   0.003 +-+                                                             +-+   
         |                                                                 |   
  0.0025 +-+                                                             +-+   
         |                                                                 |   
   0.002 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0015 A-+                                                             +-+   
         AA                                                                |   
   0.001 +AA                                                             +-+   
         BBAAAAAAAAAAAAAA A                                                |   
         |BBBBBBBBBBBB  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      |   
  0.0005 +-+      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB    +-+   
         +       +        +       +       +       +        +       +       +   
       0 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         0       50      100     150     200     250      300     350     400
                                              ITERATION #
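For reference, the setup for that first run looked roughly like this. This is a minimal PyTorch sketch, not the repo's actual training code: the model here is a stand-in for Tacotron2, and the checkpoint path is a placeholder. The key points it illustrates are loading the pre-trained weights, the fixed 1e-5 learning rate, and the absence of any LR scheduler (what "lr_decay": false amounts to).

```python
import torch
from torch import nn, optim

# Tiny stand-in for the Tacotron2 model (placeholder, not the repo's class).
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# In the real run, pre-trained (LJSpeech) weights are restored first;
# path and key are illustrative only:
# checkpoint = torch.load("checkpoint_824c091.pth")
# model.load_state_dict(checkpoint["model"])

# Fixed low learning rate; no scheduler attached ("lr_decay": false).
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

# One dummy fine-tuning step on a toy batch.
mel_in = torch.randn(4, 80)
mel_target = torch.randn(4, 80)

optimizer.zero_grad()
loss = criterion(model(mel_in), mel_target)
loss.backward()
optimizer.step()
```

With all weights trainable and a rate this low, the loss curves above suggest the model settled into a flat region almost immediately.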
Then I tried fine-tuning only the postnet of the original pre-trained model, with disappointing results:
LOSS:
   0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+   
         +       +        +       +       +       +        +       +       +   
          |                                              eval Loss   A      |   
  0.0035 +-+                                            train Loss   B    +-+   
         |                                                                 |   
   0.003 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0025 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
   0.002 +-+                                                             +-+   
         |                                                                 |   
         |                                                                 |   
  0.0015 +-A                                                             +-+   
         |BAA                                                              |   
   0.001 +-+ AAAAAA                                                      +-+   
         | B       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   A                |   
         +  BBBBBBBBBBBBBBBBB     +       +       +    AAA AAAAAAAAAAAA    +   
  0.0005 +-+-----+--------B--BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB--+-+   
         0       10       20      30      40      50       60      70      80  
                                              ITERATION #
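The "postnet only" run (and the "postnet + decoder" run after it) comes down to freezing every other parameter before building the optimizer. A generic way to do that in PyTorch looks like the sketch below; the helper and the attribute names (`encoder`, `decoder`, `postnet`) are illustrative, not the repo's actual module names:

```python
import torch
from torch import nn

def freeze_except(model: nn.Module, trainable_prefixes: tuple) -> list:
    """Freeze all parameters except those whose names start with one of the
    given prefixes; return the still-trainable params for the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Toy model standing in for Tacotron2 (submodule names are illustrative).
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)
        self.postnet = nn.Linear(8, 8)

model = Toy()
params = freeze_except(model, ("postnet",))              # postnet-only run
# params = freeze_except(model, ("postnet", "decoder"))  # postnet + decoder run
optimizer = torch.optim.Adam(params, lr=1e-5)
```

Passing only the unfrozen parameters to the optimizer also keeps Adam from accumulating state for the frozen ones.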
Then I fine-tuned just the postnet and decoder:
LOSS:
  0.012 +A+-----+--------+-------+--------+-------+-------+--------+-----+-+   
        +       +        +       +        +       +       +        +       +   
        |                                                  eval Loss  A    |   
   0.01 +-+                                             train Loss    B  +-+   
        |                                                                  |   
        |                                                                  |   
  0.008 +-+                                                              +-+   
        |                                                                  |   
        |                                                                  |   
  0.006 +-+                                                              +-+   
        | A                                                                |   
        |                                                                  |   
        |                                                                  |   
  0.004 +B+                                                              +-+   
        |                                                                  |   
        |  A                                                               |   
  0.002 +-BAAA                                                           +-+   
        |  BB AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  A                   |   
        +    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAABAAAAAAAAAAAAAAAAA  +   
      0 +-+-----+--------+-------+--------+-------+-------+--------+-----+-+   
        0       10       20      30       40      50      60       70      80  
                                              ITERATION #
Finally, I took the above model (i.e. LJSpeech with fine-tuned postnet and decoder) and continued fine-tuning, this time also updating the attention layers:
LOSS:
  0.00095 +-+--------+----------+----------+---------+----------+--------+-+   
          A          +          +          +         +          +          +   
   0.0009 AAA                                        eval Loss        A  +-+   
          |AAAA                                     train Loss        B    |   
  0.00085 +-+ AAAAA                                                      +-+   
          |     AAAAAAAA                                                   |   
   0.0008 +-+        AAAAAAA                                             +-+   
          |              AAAAAAAAA                                         |   
  0.00075 +-+                 AAAAAAAAAAAAAAA                            +-+   
          |                              AAAAAAAAAAAAA AAA                 |   
          |                                       A AAAAAAAAAAAAAAAAAAAA   |   
   0.0007 +-+                                                A  A AAAAAAA+-+   
          |                                                                |   
  0.00065 +-+                                                            +-+   
          BBBB                                                             |   
   0.0006 BBBBBBB B                                                      +-+   
          |B BBBBBBBBBBBBBBBB BBB   B  BB                                  |   
  0.00055 +-+     BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  BBB  BBBB BBBBB BB+-+   
          +          +   BBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  +   
   0.0005 +-+--------+----------+--------B-+----B-BB-BBB-BBB--BBBBBBBBBBB+-+   
          0         100        200        300       400        500        600  
                                              ITERATION #
The same run is shown in more detail on TensorBoard.
This last model was training steadily, but slowly, and with poor attention alignment.
I’ve found some bad transcripts that I’m going to fix (fewer than 500 out of 20,000), and then I’ll fine-tune this most recent checkpoint, this time updating the encoder as well.
My original thought was to leave the embedding layer and encoder untouched, since they form a text-based representation and I only want to change the voice, not the language (still English). However, I read in the Deep Voice 2 paper (funny enough, they made a Tacotron 2 before Google did!) that injecting speaker-specific information into the decoder alone performed badly; they had to add that information to the encoder as well to get good multi-speaker results.
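The Deep Voice 2 finding can be sketched as follows: a learned per-speaker embedding is added to the encoder outputs (broadcast over time), rather than being fed only to the decoder. Everything here is an illustrative toy, not the paper's or the repo's actual architecture; names, dimensions, and the choice of additive (vs. concatenated) conditioning are all assumptions:

```python
import torch
from torch import nn

class SpeakerConditionedEncoder(nn.Module):
    """Toy encoder that adds a per-speaker embedding to every encoder
    timestep, i.e. conditions the encoder rather than only the decoder.
    All dimensions are illustrative."""
    def __init__(self, num_speakers=4, hidden=16):
        super().__init__()
        self.encoder = nn.GRU(input_size=8, hidden_size=hidden, batch_first=True)
        self.speaker_emb = nn.Embedding(num_speakers, hidden)

    def forward(self, text_feats, speaker_id):
        enc_out, _ = self.encoder(text_feats)            # (B, T, hidden)
        spk = self.speaker_emb(speaker_id).unsqueeze(1)  # (B, 1, hidden)
        return enc_out + spk                             # broadcast over time

enc = SpeakerConditionedEncoder()
out = enc(torch.randn(2, 5, 8), torch.tensor([0, 3]))    # shape (2, 5, 16)
```

For a single-speaker fine-tune like mine this isn't strictly needed, but it suggests the encoder does carry voice-relevant information, which supports unfreezing it in the next run.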