It’s been slow going so far, @erogol.
I took the Tacotron2 model trained on LJSpeech from commit 824c091 and began fine-tuning it on an in-house ~15-hour dataset (male speaker, professionally recorded).
At first, I fine-tuned all the weights on the new data with a learning rate of 0.00001 and `"lr_decay": false`. This led to no real learning and garbage output.
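For reference, this whole-model setup is just the standard PyTorch fine-tuning recipe. A minimal sketch, not the repo's exact training code (the checkpoint filename and the `"model"` key are assumptions about the checkpoint layout):

```python
import torch
from torch import nn, optim

# Stand-in for the repo's Tacotron2 instance; in practice, build it from
# the config the same way the training script does.
model: nn.Module = nn.Sequential(nn.Linear(80, 80))

# Restore the LJSpeech weights from the 824c091 checkpoint
# (filename and the "model" key are assumptions).
checkpoint = torch.load("checkpoint_824c091.pth.tar", map_location="cpu")
model.load_state_dict(checkpoint["model"], strict=False)

# Fine-tune every weight: small constant LR, no scheduler ("lr_decay": false).
optimizer = optim.Adam(model.parameters(), lr=1e-5)
```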
You can see in the graph below that both the training and eval loss curves flatline quickly:
LOSS:
```
0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+
+ + + + + + + + +
0.0035 A-+ eval Loss == A +-+
| train Loss == B |
| |
0.003 +-+ +-+
| |
0.0025 +-+ +-+
| |
0.002 +-+ +-+
| |
| |
0.0015 A-+ +-+
AA |
0.001 +AA +-+
BBAAAAAAAAAAAAAA A |
|BBBBBBBBBBBB AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA |
0.0005 +-+ BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +-+
+ + + + + + + + +
0 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+
0 50 100 150 200 250 300 350 400
                                 ITERATION #
```
Then I tried fine-tuning only the `postnet` from the original pre-trained model, with poor results, shown in the plot below.
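Freezing everything except the `postnet` in PyTorch comes down to flipping `requires_grad` by parameter name before building the optimizer. A rough sketch, reusing `model` from above (the `postnet` attribute name is an assumption; check it against `model.named_parameters()`):

```python
from torch import nn, optim

def freeze_all_but(model: nn.Module, prefixes: tuple[str, ...]) -> list[nn.Parameter]:
    """Freeze every parameter whose dotted name does not start with one
    of `prefixes`; return the parameters left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Postnet-only fine-tuning: the optimizer only ever sees postnet weights.
optimizer = optim.Adam(freeze_all_but(model, ("postnet",)), lr=1e-5)
```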
LOSS:
```
0.004 +-+-----+--------+-------+-------+-------+--------+-------+-----+-+
+ + + + + + + + +
| eval Loss A |
0.0035 +-+                       train Loss    B                        +-+
| |
0.003 +-+ +-+
| |
| |
0.0025 +-+ +-+
| |
| |
0.002 +-+ +-+
| |
| |
0.0015 +-A +-+
|BAA |
0.001 +-+ AAAAAA +-+
| B AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A |
+ BBBBBBBBBBBBBBBBB + + + AAA AAAAAAAAAAAA +
0.0005 +-+-----+--------B--BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB--+-+
0 10 20 30 40 50 60 70 80
                                 ITERATION #
```
Then I fine-tuned just the `postnet` and `decoder`, as sketched below.
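This run is the same helper with one more prefix (again assuming the decoder's attribute is literally named `decoder`):

```python
# Postnet + decoder trainable; embedding and encoder stay frozen.
optimizer = optim.Adam(freeze_all_but(model, ("postnet", "decoder")), lr=1e-5)
```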
LOSS:
```
0.012 +A+-----+--------+-------+--------+-------+-------+--------+-----+-+
+ + + + + + + + +
| eval Loss A |
0.01 +-+ train Loss B +-+
| |
| |
0.008 +-+ +-+
| |
| |
0.006 +-+ +-+
| A |
| |
| |
0.004 +B+ +-+
| |
| A |
0.002 +-BAAA +-+
| BB AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A |
+ BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAABAAAAAAAAAAAAAAAAA +
0 +-+-----+--------+-------+--------+-------+-------+--------+-----+-+
0 10 20 30 40 50 60 70 80
                                 ITERATION #
```
Finally, I took the above model (i.e. LJSpeech with fine-tuned `postnet` and `decoder`) and fine-tuned everything above and including the `attention` layers, as sketched below:
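Depending on the model definition, the attention modules may be standalone or nested inside the decoder, so here I match a name substring instead of a prefix. The `"attention"` substring is an assumption about the parameter names, so verify it against `model.named_parameters()`:

```python
# Trainable: postnet, decoder, and any attention parameters, wherever
# they sit in the module tree; embedding and encoder stay frozen.
for name, param in model.named_parameters():
    param.requires_grad = (
        name.startswith(("postnet", "decoder")) or "attention" in name
    )

optimizer = optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```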
LOSS:
```
0.00095 +-+--------+----------+----------+---------+----------+--------+-+
A + + + + + +
0.0009 AAA eval Loss A +-+
|AAAA train Loss B |
0.00085 +-+ AAAAA +-+
| AAAAAAAA |
0.0008 +-+ AAAAAAA +-+
| AAAAAAAAA |
0.00075 +-+ AAAAAAAAAAAAAAA +-+
| AAAAAAAAAAAAA AAA |
| A AAAAAAAAAAAAAAAAAAAA |
0.0007 +-+ A A AAAAAAA+-+
| |
0.00065 +-+ +-+
BBBB |
0.0006 BBBBBBB B +-+
|B BBBBBBBBBBBBBBBB BBB B BB |
0.00055 +-+ BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBB BBBB BBBBB BB+-+
+ + BBBBB BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +
0.0005 +-+--------+----------+--------B-+----B-BB-BBB-BBB--BBBBBBBBBBB+-+
0 100 200 300 400 500 600
                                 ITERATION #
```
Here’s the same graph as above, in more detail, on TensorBoard.
This last model was training pretty steadily, but slowly, and with bad attention alignment.
I’ve found some bad transcripts that I’m going to fix (fewer than 500 out of 20,000), and then I’m going to fine-tune this most recent checkpoint, updating the encoder as well.
My original thought was to leave the embedding layer and encoder untouched, since they hold a text-based representation and I only want to change the voice, not the language (still English). However, I read in the Deep Voice 2 paper (funny enough, they made Tacotron2 before Google!) that they injected speaker-specific info into the decoder only, and that performed badly; they had to add that information to the encoder to get good multi-speaker results.
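If this turns into proper multi-speaker training rather than single-voice fine-tuning, that Deep Voice 2 finding points at conditioning the encoder side. A toy sketch of one way to do it (none of these names come from this repo; purely illustrative):

```python
import torch
from torch import nn

class SpeakerConditionedEncoder(nn.Module):
    """Wrap an encoder so a learned speaker embedding is appended to every
    encoder timestep; speaker identity then reaches attention and the
    decoder through the encoder outputs, per the Deep Voice 2 observation."""

    def __init__(self, encoder: nn.Module, n_speakers: int, speaker_dim: int):
        super().__init__()
        self.encoder = encoder
        self.speaker_embedding = nn.Embedding(n_speakers, speaker_dim)

    def forward(self, text: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        enc_out = self.encoder(text)                  # (B, T, C)
        spk = self.speaker_embedding(speaker_ids)     # (B, D)
        spk = spk.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return torch.cat([enc_out, spk], dim=-1)      # (B, T, C + D)
```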