Best config for Tacotron2 training

@erogol What is the best config for Tacotron2 training? I see that the config_tacotron2.json in the master branch differs from the one that ships with the latest pretrained Tacotron2 model.
In particular, which are the better choices:

  1. attention_norm: sigmoid vs softmax
  2. prenet_type: original vs bn
  3. loss_masking: true vs false
  4. enable_eos_bos_chars: false vs true

There is no single best choice; it all depends on the dataset. But it is better to start with the settings from the original Tacotron paper.

I would like to open a discussion about the config.json file included in the master branch. While the question of a “best” configuration may not have a universal answer, a consistent exemplary configuration provides a solid foundation for individual experiments.

Towards this goal, I propose removing or changing some confusing elements in the current config file. Right now, it presents an exemplary configuration for Tacotron2 training with LJSpeech. This makes sense considering the “Collaborative Experimentation Guide” advertised in the README, which likewise advocates using LJSpeech for experiments. However, there seems to be a confusing inconsistency between some of the parameter values and their respective comments.

Consider for instance:

"do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
If the comment proposes not to use do_trim_silence with LJSpeech, the parameter's value should be false.

A second example:

"attention_norm": "sigmoid", // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
Set attention_norm to softmax, or indicate why sigmoid should be used if Tacotron2 is trained with LJSpeech.

In my opinion, the problem is not only the mismatch between the parameter value and the comment, but also that it becomes unclear whether the values of other parameters are already adapted to LJSpeech or require further changes. While the inconsistency is obvious in the examples above, the same may hold for other parameters, just less obviously.

Independent of the specific training set, this comment does not reflect the source code:

"sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.

Yet the audio is not actually resampled; see #405. Maybe recommend resampling before training instead, or remove that part of the comment?
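For the offline route, a few lines with librosa would do, e.g. (just a sketch; the file names are placeholders):

```python
import librosa
import soundfile as sf

# librosa.load resamples to the requested rate while loading
wav, sr = librosa.load("LJ001-0001.wav", sr=22050)
sf.write("LJ001-0001_22k.wav", wav, sr)
```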

I understand that - as TTS is under active development - knowledge and best practices change frequently, so keeping a consistent, up-to-date config file might be difficult. Alternatively, maybe there could be a section in the wiki devoted to lessons learned regarding model/dataset configurations, along with exemplary configurations?

I’m happy to make a PR for the parts identified above. However, maybe my understanding of the whole situation is not on point, so I’m happy to hear your thoughts.

This seems worth a look - I agree, as I initially ran into a few of the points you raise myself (eg the comment recommending softmax for Tacotron2, even though most of the config files that appear to use Tacotron2 go against that comment!).

I don’t want this to distract from the points @thllwg raises above, but one related question has been in the back of my mind for a while, which I figured I’d mention here for general comment: might there be some easier / better option for handling config files here? (I can move this to another thread if people prefer.)

I haven’t looked at many other projects on this point specifically, but it seems worth doing some research rather than reinventing the wheel.

Q. If there were interest in this area, what features would people most value?

I saw a library recently that claims to handle layered config files (plus a few other things). It’s called Confuse - I haven’t used it, so I can’t vouch for it personally, but it has a decent(ish) number of stars (~218 as of June 2020).

What seems appealing to me about layering the configs is that we could have a base config with all the core settings, then another layer per broad category (eg Tacotron vs Tacotron2) - and because the library merges the layers, that file only needs the key differences. Finally, experiments and people’s individual efforts would go in a top layer, which in most cases would be very sparse but would also make it clear when custom or newly tried settings were in play.

Not sure how necessary this would be, but by encouraging people to put things like their specific folder paths in a final personal layer (or pointing it to a ~/.config file, as Confuse can do), there would often be nothing (or almost nothing) to edit when pulling the latest updates from the repo.
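To make that concrete, here’s a rough sketch of the layering, going only from Confuse’s README (the file names are invented and I haven’t run this, so treat it as illustrative):

```python
import confuse

# Each set_file call adds an override layer; per the Confuse docs,
# later-added files take priority over earlier ones.
config = confuse.Configuration("TTS")
config.set_file("config_base.yaml")       # core settings shared by all models
config.set_file("config_tacotron2.yaml")  # only the keys Tacotron2 changes
config.set_file("my_experiment.yaml")     # sparse, per-experiment tweaks

batch_size = config["batch_size"].get(int)
```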

This would add some initial complexity but sounds like it might pay off with greater clarity for new users and some simplicity for the more experienced users too. The comments could then naturally sit in the relevant places (eg something advised for use with Tacotron2 would just be commented on in the Tacotron2 layer config file).

I’d probably suggest that for the core settings we mainly just have comments indicating the allowed values or the widest limits that would ever be useful (but that’s just an idea).

Confuse can also handle config value validation (that hasn’t typically caught me out much, but I’ve had the odd typo I could’ve avoided!).
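For example (again going only from the docs, untested):

```python
import confuse

config = confuse.Configuration("TTS")
config.set_file("config.yaml")  # hypothetical YAML version of the config

# Values are validated on access, so a typo like "sofmax" raises an
# error here rather than surfacing mid-training.
attention_norm = config["attention_norm"].as_choice(["softmax", "sigmoid"])
sample_rate = config["sample_rate"].get(int)
```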

Of course if others have alternative suggestions, or simply think this isn’t a wise idea (either right now or ever!), do please say 🙂

@thllwg thanks for pointing out the conflicts. I just overlooked the comments as I experimented with the values and got better results. I’ll update them to match the latest config.json - or feel free to send a PR. It is really hard to keep things consistent as I run experiments every day. However, the config.json in the TTS repo is always the best config I have used so far, even with such inconsistencies.

Regarding the whole config situation, I’d favor keeping the config files as simple as possible, with less branching. That would make the models easier to maintain and distribute.

Probably one approach to consider is to release recipes (as in Kaldi) per dataset, each with the config of the best model we have trained so far for that dataset. That would at least ease the burden for a novice user.

@nmstoker I’ll also check Confuse. Maybe we can replicate its behaviour internally without introducing a new dependency.
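The core of that behaviour is just a recursive dictionary merge, so a dependency-free version could be quite small - something like this sketch (not what TTS does today, and config.json’s // comments would still need stripping before json.load):

```python
import json

def merge_configs(base, override):
    """Recursively overlay `override` on top of `base`; override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_layered(*paths):
    """Load JSON config layers in order; later files take priority."""
    config = {}
    for path in paths:
        with open(path) as f:
            config = merge_configs(config, json.load(f))
    return config

# e.g. (hypothetical file names):
# config = load_layered("config_base.json", "config_tacotron2.json", "my_run.json")
```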


I think separating the model configuration (layer definitions and model-specific settings) from the rest is a good alternative, since as we add new models to TTS (as I am doing now), it gets harder to accommodate all of the models in a single config.

What about creating a separate repo which maintains a universal standard for configs related to TTS (and ASR)? Then anyone working on a text-to-speech project could base their configuration on this standard.

It would define unique names for the common parameters required for data processing and well-known models, and also suggest defaults. Maybe this would make things more consistent - like “do_trim_silence” in TTS vs “trim_silence” in ESPNet.

This repo could also maintain the best-performing config for each model with regard to each dataset.

A separate repo with dataset-specific runs would be a nice solution. Each dataset could have its own folder, and people would just need to send a shell script that downloads and installs everything (the dataset, the right TTS commit, etc.) and runs training.
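As a rough illustration, a recipe entry point could be as thin as this (the commit hash and recipe paths are invented; only the LJSpeech URL is real):

```python
import subprocess

# Hypothetical LJSpeech recipe: fetch the data, pin the TTS commit the
# recipe was validated against, and launch training with the recipe config.
DATASET_URL = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
TTS_COMMIT = "abc1234"  # hypothetical pinned commit

subprocess.run(["wget", "-nc", DATASET_URL], check=True)
subprocess.run(["tar", "-xjf", "LJSpeech-1.1.tar.bz2"], check=True)
subprocess.run(["git", "checkout", TTS_COMMIT], check=True)
subprocess.run(
    ["python", "train.py", "--config_path", "recipes/ljspeech/config.json"],
    check=True,
)
```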
