Given the number of variables, I think it's important that we try to share findings to help build up a general sense of how they interact and which settings tend to work well versus poorly. At the same time, we should take care not to rule out choices entirely just because they've been less effective with certain datasets: I suspect that what works for one dataset won't always be best for another. Testing empirically as far as one can is the safest approach (although there are clearly practical limits to that).
A further challenge is that what works well at one point in the repo history isn't necessarily still the case later. I've spent quite a bit of time recently trying to find the best settings for my own dataset with a recent commit from the dev branch, and so far many of the variations seem to give worse results than I had with a commit from February. Once I have the new vocoder trained I'll be able to get a fuller picture, as right now I'm having to rely on Griffin-Lim (GL) output alone.
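For anyone else stuck listening to GL output while a vocoder trains, this is roughly the kind of quick inversion I mean. It's a minimal sketch with librosa, assuming you have a de-normalised power mel spectrogram as a NumPy array and know the audio parameters the model was trained with (the values below are placeholders); the repo's own synthesis utilities do this properly, this is just to illustrate the idea:

```python
import numpy as np
import librosa
import soundfile as sf

# Assumed audio parameters -- these must match the training config.
SAMPLE_RATE = 22050
N_FFT = 1024
HOP_LENGTH = 256

def mel_to_wav_griffin_lim(mel: np.ndarray, n_iter: int = 60) -> np.ndarray:
    """Invert a (n_mels, frames) power mel spectrogram to audio via Griffin-Lim.

    Quality is well below a trained neural vocoder, but it's good enough
    for sanity-checking the acoustic model while training is in progress.
    """
    return librosa.feature.inverse.mel_to_audio(
        mel,
        sr=SAMPLE_RATE,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_iter=n_iter,
    )

# Example usage: `mel` would come from the TTS model's output,
# de-normalised back to a linear power mel first.
# wav = mel_to_wav_griffin_lim(mel)
# sf.write("gl_preview.wav", wav, SAMPLE_RATE)
```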
I'll write up my various runs soon, but I have been trying out the new DDC (Double Decoder Consistency) feature. I still need to explore the normalisation settings further, as I'm now getting a lot more distortion in the output audio samples, along with messages during training about the audio being clipped. This has happened both with DDC enabled and with it turned off. I've seen these warnings before, but they're much more prevalent now.
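To get a rough handle on how bad the clipping actually is, I've been checking peak levels on the saved audio samples with something along these lines (a simple sketch, assuming the training run dumps .wav files into an output directory; the path and threshold are placeholders):

```python
import numpy as np
import soundfile as sf
from pathlib import Path

SAMPLE_DIR = Path("output/audio_samples")  # placeholder path to saved samples
CLIP_THRESHOLD = 0.999                     # treat anything at/near full scale as clipped

for wav_path in sorted(SAMPLE_DIR.glob("*.wav")):
    audio, sr = sf.read(wav_path)
    peak = float(np.max(np.abs(audio)))
    clipped_fraction = float(np.mean(np.abs(audio) >= CLIP_THRESHOLD))
    print(f"{wav_path.name}: peak={peak:.3f}, "
          f"clipped={clipped_fraction:.2%} of samples")
```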
One other thing I've been looking into is whether the model benefits from reducing the phoneme alphabet to just the phonemes that actually appear in the language being trained. English uses roughly 40-50 phonemes, so the extra symbols seem superfluous. My intuition was that trimming the phoneme alphabet would help, but my initial results suggest it makes very little measurable difference: a tiny reduction in the model parameter count (about 0.1% smaller) and a small reduction in training time. Assuming this holds up under further testing, my suspicion is that the model simply learns to ignore the unused phonemes.
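The tiny saving makes sense once you consider where the unused phonemes live: they only occupy rows in the input embedding table, which is a small fraction of the whole model. A back-of-the-envelope calculation with placeholder figures (a 512-dim embedding, ~130 symbols in the full alphabet versus ~50 actually used, and a model of roughly 30M parameters, none of which are exact for any particular config) comes out in the same ballpark:

```python
# Rough illustration of why pruning unused phonemes barely shrinks the model.
EMBEDDING_DIM = 512          # assumed phoneme embedding size
FULL_SYMBOLS = 130           # assumed size of the full phoneme alphabet
USED_SYMBOLS = 50            # roughly what English actually needs
TOTAL_PARAMS = 30_000_000    # rough order of magnitude for the whole model

saved = (FULL_SYMBOLS - USED_SYMBOLS) * EMBEDDING_DIM
print(f"Parameters saved: {saved:,}")                   # 40,960
print(f"Relative saving:  {saved / TOTAL_PARAMS:.2%}")  # ~0.14%
```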