I wanted to post some details of my recent runs, to see how consistent they are with what others doing similar runs have found.
Prior to the last few weeks, I’d been working on a copy of the TTS repo from back in late February, and with that I’d got to the point where I had pretty good results (in general, and in particular when used with the external MelGAN vocoder from ParallelWaveGAN).
However I did have some issues with longer sentences and I was keen to try out the most recent developments with DDC and the integrated vocoder. Therefore I updated with the latest code from dev and tried various runs.
Initially I’ll post results here relating to the main TTS model and I plan to add more detail covering the integrated vocoder shortly (as that needs more runs first, so it may be about a week or so).
Obviously, using the dev code means it’s a little more “at the forefront”, and I knew to expect a few bumps along the way!
(Note: I’ve turned off the last two runs as they’re for the integrated vocoder, and the differing chart types would confuse things here.)
Summary of settings
| Colour on charts | Not on chart | Orange | Mid Blue | Dark Red | Light Blue | Pink | Green | Grey | Orange (blip at 300k) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codebase | Late Feb | July --> | | | | | | | |
| Phon set | Full | Redux | Full | Full | Redux | Redux | Redux | Full (ignore name!) | Full (ignore name!) |
| DDC | No | Yes, ddc_r = 1 | Yes, ddc_r = 1 | Yes, ddc_r = 1 | Yes, ddc_r = 7 | Yes, ddc_r = 7 | Yes, ddc_r = 7 | No | No |
| Ref Lvl (dB) | 25 | 0 | 0 | 0 | 20 | 25 | 25 | 25 | 25 |
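For anyone wanting to try the same variations, the rows above correspond to fields in config.json. This is my understanding of the relevant field names in the repo; treat it as a partial sketch rather than a full working config:

```json
{
  "double_decoder_consistency": true,
  "ddc_r": 7,
  "ref_level_db": 25,
  "stats_path": "scale_stats.npy",
  "batch_size": 32
}
```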
Before first successful runs
Initially I had difficulties getting it to run at all: it ran out of memory almost immediately on my GTX 1080 Ti.
At that point I hadn’t realised that I’d left in the ddc_r = 1 value from the config I’d taken from the Colab (in hindsight, it’s clearly better to start with the config from the repo instead). I just assumed that using two decoders for DDC would need much more memory (likely true, but not to this extent).
I tried reducing the batch sizes used for the regular decoder. That only worked once they were all down to 16 (which has other negative implications, but I decided to press on at that point).
My first July run. The audio was really distorted at the peaks, and that aspect didn’t improve as I ran it for longer.
The log output showed a lot of “audio amplitude out of range, auto clipped” warnings - those have come up for me before, but I saw far more than usual and they didn’t improve given time.
I had applied the stats_path normalisation. Given the distortion, I wondered if it was causing issues with my dataset, so I varied a number of things in subsequent runs.
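For anyone unfamiliar, the stats_path approach applies mean-variance normalisation to each spectrogram using dataset-wide statistics, with out-of-range values clipped. This is a rough sketch of the idea (not the repo’s exact code; the statistics here are computed from fake data rather than loaded from scale_stats.npy):

```python
import numpy as np

def normalize(mel, mel_mean, mel_std, clip=4.0):
    # Scale each mel bin to roughly zero mean, unit variance.
    norm = (mel - mel_mean) / (mel_std + 1e-8)
    # Values beyond the clip range get flattened, which is one place
    # distortion can creep in if the stats don't suit the dataset.
    return np.clip(norm, -clip, clip)

# Fake spectrogram (mel bins x frames) standing in for a real sample.
mel = np.random.randn(80, 100) * 2.0 + 1.0
mel_mean = mel.mean(axis=1, keepdims=True)
mel_std = mel.std(axis=1, keepdims=True)

out = normalize(mel, mel_mean, mel_std)
print(out.min() >= -4.0 and out.max() <= 4.0)  # True
```

If the dataset statistics are a poor fit for a given sample, a lot of values end up at the clip boundary, which may relate to the “auto clipped” warnings above.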
I was also interested in the idea that not all phonemes are used in English, so I experimented with cutting the phoneme character list defined in config.json down to just those strictly needed for the model. In the next run I returned to the full phoneme set, to see whether the reduction had somehow caused or exacerbated the distortion.
Although I returned to using the full set of phonemes, I had accidentally left in place the cached phonemes from the reduced set, so this was a useless run - the audio was gibberish (which I only realised when I checked hours later).
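If anyone wants to try the same phoneme reduction, the subset can be derived from the data rather than guessed. A minimal sketch, assuming you already have phonemized transcripts (e.g. from a phonemizer pass over your metadata; the example strings below are just illustrations):

```python
def used_phonemes(phonemized_lines):
    """Collect the set of phoneme characters actually used, as a sorted string."""
    chars = set()
    for line in phonemized_lines:
        chars.update(line)
    chars.discard(" ")  # word separator, not a phoneme
    return "".join(sorted(chars))

lines = ["hɛloʊ wɝld", "ðɪs ɪz ə tɛst"]
print(used_phonemes(lines))
```

One caveat from my mistake above: after changing the phoneme set in config.json, the phoneme cache directory needs clearing, otherwise cached files built against the old set get reused.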
The audio was still fairly distorted; I stopped it early as I realised the ddc_r issue (fixed in the next run).
Fixed the ddc_r issue, setting it to 1.
Alignment 2 (i.e. for DDC) was markedly better now, as expected.
The audio was still badly distorted.
Pink and Green
These both had problems that I couldn’t pin down. The audio in both was almost non-existent, aside from background noise. The charts for Pink and Green looked odd too.
Pink was worse, as it had stopping failures (so samples ran on for 59 seconds); with Green they returned to more normal durations (e.g. 2-3 seconds). Both used the reduced phoneme set.
With Green I tried the frame length and shift values I’d used in my previous best run (from the late Feb-May series of runs).
Grey and Orange 2
For Grey I gave up on the phoneme idea.
I let it run for quite some time, and Grey is the best of my July runs so far.
It’s still clearly distorted, but if you ignore the peaks it has quite good qualities (although it’s a long way from usable).
I was going to continue training with Orange 2, but decided I had better explore whether the integrated vocoder could somehow counteract the distortions, so I stopped Orange 2 and switched to looking at the vocoder. I plan to write that up once I’ve done more runs and have findings to share.
I’m still interested to see if I can figure out what to do around normalisation of the audio. I wonder if for my dataset somehow the stats_path approach is too much. I hope to try plotting some of my audio samples before and after to explore it further.
I know voice audio can be asymmetric, so I wonder if the higher-amplitude side is being amplified enough to go out of range whilst the lower side is still well within range (this is part of the motivation behind my desire to plot the audio samples).
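As a quick check of that asymmetry idea before doing full plots, comparing the positive and negative peaks is enough. A minimal sketch using a synthetic waveform (swap in your own samples loaded with e.g. librosa or soundfile):

```python
import numpy as np

def asymmetry_report(wav):
    """Compare positive and negative peak amplitudes of a waveform."""
    pos_peak = wav.max()
    neg_peak = -wav.min()
    return pos_peak, neg_peak, pos_peak / max(neg_peak, 1e-8)

# Synthetic asymmetric waveform: the positive half is exaggerated,
# standing in for an asymmetric voice recording.
t = np.linspace(0, 1, 16000)
wav = np.sin(2 * np.pi * 220 * t)
wav = np.where(wav > 0, wav * 1.3, wav)

pos, neg, ratio = asymmetry_report(wav)
print(f"pos peak {pos:.2f}, neg peak {neg:.2f}, ratio {ratio:.2f}")
```

If the ratio is well above 1, any gain applied during normalisation pushes the larger side out of range first, which would match the clipping warnings on peaks only.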
Hope that’s of interest. If you’ve had similar (or different!) experiences, would be great to hear about them.