Recent experiments with DDC + other settings

I wanted to post some details of my recent runs, to see how they compare with what others doing similar runs are seeing.

Prior to the last few weeks, I’d been working on a copy of the TTS repo from back in late February, and with that I’d got to the point where I had pretty good results (in general, and in particular when used with the external MelGAN from ParallelWaveGAN).

However, I did have some issues with longer sentences, and I was keen to try out the most recent developments with DDC and the integrated vocoder. So I updated to the latest code from dev and tried various runs.

Initially I’ll post results here relating to the main TTS model and I plan to add more detail covering the integrated vocoder shortly (as that needs more runs first, so it may be about a week or so).

Obviously with using the dev code, it’s a little more “at the forefront” and I knew to expect a few bumps on the way! :slightly_smiling_face:

Tensorboard charts

(Note: I’ve turned off the last two as they’re for the integrated vocoder, and the differing chart types would confuse things here.)

Summary of settings

| Colour on charts | Not on Chart | Orange | Mid Blue | Dark Red | Light Blue | Pink | Green | Grey | Orange (blip at 300k) |
|---|---|---|---|---|---|---|---|---|---|
| Period | Late Feb | July | July | July | July | July | July | July | July |
| Phon set | Full | Redux | Full | Full | Redux | Redux | Redux | Full (ignore name!) | Full (ignore name!) |
| DDC | No | Yes, ddc_r = 1 | Yes, ddc_r = 1 | Yes, ddc_r = 1 | Yes, ddc_r = 7 | Yes, ddc_r = 7 | Yes, ddc_r = 7 | No | No |
| fft_size | 1025 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| win_len | N/A | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| hop_len | N/A | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| frm_len | 50 | null | null | null | null | null | 50 | null | null |
| frm_shft | 12.5 | null | null | null | null | null | 12.5 | null | null |
| Stats path | N/A | Yes | Yes | Yes | Yes | No | No | Yes | Yes |
| Pre-emph | 0.98 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.98 | 0.00 | 0.00 |
| Ref Lvl Db | 25 | 0 | 0 | 0 | 20 | 25 | 25 | 25 | 25 |
| Mel FMin | 0.0 | 50.0 | 50.0 | 50.0 | 50.0 | 40.0 | 40.0 | 40.0 | 40.0 |
| Mel FMax | 8k | 7.9k | 7.9k | 7.9k | 8k | 8k | 8k | 8k | 8k |
| Spec Gain | N/A | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Steps | 580k | 170k | 60k | 40k | 70k | >10k | >40k | 370k | ~0 |


Before first successful runs

Initially I had difficulty getting it to run at all: it ran out of memory almost immediately on my GTX 1080Ti.

At that point I hadn’t realised that I’d left in the ddc_r = 1 value that was in the config I’d taken from the CoLab (in hindsight: clearly best to start with the config from the repo instead :slightly_smiling_face: ). I just assumed that using two decoders for DDC would need much more memory (likely true, but not at this level).

I tried reducing the batch sizes used for the regular decoder. That only worked when they were all down to 16 (which has other negative implications but I decided to press on at that point).
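For anyone wondering why ddc_r would matter for memory: as I understand it, a decoder with reduction rate r emits r frames per step, so a smaller r means more decoder steps for the same utterance. Here's a rough sketch of that arithmetic. The frame count is a made-up example, and this ignores everything else that affects memory use.

```python
import math


def decoder_steps(n_frames: int, r: int) -> int:
    """Decoder iterations needed to emit n_frames when each step emits r frames."""
    return math.ceil(n_frames / r)


n_frames = 700  # hypothetical utterance length in mel frames
steps_r1 = decoder_steps(n_frames, 1)  # one frame per step
steps_r7 = decoder_steps(n_frames, 7)  # seven frames per step
print(f"r=1: {steps_r1} steps, r=7: {steps_r7} steps")
```

So on this simplified view, ddc_r = 1 makes the second decoder unroll seven times as many steps as ddc_r = 7 would.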

“Orange 1”

My first July run. The audio was really distorted on the peaks, with that aspect not improving with time as I ran it longer.

The log output showed a lot of “audio amplitude out of range, auto clipped” warnings. Those have come up for me before, but I saw a lot more than usual and they didn’t reduce as training went on.
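To put a number on that kind of warning, something like the following could be run over generated samples. This is just a minimal sketch in plain Python: clip_report is a hypothetical helper (not part of the TTS repo), the [-1.0, 1.0] range is an assumption about float audio, and the sample values are made up.

```python
def clip_report(samples, limit=1.0):
    """Return (clipped copy, fraction of samples whose magnitude exceeds limit)."""
    n_over = sum(1 for s in samples if abs(s) > limit)
    clipped = [max(-limit, min(limit, s)) for s in samples]
    frac = n_over / len(samples) if samples else 0.0
    return clipped, frac


# Made-up waveform with three samples outside the assumed [-1.0, 1.0] range
wave = [0.2, -0.5, 1.3, -1.1, 0.9, 0.0, 1.05, -0.4]
clipped, frac = clip_report(wave)
print(f"{frac:.1%} of samples were out of range")
```

Tracking that fraction across checkpoints would show whether the clipping is really staying constant or slowly shrinking.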

I had applied the stats_path normalisation. Owing to the distortion I wondered if it was causing issues with my dataset, so I tried varying a number of things with subsequent runs.

I was also interested in the idea that not all phonemes are used in English, so I experimented with cutting the phoneme character list defined in config.json down to just those strictly needed for the model. In the next run I returned to the full phoneme set to see if the reduced set had somehow caused or exacerbated the distortion.
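The reduced-set idea can be sketched roughly like this: collect every symbol that appears in the phonemised training text and keep only those from the full list. The symbol list and phonemised lines below are tiny made-up examples (vowels only), not the real config value or real espeak-ng output, and the helper names are hypothetical.

```python
def used_symbols(phonemised_lines):
    """Set of non-whitespace symbols that occur anywhere in the phonemised text."""
    used = set()
    for line in phonemised_lines:
        used.update(ch for ch in line if not ch.isspace())
    return used


def reduce_symbol_set(full_set, phonemised_lines):
    """Keep only the symbols from full_set that occur in the text, preserving order."""
    used = used_symbols(phonemised_lines)
    return "".join(ch for ch in full_set if ch in used)


full_set = "aeiouəɪʊɛɔæ"          # illustrative subset, not the real phoneme list
corpus = ["həloʊ", "ɡʊd dɛɪ"]     # made-up phonemised lines
reduced = reduce_symbol_set(full_set, corpus)
print(reduced)
```

One thing to watch with this approach: any symbol that never occurs in the training text is dropped, so it can't be handled later at inference time either.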

Mid Blue

Although I returned to using the full set of phonemes, I had accidentally left in place the cached phonemes from the reduced set, so this was a useless run - the audio was gibberish (which I only realised when I checked hours later).
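One way to guard against a stale cache like this is to store a fingerprint of the symbol set next to the cached files and check it before training. This is only a sketch of the idea: the marker file name and helper functions are hypothetical and nothing here comes from the TTS repo itself.

```python
import hashlib
import tempfile
from pathlib import Path

MARKER_NAME = "symbols.sha256"  # hypothetical marker file kept inside the cache folder


def symbol_fingerprint(symbols: str) -> str:
    """Hash of the symbol list, so any change to the list changes the fingerprint."""
    return hashlib.sha256(symbols.encode("utf-8")).hexdigest()


def cache_is_stale(cache_dir: Path, symbols: str) -> bool:
    """True if the cache has no marker or was built with a different symbol list."""
    marker = cache_dir / MARKER_NAME
    if not marker.exists():
        return True
    return marker.read_text().strip() != symbol_fingerprint(symbols)


def mark_cache(cache_dir: Path, symbols: str) -> None:
    """Record which symbol list the cache in cache_dir was built with."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / MARKER_NAME).write_text(symbol_fingerprint(symbols))


cache_dir = Path(tempfile.mkdtemp())  # stand-in for the real phoneme cache folder
mark_cache(cache_dir, "reduced-symbols")
print(cache_is_stale(cache_dir, "full-symbols"))  # True: the symbol list changed
```

If the check reports a stale cache, the cached phoneme files would need deleting and regenerating before starting the run.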

Dark Red

Audio still fairly distorted; I stopped it early as I’d realised the ddc_r issue (fixed in the next run).

Light Blue

Fixed the ddc_r issue, setting it to 7.

Alignment 2 (ie for ddc) was markedly better now as expected.

Audio still badly distorted

Pink and Green

These both had problems that I couldn’t pin down. The audio in both was almost non-existent, aside from background noise. The charts for Pink and Green looked odd too.

Pink was worse as it had stopping failures (so samples ran on for 59 seconds); with Green they returned to more normal durations (eg 2-3 seconds). They both had the reduced phoneme set.

With Green I tried the frame length and shifts I’d had in my previous best run (from late Feb-May series of runs).
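For comparison with the explicit win_len / hop_len values in the table, here's a rough conversion between milliseconds and samples. The 22050 Hz sample rate is an assumption (it isn't listed in the table above), and truncating to an integer is my guess at how the conversion would be done.

```python
def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples (truncating)."""
    return int(ms / 1000.0 * sample_rate)


def samples_to_ms(n_samples: int, sample_rate: int) -> float:
    """Convert a number of samples to a duration in milliseconds."""
    return n_samples / sample_rate * 1000.0


sample_rate = 22050  # assumed sample rate

print(ms_to_samples(50, sample_rate))     # frame length of 50 ms, in samples
print(ms_to_samples(12.5, sample_rate))   # frame shift of 12.5 ms, in samples
print(round(samples_to_ms(1024, sample_rate), 1))  # win_len of 1024, in ms
print(round(samples_to_ms(256, sample_rate), 1))   # hop_len of 256, in ms
```

At that assumed rate the two sets of settings are close but not identical, so Green would differ slightly from the other runs in framing as well as in pre-emphasis.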

Grey and Orange 2

For Grey I gave up on the phoneme idea.

I let it run for quite some time and Grey is the best of my July runs so far.

It’s still clearly distorted, but if you ignore the peaks, it has quite good qualities (although a long way from usable)

I was going to continue training with Orange 2 but decided I had better explore whether the integrated vocoder could somehow counteract the distortions, so I stopped Orange 2 and switched to looking at the vocoder. I plan to write that up once I’ve done more runs and have findings to share.

I’m still interested to see if I can figure out what to do around normalisation of the audio. I wonder if for my dataset somehow the stats_path approach is too much. I hope to try plotting some of my audio samples before and after to explore it further.

I know voice audio can be asymmetric, so I wonder if the higher amplitude side is being amplified enough to go out of range whilst the lower side is still well within range (this is part of the motivation behind my desire to plot the audio samples).
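Before plotting, a quick numeric check of that asymmetry idea might look like this. It's only a sketch: the waveform is made up, the gain of 2.0 is arbitrary (it is not what stats_path normalisation actually applies), and the [-1.0, 1.0] range is an assumption.

```python
def peak_asymmetry(samples):
    """Return (largest positive peak, largest negative peak as a magnitude)."""
    pos = max((s for s in samples if s > 0), default=0.0)
    neg = max((-s for s in samples if s < 0), default=0.0)
    return pos, neg


def out_of_range(samples, limit=1.0):
    """Samples whose magnitude exceeds the allowed range."""
    return [s for s in samples if abs(s) > limit]


wave = [0.1, 0.8, -0.3, 0.6, -0.4]   # made-up asymmetric waveform
pos, neg = peak_asymmetry(wave)
print(f"positive peak {pos}, negative peak {neg}")

gain = 2.0                            # arbitrary gain, purely for illustration
scaled = [s * gain for s in wave]
over = out_of_range(scaled)
print(f"out of range after gain: {over}")
```

In this toy case only the positive side goes out of range after the gain is applied, which is the pattern I'd be looking for in the real samples.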

Hope that’s of interest. If you’ve had similar (or different!) experiences, would be great to hear about them.


By the way: if viewing this on a large screen desktop and you find the table’s a bit squashed (as the desktop mode of the forum’s layout is rather narrow) you can view it in responsive mode, which should use the full width, with this link where the parameter ?mobile_view=1 is appended:

NB: You’ll need to resubmit the page address in the browser address bar (ie it doesn’t work just by clicking the link)

Quick update: After a fair bit of experimenting, I’ve got some pretty good results with the integrated Vocoder with an example reading of a recipe here:

The TTS model used here wasn’t trained with DDC but it manages to be fairly stable all the same. The Vocoder ran for 1,000,000 steps and is pretty smooth - I plan to do a little more experimentation to see if I can fine-tune just a bit more quality out of it.

Overall, I’m really pleased. There is still the occasional glitch with a word. Generally, glitches seem most prevalent when words are used in speech patterns that didn’t come up in my training data. When correct phonemes are supplied by espeak-ng, it can go beyond the original vocabulary fairly well; it’s more the way the words are used that seems to be the issue. A limitation of my original TTS model training was the minimum length of the utterances used, and that definitely shows - it can’t say “Good.” on its own, but is fine with it in a sentence!

Lists, questions, imperatives and splitting up clauses seem to be reasonably well handled too.