I’ve trained on LJSpeech to confirm a working setup.
I’ve created a custom dataset of ~15K utterances. AnalyzeDataset looks good after I filtered out outliers (utterances longer than 63 characters). CheckSpectrograms also checks out. My dataset is in LJSpeech format.
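In case it’s useful to anyone, the length filter is nothing fancy; here’s a minimal sketch of the idea, assuming the standard LJSpeech metadata.csv layout (id|raw text|normalized text):

```python
MAX_CHARS = 63  # my cutoff; tune it to your own length distribution

# Keep only utterances whose normalized transcript is within the length cutoff
with open("metadata.csv", encoding="utf-8") as fin, \
     open("metadata_filtered.csv", "w", encoding="utf-8") as fout:
    for line in fin:
        fields = line.rstrip("\n").split("|")
        if len(fields[-1]) <= MAX_CHARS:  # length of the normalized transcript
            fout.write(line)
```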
I think the best way to summarize the performance is this: after 20K iterations, all 4 test files in test_audios/2XXXX have the same length as each other and as the first 4 test files in 1XXX. In contrast, LJSpeech quickly shows divergence in both the files and their lengths.
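(If it helps, I’m checking this by just listing the durations of the synthesized test wavs with something like the snippet below; it assumes soundfile is installed and the test_audios/<checkpoint> layout above.)

```python
import glob
import soundfile as sf

# Print the duration of every synthesized test file, grouped by checkpoint folder
for path in sorted(glob.glob("test_audios/*/*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s")
```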
I’m a bit stumped after having carefully checked file formats/sampling rates, etc. I know my data isn’t completely clean, but I would have expected some result. FWIW I’m using subtitle data for the character Cartman from South Park.
I’m pretty confident the audio parameters are correct, though I’ll try again with the latest commits.
Background noise/silence is definitely an issue, but a relatively rare one, I think. This dataset is dialogue, so there are one-word turns, e.g. “No!”, as well as pauses in the middle of a turn, e.g. “Well… I guess so.”
I’m looking at improving the word-level alignments and manually cleaning the data. Any suggestions on dealing with this type of data (DVD audio/subtitle data) would be appreciated.
I trained a custom dataset too, and on my initial attempts I was disheartened because I faced similar issues with it, even though I got good results with LJSpeech.
YMMV, but what I found, after doing cleanup along the lines of @erogol’s third suggestion, was that for quite a while during training it still gave terrible results (all the same kind of noise, as you seem to be observing), but then quite rapidly hit a point where it dramatically improved - not into perfect speech, but something much more speech-like and distinct. I’m not at my computer and don’t recall the exact point, but I think it was around 20-30k iterations, and things started to get pretty decent by 100k - so maybe you’re not going quite far enough?
The other thing that helps give a bit more feedback (which you may have been doing) is using Tensorboard, as you can see how the results improve more clearly than by just running the model and checking the test files manually.
One last point: I had to throw out a bunch of my audio as it was incorrectly clipped. I wonder if perhaps with the subtitles you might have cases where the text doesn’t actually match what was said (it’s quite common for subtitles to be shortened from the spoken words) - it would be hard to check, but that could possibly be throwing the model off.
One additional trick for finding bad instances is to run your trained network on your training data and look at the instances with the highest loss values. These would likely correspond to the worst instances in your dataset. Repeat this until you are satisfied with the quality of your worst instances.
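In rough terms it boils down to something like the sketch below; `model`, `criterion`, and `train_set` are placeholders for your trained model, its loss function, and a dataset over your training utterances (the sketch assumes PyTorch):

```python
import torch

# Score every training instance with the trained model and rank by loss.
model.eval()
per_item_loss = []
with torch.no_grad():
    for idx in range(len(train_set)):
        inputs, target = train_set[idx]            # one training instance
        output = model(inputs.unsqueeze(0))        # batch of size 1
        loss = criterion(output, target.unsqueeze(0))
        per_item_loss.append((loss.item(), idx))

# Highest-loss items first - these indices point at the utterances worth re-checking
for loss_value, idx in sorted(per_item_loss, reverse=True)[:50]:
    print(f"{loss_value:.4f}  item {idx}")
```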
I redid the forced alignment and put some conservative filters on the output. The light blue line is my current run. Already it is further along (in sound quality) than any of my previous runs. Once I know this is working, I’ll link the code here.
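To give a flavor of what I mean by “conservative filters” (this isn’t the actual code, which I’ll still link once it’s working; the segment fields and thresholds here are just illustrative): each aligned segment is kept only if its duration and per-character speaking rate fall inside a fairly tight band.

```python
def keep_segment(seg, min_dur=0.5, max_dur=10.0,
                 min_chars_per_sec=2.0, max_chars_per_sec=25.0):
    """Keep a segment only if its duration and speaking rate look plausible."""
    duration = seg["end"] - seg["start"]
    if not (min_dur <= duration <= max_dur):
        return False
    chars_per_sec = len(seg["text"]) / duration
    return min_chars_per_sec <= chars_per_sec <= max_chars_per_sec

# Example: the second segment is clearly misaligned and gets dropped
segments = [
    {"start": 0.0, "end": 1.2, "text": "No!"},
    {"start": 1.2, "end": 1.3, "text": "Well... I guess so."},
]
kept = [s for s in segments if keep_segment(s)]
```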
Just responded in another thread: I’m doing a run on a 1080ti right now, and it’s done about 60K iterations in 15 hours. The speech is somewhat intelligible but not great yet. I wouldn’t say there’s been a tipping point; phonemes/syllables have slowly emerged from the beginning of the test utterance and increased in length over time.
FWIW I’ve decided I have to manually clean my data, so I’ve built a web-based tool to facilitate that process. Here’s the blog post with links to GitHub and walk-through video: https://olney.ai/category/2019/06/18/manualalignment.html
Once I’ve cleaned it up, I’ll post back to this thread with the training results.
I’ve only cleaned about a third of my data, but it’s definitely working with about 2500 utterances. Major cleaning things I’ve done:
Start/end times. There may be some silence at either end, but only about 10 ms on average (see the sketch after this list)
Noise. Any kind of noise I coded and separated. Music, hum, crickets, etc.
Yelling/pleading/unusual vocalizations. If it wasn’t subjectively normal speech, I separated it out.
Splitting larger utterances. Any utterance with pauses where the prosody seemed reasonable for splitting, I split (the split-point check is also in the sketch below). I was aggressive about this, since my original length distribution had a long tail.
Fixing incorrect transcripts (obvious)
I’m using the manual cleaning webpage I linked before.
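The start/end trimming and split-point checks above are roughly of the following shape (a sketch only, not the tool’s actual code; the filename and top_db thresholds are made up for illustration):

```python
import librosa

# Load one utterance; sr=None keeps the file's native sampling rate
y, sr = librosa.load("wavs/cartman_0001.wav", sr=None)

# 1. Trim leading/trailing silence so only a few ms remain at the edges
y_trimmed, (start, end) = librosa.effects.trim(y, top_db=30)
print(f"trimmed {start / sr * 1000:.0f} ms from the front, "
      f"{(len(y) - end) / sr * 1000:.0f} ms from the back")

# 2. Flag long internal pauses - candidate points for splitting the utterance
intervals = librosa.effects.split(y, top_db=30)
for (a_start, a_end), (b_start, b_end) in zip(intervals, intervals[1:]):
    gap = (b_start - a_end) / sr
    if gap > 0.4:  # pauses longer than ~400 ms are worth a look
        print(f"possible split point at {a_end / sr:.2f}s (gap of {gap:.2f}s)")
```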
Up to now, I’ve been using a training regime where I’d repeatedly add 500 utterances and continue a previous model run. However, I’m finding that a new model run with all utterances performs much better.