Additional ideas around dataset and TTS output testing

There are already some handy tools in the repo for looking at dataset issues (e.g. here and here).

However, for a while I’ve had an idea in the back of my mind: estimate the syllables present in the audio, compare that count to the transcript text to highlight discrepancies, and see if the process could be semi-automated to save time.

There are various ways you could do this, including using speech recognition on the audio side, but I identified an approach for the audio that works tolerably well (it isn’t perfect, but it seems good enough to be useful).

It’s presented in a Gist here: https://gist.github.com/nmstoker/f1590847a16b66ab22c16722aac1cc51

If people think it would be useful to add to the repo, I’m happy to do a PR.

For the audio it uses a library called parselmouth, which in turn calls a Praat script. For the text syllables there is a handy little library called syllapy.
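
To give a flavour of how the two sides fit together, here is a minimal sketch. The Praat script filename and its parameter values are placeholders (see the Gist for the actual script and settings), and the assumption that the script prints its nucleus count to the Info window is mine:

```python
import parselmouth
from parselmouth.praat import run_file
import syllapy

def count_audio_syllables(wav_path: str) -> int:
    """Estimate spoken syllables by running a syllable-nuclei Praat script."""
    # run_file executes a Praat script; capture_output=True also returns
    # whatever the script printed to Praat's Info window.
    _, info = run_file(
        "syllable_nuclei.praat",  # placeholder name for the script
        -25,   # silence threshold (dB) -- illustrative value
        2,     # minimum dip between peaks (dB) -- illustrative value
        0.3,   # minimum pause duration (s) -- illustrative value
        wav_path,
        capture_output=True,
    )
    # Assume the script reports the count as the last token it prints.
    return int(info.strip().split()[-1])

def count_text_syllables(transcript: str) -> int:
    """Count transcript syllables; syllapy works one word at a time."""
    return sum(syllapy.count(word) for word in transcript.split())
```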

I ran it on LJSpeech 1.1, as that’s what people here often use, at least for experimentation. It’s a well-produced dataset, yet the check still identified one particular case with a clear problem.

For new / private / self-produced datasets this could be a very useful way to avoid having to manually inspect each audio/transcript pair. At the very least it lets you target such efforts initially (see the sketch below).
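
As a rough illustration of that targeting, here is how you might scan an LJSpeech-style layout (pipe-separated metadata.csv rows of id|transcript|normalized transcript, audio under wavs/) using the counting helpers above. The tolerance value is illustrative, not a tuned threshold:

```python
from pathlib import Path

def flag_suspect_pairs(dataset_dir: str, tolerance: float = 0.2):
    """Yield (wav, text, audio_count, text_count) for mismatched pairs."""
    root = Path(dataset_dir)
    with open(root / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            file_id, _, normalized = line.rstrip("\n").split("|", 2)
            wav = root / "wavs" / f"{file_id}.wav"
            audio_n = count_audio_syllables(str(wav))
            text_n = count_text_syllables(normalized)
            # Flag pairs whose counts differ by more than the tolerance.
            if abs(audio_n - text_n) > tolerance * max(text_n, 1):
                yield wav, normalized, audio_n, text_n

# Inspect the worst offenders first rather than every file:
suspects = sorted(flag_suspect_pairs("LJSpeech-1.1"),
                  key=lambda r: abs(r[2] - r[3]), reverse=True)
```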

And there is also scope for running it on audio output from TTS, to check there aren’t cases of repeated words (i.e. as often happens when there are stopnet issues). You could create a largish batch of new transcript sentences to test, fire these at the TTS server using requests to create the audio files, and then run the comparison between the audio and transcript to focus on problem cases.
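
That loop could look something like the sketch below. The /api/tts endpoint, its text parameter and the port are assumptions based on the demo server, so adjust for your setup, and the 1.2 factor is just an illustrative trigger:

```python
import requests

def synthesize_and_check(sentences, server="http://localhost:5002"):
    """Synthesize each sentence, then compare audio vs text syllable counts."""
    for i, sentence in enumerate(sentences):
        resp = requests.get(f"{server}/api/tts", params={"text": sentence})
        resp.raise_for_status()
        wav_path = f"tts_test_{i:04d}.wav"
        with open(wav_path, "wb") as f:
            f.write(resp.content)  # assumes the endpoint returns raw WAV bytes
        audio_n = count_audio_syllables(wav_path)
        text_n = count_text_syllables(sentence)
        # A surplus of audio syllables hints at repeated words / stopnet issues.
        if audio_n > text_n * 1.2:
            print(f"check {wav_path}: {audio_n} audio vs {text_n} text syllables")
```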

With Praat, there are potentially options to go a bit further than purely syllables, e.g. using its “voice report” (some details here), so if people have feedback or suggestions before I add this, do fire away :slightly_smiling_face:
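
For anyone curious what that might look like, here is an illustrative way to pull a voice report (jitter, shimmer, HNR, etc.) through parselmouth. The parameter values are just Praat’s usual defaults, shown as an example rather than recommended settings:

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("example.wav")  # hypothetical input file
pitch = snd.to_pitch()
pulses = call(snd, "To PointProcess (periodic, cc)", 75, 600)
report = call([snd, pitch, pulses], "Voice report",
              0.0, 0.0,    # time range (0, 0 = whole file)
              75, 600,     # pitch floor / ceiling (Hz)
              1.3, 1.6,    # maximum period / amplitude factor
              0.03, 0.45)  # silence / voicing threshold
print(report)  # plain-text summary you could parse for outliers
```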