The challenges of aligning spoken words to text without supervision

There are some voices I would like to have for Tacotron, for instance the voice of Elvis.

There are some challenges I see:

  1. There are no transcripts. I might be able to use CMUSphinx to get some rudimentary recognition after splitting a mostly background-noise-free recording at the silences. If recognition is good, this at least gives me decent alignment, but it lacks punctuation, and silence does not always mean “end of sentence”. (A sketch of this splitting-and-recognition step follows this list.)
  2. There are transcripts, but more like a book. There’s the question of how to align the voice to a book and where to split. I could again use CMUSphinx with some… it’s getting complicated.
  3. There are transcripts, but they are only roughly aligned. Sometimes more words are spoken, sometimes fewer, than what is written.
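For case 1, this is roughly what I have in mind: split on silence with pydub and run each clip through CMUSphinx via the SpeechRecognition package. A minimal sketch, where the file names and silence thresholds are made up and would need tuning per recording:

```python
# Sketch for case 1 (no transcripts): split a clean recording at silences,
# then get a rough first-pass transcript per clip with CMUSphinx.
# Requires: pydub, SpeechRecognition, pocketsphinx (and ffmpeg for pydub).
import os

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence

SOURCE_WAV = "elvis_interview.wav"  # hypothetical input file
OUT_DIR = "clips"
os.makedirs(OUT_DIR, exist_ok=True)

audio = AudioSegment.from_wav(SOURCE_WAV)

# Split wherever the signal stays below -40 dBFS for at least 700 ms;
# keep 200 ms of padding so words are not clipped at the edges.
chunks = split_on_silence(
    audio,
    min_silence_len=700,
    silence_thresh=-40,
    keep_silence=200,
)

recognizer = sr.Recognizer()
for i, chunk in enumerate(chunks):
    clip_path = os.path.join(OUT_DIR, f"clip_{i:04d}.wav")
    chunk.export(clip_path, format="wav")
    with sr.AudioFile(clip_path) as source:
        clip_audio = recognizer.record(source)
    try:
        # PocketSphinx returns a rough, unpunctuated transcript.
        text = recognizer.recognize_sphinx(clip_audio)
    except sr.UnknownValueError:
        text = ""  # flag this clip for manual transcription
    print(f"{clip_path}|{text}")
```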

How would you approach each of those cases to get the best preparation for Tacotron 2?

There are a number of methods for transcribing audio, and CMUSphinx (or another STT tool) might help a little. In the end you still have to review each transcription to check that it matches EXACTLY the spoken phrase in the audio file.
And you need clean audio recordings (no background noise) and consistent tonality. I don’t know what exactly you have in mind for the “Elvis” voice, but using audio clips from live recordings or movies will most likely not get you a good dataset.
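For that review step, a minimal sketch of what I mean, assuming an LJSpeech-style metadata.csv (clip_id|transcript) with the clips in a wavs/ folder; the file names and prompt are illustrative:

```python
# Manual review loop: play each clip, show its transcript, and collect
# mismatches for later correction. pydub playback needs simpleaudio,
# pyaudio, or ffplay to be installed.
from pydub import AudioSegment
from pydub.playback import play

with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        clip_id, _, text = line.rstrip("\n").partition("|")
        print(f"\n{clip_id}: {text}")
        play(AudioSegment.from_wav(f"wavs/{clip_id}.wav"))
        if input("Matches exactly? [y/n] ").strip().lower() != "y":
            with open("mismatches.txt", "a", encoding="utf-8") as out:
                out.write(line)
```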

The wiki section on datasets might give you some ideas about what is required:


I like your reply. Very helpful. Can you elaborate a bit on the methods that are most promising?

Going through half an hour of audio might be feasible, but 20 hours? Maybe if you delegate to 40 people…

This document from Microsoft gives some general points about the challenges of preparing samples for a custom voice when using specifically recorded voice samples. Clearly it’s written with the idea of using their voice services, but the principles are fairly applicable to any TTS approach:

In cases where you’ve at least got clear audio and some form of transcript, it may be possible to use what’s called “forced alignment”: this takes the transcript and tries to figure out which sections of the audio correspond best to the text. There are several tools that do this; one I’ve tried with good results is Aeneas.
But the challenges @dkreutz mentions still remain: even after a forced aligner has helped align the audio and text, you still need to verify it, and that’s the tough part!
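To make that concrete, here is a minimal Aeneas sketch (essentially the example from its README). The paths are placeholders, and the transcript is assumed to be plain text with one fragment (e.g. one sentence) per line:

```python
# Forced alignment with aeneas: maps each line of a plain-text transcript
# to begin/end times in the audio and writes the result as a JSON sync map.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config_string = "task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/abs/path/interview.wav"    # placeholder
task.text_file_path_absolute = "/abs/path/transcript.txt"    # placeholder
task.sync_map_file_path_absolute = "/abs/path/syncmap.json"  # placeholder

ExecuteTask(task).execute()
task.output_sync_map_file()
```

The sync map then gives you begin/end times per fragment, which you can use to cut the recording into per-sentence clips, subject to the verification caveat above.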

An example I came across recently, like your point 3, is that dates can be read in different styles. Even though I had a good alignment in general, there was sometimes no way to tell from the text alone how the speaker had chosen to say a date: sometimes they said it in an American way (e.g. “May 4th 2004” as “May fourth, two thousand four”) and other times in a more English manner (“May the fourth, two thousand and four”). My solution was to use grep to identify samples with dates, then listen to a load of them manually and adjust where the spoken words were different from how my text normalisation had been applied. As a compromise I went for the easiest cases only, and I may refine the dataset further in this manner later, now that I know it gets passable results.
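To illustrate that kind of check, here is a Python sketch of the same idea as the grep (not the exact command I used). It assumes an LJSpeech-style metadata.csv, and the regex is deliberately simple, catching only “May 4th 2004”-style dates:

```python
# Find transcript lines containing a "<Month> ... <year>" pattern so the
# corresponding clips can be listened to and checked by hand.
import re

MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)
date_pattern = re.compile(rf"\b({MONTHS})\b.{{0,12}}\b\d{{4}}\b")

with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        clip_id, _, text = line.rstrip("\n").partition("|")
        if date_pattern.search(text):
            # Listen to this clip and confirm the normalised text matches
            # the style the speaker actually used.
            print(clip_id, text)
```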

A guide I found helpful on this topic is here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The stated aim there is a speech recognition dataset, but if you’re using good enough quality audio, the same approach can work well for preparing a TTS training set. Bear in mind that “found” audio will be harder (possibly impossible) to use to get the best quality voice models from, although it’s clearly a more accessible option than hiring professional voice talent.


I don’t know what exactly you have in mind for the “Elvis” voice

This one, with low background noise given the recording equipment, 44 minutes: “Elvis interview; July 1972 - unknown location” (YouTube)

That will give you roughly 40 minutes of a mumbled voice. I doubt that this will be enough to train a good model with Tacotron. For comparison, the LJSpeech and Thorsten-DE datasets each have 20+ hours of high-quality audio.

My personal opinion: the speaking voice of Elvis isn’t that charismatic; it is his singing voice that made him successful. :wink:

See it as a challenge.