DeepSpeech Accuracy Issues and Newbie Questions

ljohnson · August 15, 2018, 2:01pm

Hello,

I hope I am posting this to the right place. I am a dev ops engineer and have just started delving into the world of Deep Learning over the last month or two and have been trying to set up a server running DeepSpeech in AWS. After a lot of troubleshooting, I was able to get DeepSpeech running (without training). This is where things got confusing for me. So that you can understand what I am experiencing, I’ll list the steps I took for training below:

I imported the Librivox data set (this appeared to be the largest data set and appeared to be what DeepSpeech was tested on) using the import_librivox.py script.
Once the data was imported, I started the training process using the run-librivox.sh script and backgrounded it as I didn’t want to babysit it since I knew it would take a while. I logged all output to a log file that I could view in case there were errors.
Due to the size of our system, I had to stop the training and resize our server on a number of occasions to decrease training time.
At one point I even had to move to a completely new server so I could use the correct AMI. This required transferring all data, checkpoints, and other needed files to the new server. After which, I restarted the training process on the new server.

NOTE: sometimes restarting the server and then restarting the training would result in the epoch being increased. If I changed the instance class type sometimes the epoch would get completely reset back to 1. This was very confusing and alarming and made me wonder if there was an issue with the training.

Once training finally completed, I exported an output graph so I could use it for running inferences.
Then I ran inferences. Running inferences on a test file I was given by my employer, however, resulted in serious accuracy issues (the result wasn’t even remotely close to the actual transcript).

I realized that the issue was either with the training and how I performed it, or the issue was with the audio file, which contained background noise, and an audio glitch at the beginning. It’s worth noting that the audio file was a recording to two people speaking to each other.

To investigate the accuracy issues and any other issues I encountered, I checked the documentation in the GitHub repo, checked existing questions here on Discourse, ran Google searches for possible solutions, etc.

I tried running a one-shot inference to see if the exported output graph was corrupted somehow. This produced the exact same result as the output graph. I then tried ensuring that the audio had been properly converted from mp3 format to wav. This produced the same aforementioned result.

Here is the result I got from our test file:

but at tereisorhaidamandoyesisodaynathetousmotthatuwsthegarktaniyetwalrcaspolkthantanaturalbytohisface joyingitverycoolveryule did you get that after a garventigotthatattesipiredmenthe y forty to gottarpegoten isersweethituastwotcaste and be firneyouralyniceanthatthatmygoatyoussugarditike exttibebolntecoletohisanmay it is could be a pentovaniathisnactonalla yhisvathesagane e as the war is so peter and joe and may be dan i othavemylalwafhetrso’getopocularbi

I also tried testing other audio files of similar or shorter length (that were not part of the training data set) and found that files in which there was less background noise and in which speech was more enunciated had better (but not perfect) results. Also, these new files were just one person speaking instead of two like my original test file.

So with no luck in tracking down the issue, I turn to Discourse for help.

All of this in mind and given that I am very new to Machine Learning, I am left with the following questions:

Do the language model and trie files need to be built from the same dataset that you train on? It is my understanding that the vocab.txt is used to create these and that the vocab.txt is comprised of transcripts from dataset used for training.
When you pause training and pick it up again does it truly pick up where you left off?
Right now, to pause training, i just kill the process. Is this the correct way to do this?
Are the checkpoints representative of the network’s training? Can they be transferred from one machine to another to preserve training?
If you transfer checkpoints from one machine to another, can you resume training properly on the new machine?
Can audio artifacts (background noise, ambient sound, accents, audio glitches, etc) affect inferences and their accuracy? If so, how can we increase accuracy in cases like this?
How does training actually work under the hood?
How do inferences work under the hood?
Is it accurate to say that the more data in your data set the more accurate the inferences will be?
What are the best practices for performing training? (I don’t want to babysit the server during training)
How do you set up a cluster of machines? Since training takes so long (even with a massive machine that 16 GPUs) I want to set up a cluster of machines to make the process faster. The documentation shows how to create a local cluster on the master machine, but not a cluster of multiple remote hosts.
Does training work best when run uninterrupted?
Since we use AWS, do we need to refrain from changing instance class types after pausing training?

I know there is a lot of information and questions here, but I wanted to make sure that you had as much info as possible. Any help you can provide with regard to the accuracy issues and questions I have added above would be greatly appreciated. If you need any additional info, please let me know. Thanks so much in advance for your help.

ljohnson · August 17, 2018, 12:39pm

Hello,

Wanted to reach back out as I had not yet seen any replies. If anyone has any ideas or can answer any of the questions in my post above, that would be greatly appreciated. Thanks.

reuben · August 17, 2018, 1:26pm

I’m gonna try to take this question by question:

They don’t, generally you’ll get a better language model by using much larger datasets of pure text than just the text available in the transcriptions of a speech dataset.

The --checkpoint_dir flag specifies where checkpoints are kept. If the directory containing checkpoints is preserved between training runs, training will continue. If you don’t specify it, it’ll point to ~/.local/share/deepspeech/checkpoints.

AFAIK it’s the only way to do it.

Yes and yes.

Yes. The epoch number might get messed up if the training data is changed between two runs.

Yes, definitely. You can improve it by training on data that closely matches the distribution of the data you plan to do inference on. If you want to transcribe noisy speech, train on noisy speech, if you want to transcribe conversational speech, train on conversational speech, etc.

That’s quite a broad question, could you be more specific?

To an approximation, yes. It’s more about the diversity of the data than just the quantity.

You shouldn’t have to babysit the training process, but it’s not a fire-and-forget thing, it’s very much a process of trying things, tweaking hyperparameters, training again, etc, in search of a good model. Settings that worked for a given dataset might not work for a different dataset.

If you look at the .compute script on current master, it should give you an idea of how the process works. You start a parameter server process on one node, and a worker process on all nodes, and specify the addresses of the nodes in the appropriate flags. AFAIK we don’t have a tutorial or more involved guide, unfortunately.
[/quote]

It shouldn’t make any difference.

I don’t know anything about AWS, so can’t help you here.

ljohnson · August 20, 2018, 8:58pm

@reuben,

Thanks so much for your response. The information you provided is a monumental help.

Let me be more specific as to my goal here. Perhaps the two questions asked here were too open ended. I thought that if I understood how DeepSpeech works a little better (how it uses the output graph and audio file to determine what said audio file’s transcript is, for example) it would help me to track down the accuracy issues I’m seeing.

Specifically, I want to understand what would cause the long strings of seemingly random/psuedo-random characters with no spaces. As you can see from the above output I’m getting a few of them. Apart of from the audio file itself, is there anything that could cause this? I read in another thread here on Discourse, linked here, that this may be a known issue with DeepSpeech, but I want to know if there is anything I can check either when doing training (like settings, for example) or when doing inferences that can help alleviate this issue.

Thanks again for your help.

ljohnson · August 22, 2018, 12:36pm

Hello,

I just wanted to follow up as I haven’t seen a reply to my most recent post above in a couple of days. My goal with resolving the accuracy issues is to get to a point where we have the same level of accuracy as shown in the Baidu research paper. Is there enough audio data available in the sets provided by DeepSpeech to get us to that point? Is the audio data available in the sets provided by DeepSpeech diverse enough to get us to that point? If not, are the sets intended more to be starting points where we are expected to add our own data and retrain periodically?

Any help you can provide with the accuracy issues would be greatly appreciated. Thanks.

reuben · August 22, 2018, 1:20pm

If you’ve read the Baidu paper, you know how it works

The long strings of words glued together is a known issue: https://github.com/mozilla/DeepSpeech/issues/1156

We’re working on it.

robertritz · August 23, 2018, 3:02am

I enjoy the spoken word sometimes to help digest information. This talk from Adam Coates at Baidu is a good overview of DeepSpeech, how it works, and tips and tricks for getting it to work.