Hello,
I hope I am posting this to the right place. I am a DevOps engineer who started delving into the world of Deep Learning over the last month or two, and I have been trying to set up a server running DeepSpeech in AWS. After a lot of troubleshooting, I was able to get DeepSpeech running (without training). This is where things got confusing for me. So that you can understand what I am experiencing, I'll list the steps I took for training below:
- I imported the Librivox data set using the import_librivox.py script; it appeared to be the largest data set available and the one DeepSpeech was tested against.
- Once the data was imported, I started training with the run-librivox.sh script and backgrounded it, since I knew it would take a while and didn't want to babysit it. I logged all output to a file that I could review in case of errors. (Roughly the commands I used are sketched after this list.)
- Because training was slow on our initial instance size, I had to stop training and resize the server on a number of occasions to decrease training time.
- At one point I even had to move to a completely new server so I could use the correct AMI. This required transferring all data, checkpoints, and other needed files to the new server, after which I restarted the training process there.
NOTE: Sometimes restarting the server and then restarting training would increase the epoch count, and changing the instance class type would sometimes reset the epoch all the way back to 1. This was very confusing and alarming, and made me wonder whether there was an issue with the training.
- Once training finally completed, I exported an output graph so I could use it for running inferences.
- Finally, I ran inferences. Running inference on a test file I was given by my employer, however, revealed serious accuracy issues: the result wasn't even remotely close to the actual transcript.
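For reference, here is roughly what I ran at each step. The paths, log file name, and directory layout are just examples reconstructed from memory, so treat this as a sketch rather than an exact record:

```bash
# Import the corpus (this downloads and converts it, so it takes a while).
python -u bin/import_librivox.py ./data/librivox

# Start training in the background so it survives my SSH session ending,
# capturing stdout and stderr in a log file for later review.
nohup ./bin/run-librivox.sh > training.log 2>&1 &

# Check on progress without attaching to the process.
tail -f training.log
```

After training completed, I exported the output graph by pointing DeepSpeech.py at the same checkpoint directory with the `--export_dir` flag, mirroring the other flags that run-librivox.sh passes.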
I realized that the issue was either with the training and how I performed it, or with the audio file itself, which contained background noise and an audio glitch at the beginning. It's worth noting that the file was a recording of two people speaking to each other.
To investigate the accuracy issues and the other oddities I encountered, I read the documentation in the GitHub repo, searched existing questions here on Discourse, ran Google searches for possible solutions, etc.
I tried running a one-shot inference to see whether the exported output graph had somehow been corrupted; it produced exactly the same result as the exported graph. I then made sure the audio had been properly converted from MP3 to WAV (the conversion commands are sketched below); the result was again unchanged.
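For the format check, I forced the conversion parameters explicitly, since DeepSpeech expects 16 kHz, 16-bit, mono WAV input. The file names here are placeholders:

```bash
# Convert the MP3 to 16 kHz, 16-bit, mono WAV with sox...
sox test_file.mp3 -r 16000 -b 16 -c 1 test_file.wav

# ...or equivalently with ffmpeg (16-bit PCM is its default for WAV).
ffmpeg -i test_file.mp3 -ar 16000 -ac 1 test_file.wav

# Sanity-check the resulting header.
soxi test_file.wav
```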
Here is the result I got from our test file:
but at tereisorhaidamandoyesisodaynathetousmotthatuwsthegarktaniyetwalrcaspolkthantanaturalbytohisface joyingitverycoolveryule did you get that after a garventigotthatattesipiredmenthe y forty to gottarpegoten isersweethituastwotcaste and be firneyouralyniceanthatthatmygoatyoussugarditike exttibebolntecoletohisanmay it is could be a pentovaniathisnactonalla yhisvathesagane e as the war is so peter and joe and may be dan i othavemylalwafhetrso’getopocularbi
I also tried testing other audio files of similar or shorter length (that were not part of the training data set) and found that files with less background noise and more clearly enunciated speech gave better (but still not perfect) results. These files also featured just one speaker rather than the two in my original test file.
So, with no luck tracking down the issue, I'm turning to Discourse for help.
With all of this in mind, and given that I am very new to machine learning, I am left with the following questions:
- Do the language model and trie files need to be built from the same data set that you train on? My understanding is that vocab.txt is used to create both, and that vocab.txt is made up of the transcripts from the training data set.
- When you pause training and pick it up again, does it truly pick up where you left off?
- Right now, to pause training, I just kill the process. Is this the correct way to do it?
- Are the checkpoints representative of the network's training? Can they be transferred from one machine to another to preserve training progress?
- If you transfer checkpoints from one machine to another, can you resume training properly on the new machine? (My transfer-and-resume procedure is sketched after this list.)
- Can audio artifacts (background noise, ambient sound, accents, audio glitches, etc.) affect inferences and their accuracy? If so, how can we increase accuracy in cases like this?
- How does training actually work under the hood?
- How do inferences work under the hood?
- Is it accurate to say that the more data in your data set, the more accurate the inferences will be?
- What are the best practices for performing training? (I don't want to babysit the server during training.)
- How do you set up a cluster of machines? Since training takes so long (even on a massive machine with 16 GPUs), I want to set up a cluster to make the process faster. The documentation shows how to create a local cluster on the master machine, but not a cluster of multiple remote hosts.
- Does training work best when run uninterrupted?
- Since we use AWS, do we need to refrain from changing instance class types after pausing training?
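For context on the checkpoint questions above, this is roughly how I transferred things between machines and resumed. The hostnames and paths are made up for illustration, and I'm assuming the wrapper script picks up the same checkpoint directory on the new machine:

```bash
# Copy the checkpoint directory and the imported data to the new server.
rsync -az ~/deepspeech/checkpoints/ new-server:~/deepspeech/checkpoints/
rsync -az ~/deepspeech/data/ new-server:~/deepspeech/data/

# On the new server, restart training the same way as before, expecting it
# to resume from the latest checkpoint rather than from scratch.
nohup ./bin/run-librivox.sh > training.log 2>&1 &
```

Is this a sane way to pause, move, and resume, or is killing the process and copying checkpoints exactly what causes the epoch weirdness I described above?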
I know there is a lot of information and a lot of questions here, but I wanted to make sure you had as much context as possible. Any help you can provide with the accuracy issues and the questions above would be greatly appreciated. If you need any additional info, please let me know. Thanks so much in advance for your help.