Would I be right in thinking that the publicly captured Common Voice data will at some point be used to train models in Mozilla’s DeepSpeech library?
I’ve been able to get Common Voice working locally myself, and just recently managed to get the basic training example in DeepSpeech running successfully (on a GPU, to boot), so I was thinking I’d look at how to wrangle the Common Voice data into the right form to use for training with DeepSpeech.
Is there a plan to do this kind of thing within the Common Voice or DeepSpeech repos? (Or perhaps neither?)
My guess (optimistically!) is that this may not be too hard, but I thought I’d see whether it was on the cards or even already under way?
We absolutely plan to use the Common Voice data with Mozilla’s DeepSpeech engine. Our goal is to release the first version of this data by the end of the year, in a format that makes it easy to import into projects like DeepSpeech.
While this is certainly in the cards, we haven’t started this process yet. Perhaps we can enlist your help once we pick up this work in earnest (probably in the November timeframe)?
That’s great, I’d be delighted to help if I can @mhenretty
With a slightly hacky combination of the AWS CLI and adaptations of the existing import and run scripts, I’ve managed to put together something that does the trick. Of course something more polished and straight-through in nature would be better, but it’s a start!
It walks your local bucket folder, going through the paired-up Common Voice transcripts and mp3 files, cleaning up the text of the former and converting the latter into .wav files in a data folder, then creating a .csv file for each of training, dev, and test (in that same data folder).
NB: one problem with my bucket is a handful of transcript files without corresponding .mp3 files; I should clean them up properly, but for now I just delete those transcripts after each sync.
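In rough outline, the pipeline looks something like the sketch below. This is a minimal illustration rather than my actual script: the function names are mine, the transcript "clean-up" here is just lowercasing/stripping, and the mp3-to-.wav conversion step (which needs an external tool such as sox or ffmpeg) is only indicated by a comment. The CSV columns are the ones DeepSpeech’s importers expect (wav_filename, wav_filesize, transcript).

```python
import csv
import random
from pathlib import Path

def pair_clips(root):
    """Pair each transcript (.txt) with its .mp3 by shared stem; skip orphans."""
    pairs = []
    for txt in Path(root).rglob("*.txt"):
        mp3 = txt.with_suffix(".mp3")
        if mp3.exists():  # skip transcripts without audio (see the NB above)
            transcript = txt.read_text().strip().lower()  # minimal clean-up
            pairs.append((mp3, transcript))
    return pairs

def split_sets(pairs, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically and carve off dev and test fractions."""
    random.Random(seed).shuffle(pairs)
    n_dev = int(len(pairs) * dev_frac)
    n_test = int(len(pairs) * test_frac)
    dev, test = pairs[:n_dev], pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test

def write_csv(path, rows):
    """Write one DeepSpeech-style CSV: wav_filename, wav_filesize, transcript."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for mp3, transcript in rows:
            # The .wav is produced from the .mp3 by an external converter
            # (e.g. sox/ffmpeg) in a separate step, not shown here.
            wav = mp3.with_suffix(".wav")
            size = wav.stat().st_size if wav.exists() else 0
            writer.writerow([wav, size, transcript])
```

With those pieces, the driver is just `pair_clips` over the synced bucket folder, `split_sets` on the result, and one `write_csv` call each for train, dev, and test.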
One thing I would say to people reading this looking for ways to train DeepSpeech is to look into using the built-in mechanisms to train the model. For example, the bin/librivox script will fetch 55GB of audio and transcriptions from a variety of audiobooks and train the model on that. There is also a bin/voxforge script that will download about 6GB of audio data and train the model on that.
Hi @sujithvoona2 - I don’t know whether there is a way to do what you propose, but it’s worth searching the forum, looking for anyone doing this general kind of thing (unrelated to Common Voice) on Stack Overflow, or even Googling it.
Also, when posting it’s generally best to indicate what you’ve already tried: this saves people from suggesting things you’ve already ruled out, and avoids giving the impression that you’re just asking someone else to do your work for you!
Best of luck finding a solution!
If you do find a way to do it, it would be great if you could post the details here so they’re shared with others.