Using Common Voice data with DeepSpeech

Would I be right in thinking that the publicly captured Common Voice data will at some point be used to train models in Mozilla’s DeepSpeech library?

I’ve been able to get Common Voice working locally myself and just recently managed to get the basic training example in DeepSpeech running successfully (on a GPU to boot), so I was thinking I’d take a look at how to wrangle the Common Voice data into the right form to use with DeepSpeech for training.

Is there a plan to do this kind of thing within the Common Voice or the DeepSpeech repos? (or perhaps neither?)

My guess (optimistically!) is that this may not be too hard, but I thought I’d see whether it was on the cards or even already under way?

BTW: what I’m suggesting is basically as described in here:

so it seems like it’s a matter of getting the AWS data out of my S3 bucket, downloading it locally and then generating a CSV listing the files and their corresponding transcript text

Thanks for the question @nmstoker!

We absolutely plan to use the Common Voice data with Mozilla’s DeepSpeech engine. Our goal is to release the first version of this data by the end of the year, in a format that makes it easy to import into projects like DeepSpeech.

While this is certainly in the cards, we haven’t started this process yet. Perhaps we can enlist your help once we pick up this work in earnest (probably in the November timeframe)?

That’s great, I would be delighted to help if I can @mhenretty

With a slightly hacky combo of AWS CLI and adapting from the existing import and run scripts I’ve managed to put together something that did the trick. Of course something more polished and straight-through in nature would be better, but it’s a start!

@nmstoker I am also trying to use Common Voice to train Deep Speech – can you please post here for reference how you were able to do this?

Certainly Nikhil (but please don’t judge my code too harshly :wink: )

The steps are:

  1. Get the files down from AWS to somewhere local
  2. Run the import script to convert them from .mp3 to .wav and generate the .csv files
  3. Run the training script

For 1, I used AWS CLI:

You need to set up your credentials so that the CLI stores them locally. Then you can just navigate to a download folder and run something equivalent to this:

aws s3 sync s3://your-voice-web-bucket .

You’ll see a whole load of your files download (very quickly if your experience is anything like mine)

For 2 I used a script I’d cobbled together mainly from the other import scripts. The gist is here:

You run something equivalent to:

python ../your-local-voice-web-bucket-folder/ ./data

That walks your local bucket folder and goes through the paired-up Common Voice transcripts and mp3 files, cleaning up the text of the former and converting the latter into .wav files in a data folder. It then creates a .csv file for each of training, dev and test (in that same data folder).
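For anyone who wants the rough shape of that import step without opening the gist, here is a minimal sketch. It is not the actual script: it assumes each transcript is a .txt file sitting next to an .mp3 with the same base name, it assumes ffmpeg for the conversion (only the command is built, nothing is run), and all function names are made up for illustration. The CSV header (wav_filename, wav_filesize, transcript) is the one DeepSpeech's training CSVs use.

```python
import csv
import random
import re
from pathlib import Path


def clean_transcript(text):
    """Lower-case the text and keep only letters, apostrophes and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def find_pairs(bucket_dir):
    """Return (mp3_path, cleaned_transcript) for each .txt with a matching .mp3.

    Assumption: transcript and audio share a base name (clip.txt / clip.mp3).
    """
    pairs = []
    for txt_path in sorted(Path(bucket_dir).rglob("*.txt")):
        mp3_path = txt_path.with_suffix(".mp3")
        if mp3_path.exists():
            transcript = clean_transcript(txt_path.read_text(encoding="utf-8"))
            if transcript:
                pairs.append((mp3_path, transcript))
    return pairs


def ffmpeg_command(mp3_path, wav_path):
    """Build (but do not run) a command producing 16 kHz mono WAV.

    Assumption: ffmpeg is installed; the real script may use another tool.
    """
    return ["ffmpeg", "-y", "-i", str(mp3_path),
            "-ar", "16000", "-ac", "1", str(wav_path)]


def split_rows(rows, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and split them into train/dev/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = int(len(rows) * dev_fraction)
    n_test = int(len(rows) * test_fraction)
    return {
        "dev": rows[:n_dev],
        "test": rows[n_dev:n_dev + n_test],
        "train": rows[n_dev + n_test:],
    }


def write_csv(csv_path, rows):
    """Write rows of (wav_filename, wav_filesize, transcript) with a header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        writer.writerows(rows)
```

To use it you would run each ffmpeg command (for example with subprocess.run), take the size of each resulting .wav with os.path.getsize to fill in wav_filesize, then pass the rows through split_rows and write_csv once per split. The 10% dev and test fractions are arbitrary choices here.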

NB: one problem with my bucket is a handful of transcript files without corresponding .mp3 files. I should clean them up properly, but for now I just delete those transcripts after I sync.
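If you would rather spot those orphaned transcripts automatically than hunt for them by hand, something along these lines might do. It is only a sketch, again assuming transcripts are .txt files that share a base name with their .mp3, and the function name is hypothetical.

```python
from pathlib import Path


def orphan_transcripts(bucket_dir):
    """List .txt files under bucket_dir that have no sibling .mp3 file.

    Assumption: a transcript clip.txt belongs to an audio file clip.mp3
    in the same folder.
    """
    return [
        txt_path
        for txt_path in sorted(Path(bucket_dir).rglob("*.txt"))
        if not txt_path.with_suffix(".mp3").exists()
    ]
```

It only reports the paths, so you can review the list before deleting anything (calling unlink() on each path would remove them).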

For step 3 I run this script which is based on the other examples provided:

So far I’m getting fairly good results but I need to create more Common Voice records (I’ve done about 1,800 or so) and I’ve no doubt got lots to learn about how best to tweak the DeepSpeech settings

I hope that helps - it’s a start, but there’s a lot that could be improved (easily!) Big thanks to the Mozilla teams for making both Common Voice and DeepSpeech so awesome!! :smile:


Did a quick video of the steps above in case it’s helpful


This is an amazing start @nmstoker!!! You’ve really given us a leg up when we start our integration (which we will be working on in November). Thank you for this!!!


Thanks so much for the instructions @nmstoker.

One thing I would say to people who are reading this looking for ways to train DeepSpeech is to look into using the built-in mechanisms to train the model. The bin/librivox script, for example, will fetch 55GB of audio and transcriptions from a variety of audiobooks and train the model using that. There is also a bin/voxforge script that will download about 6GB of audio data and train the model on that.

Hi @sujithvoona2 - I don’t know whether there is a way to do what you propose but it’s worth searching over the forum and probably looking for anyone doing this general kind of thing (unrelated to Common Voice) on StackOverflow or even Googling it.

Also generally when posting it’s best if you indicate what you’ve tried already (this avoids repeating things you’ve already tried and furthermore, avoids giving any impression that you might be lazy and just asking someone else to do your work for you!)

Best of luck finding a solution! :slight_smile:

If you do find a way to do it, it would be great to post details here so it is shared with others.