Tutorial: Training a Dutch model

abbycabs · July 8, 2020, 3:05pm

Hi all!

I found some time to play around with training a model. I put together this tutorial on how I trained a Dutch model with DeepSpeech and Common Voice using Google Colab’s free resources:

https://colab.research.google.com/drive/12AVZWydY07O2k2eGRlri4pwPMJibs_0v

This is mostly a re-creation of the existing DeepSpeech documentation on training a model / creating a scorer. I used comments in Colab to indicate whenever I had to add or change a step from what was in the docs.

I know this isn’t the most accurate model & I ran into a few problems! Would love any advice/suggestions to improve results. But I’m still impressed at the results of open source, open data and the free hardware on Colab.

As the Mozilla Foundation focuses more on trustworthy AI, I’d love to do more with DeepSpeech and support the work you’re doing. Happy to put together other tutorials or update docs when appropriate based on the comments in Colab. Let me know how I can help!

lissyx · July 8, 2020, 4:06pm

// NOTE: I couldn’t get lm_optimizer.py to run, but the scorer was good enough to move on to the next step.

You need an existing model to use it, and in your case no checkpoint exists yet; this is exactly what the error message states.
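To make that concrete, here is a minimal sketch of a guard you could run in a Colab cell before calling lm_optimizer.py. It assumes the checkpoint directory holds a TensorFlow-style index file named `best_dev_checkpoint` or `checkpoint`; the helper names and the directory you pass in are hypothetical, so adjust them to your setup.

```python
from pathlib import Path

# Assumption: a usable checkpoint directory contains one of these index files.
INDEX_FILES = ("best_dev_checkpoint", "checkpoint")


def find_checkpoint_index(checkpoint_dir):
    """Return the first checkpoint index file found in checkpoint_dir, or None."""
    directory = Path(checkpoint_dir)
    for name in INDEX_FILES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def can_run_lm_optimizer(checkpoint_dir):
    """True only if a trained checkpoint appears to exist in checkpoint_dir."""
    return find_checkpoint_index(checkpoint_dir) is not None
```

If `can_run_lm_optimizer(...)` returns False, train the acoustic model first (or point at the right directory) before trying to tune the scorer.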

You only have training data from Common Voice, right? And Dutch only has ~7h usable? I see it’s last December’s release; the one from June should have ~45h.

It could be interesting to perform transfer learning from the English model using that data.
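A rough sketch of what such a run could look like, written as a small Python helper that only builds the command line (it does not execute anything). The flags are the ones I recall from the transfer-learning section of the DeepSpeech training docs, so please check them against the version you are using; every path below is a hypothetical placeholder.

```python
import shlex


def transfer_learning_command(english_ckpt_dir, dutch_ckpt_dir, alphabet_path,
                              train_csv, dev_csv, test_csv, drop_layers=1):
    """Build the argument list for a fine-tuning run; nothing is executed here."""
    return [
        "python3", "DeepSpeech.py",
        # Re-initialise the last layer(s), since the Dutch alphabet differs.
        "--drop_source_layers", str(drop_layers),
        "--alphabet_config_path", alphabet_path,
        # Start from the released English checkpoint, save Dutch weights elsewhere.
        "--load_checkpoint_dir", english_ckpt_dir,
        "--save_checkpoint_dir", dutch_ckpt_dir,
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--test_files", test_csv,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = transfer_learning_command(
        "checkpoints/english", "checkpoints/dutch", "data/alphabet-nl.txt",
        "data/nl/train.csv", "data/nl/dev.csv", "data/nl/test.csv",
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the load and save directories separate means the English checkpoint stays untouched if you want to retry with different settings.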

But it’s very nice to see it working well on Colab, given that people have been asking for it.

abbycabs · July 8, 2020, 5:42pm

I forgot about the Common Voice release! Will try with the larger dataset – thanks for the reminder.

You need an existing model to use it, and in your case no checkpoint exists yet; this is exactly what the error message states.

I tried running lm_optimizer again after training the model, but the checkpoints didn’t load properly. I’ll try again when I re-run this with the new CV data & share results here!

Are you able to see the comments I added on the right in Colab on the steps? I’m new to Colab and not sure how commenting works!

SanderE · July 9, 2020, 12:19pm

It might also help to put the hot topic of fresh “stroopwafels” on the European Parliament agenda; it would probably help the language model a lot in this specific case.
What could also help is scraping the Dutch Wikipedia for sentences and using those for the language model (there is a nice article about them, not good for the sugar cravings).
Will see if I can get that done in some of my free time!

abbycabs · July 9, 2020, 1:00pm

Thanks @SanderE! I did download the Dutch Wikipedia data, but it was formatted in XML and I didn’t feel like reformatting it. Let me know if you make progress on this! Sounds like a good next step.
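For anyone who wants to try the reformatting, here is a very rough sketch of pulling sentences out of a MediaWiki XML dump using only the Python standard library. It assumes the usual dump layout with `<page>` and `<text>` elements; the function names are made up for illustration, and the markup cleanup is deliberately crude (nested templates, tables and references are not handled), so a dedicated extractor will do much better.

```python
import re
import xml.etree.ElementTree as ET


def iter_article_texts(xml_source):
    """Yield the raw wikitext of each <text> element in a MediaWiki XML dump."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        # Tags carry a namespace like "{http://...}text"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text" and elem.text:
            yield elem.text
        if local_name == "page":
            elem.clear()  # drop the contents of pages already processed


def crude_plain_text(wikitext):
    """Very rough markup removal; a real extractor handles far more cases."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # simple, non-nested templates
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote marks
    return text


def iter_sentences(xml_source, min_words=3):
    """Yield naive sentence candidates, split on ., ! or ? followed by whitespace."""
    for wikitext in iter_article_texts(xml_source):
        for sentence in re.split(r"(?<=[.!?])\s+", crude_plain_text(wikitext)):
            sentence = sentence.strip()
            if len(sentence.split()) >= min_words:
                yield sentence
```

`xml_source` can be a file path or an open binary file object. The output would still need filtering (numbers, abbreviations, characters outside the alphabet) before it is useful for building a scorer.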

I love stroopwafels, but you make a very good point.

lissyx · July 9, 2020, 1:45pm

Maybe you should have a look at the Common Voice Wikipedia Extractor project to get hints there?

lissyx · July 9, 2020, 1:46pm