Need a DeepSpeech for dummies tutorial

Hi all,
I am new to DeepSpeech and I want to train a model on my free spoken digits dataset. I found this tutorial, “TUTORIAL : How I trained a specific french model to control my robot”, for training with your own data, but I have the following questions:

  1. Where do I place my dataset? Should it be placed under the deepspeech/data folder, or anywhere else? You can find my dataset at this GitHub link (https://github.com/Jakobovski/free-spoken-digit-dataset)
  2. What should the vocabulary.txt file look like?
  3. If we split the whole dataset into train, dev and test, where should I put the vocabulary.txt file?
  4. What is an ARPA file and why do we need it to build the LM?
  5. I have DeepSpeech installed inside a Linux virtual machine on my PC and I do not have GPU support on my device. Will DeepSpeech training work for my small dataset?
    I have many questions like these.
    Basically, like in Kaldi, I need a “DeepSpeech for Dummies” tutorial.

Wherever you want, since you can pass --train_files and the other path arguments.
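For your digits case, the files you pass are just CSV manifests with three columns: wav_filename, wav_filesize, transcript. Here is a minimal sketch of how one could be generated from the free-spoken-digit-dataset recordings (the paths and the output name are placeholders; the repo’s files are named like 7_jackson_12.wav, with the spoken digit first):

```python
import csv
from pathlib import Path

# Assumed locations -- adjust to wherever you cloned the dataset.
RECORDINGS = Path("free-spoken-digit-dataset/recordings")
OUTPUT = Path("train.csv")

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

with OUTPUT.open("w", newline="") as out:
    writer = csv.writer(out)
    # DeepSpeech training manifests use exactly these three column names.
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav in sorted(RECORDINGS.glob("*.wav")):
        digit = int(wav.name.split("_")[0])  # leading character is the spoken digit
        writer.writerow([str(wav.resolve()), wav.stat().st_size, WORDS[digit]])
```

Split the rows into separate train/dev/test CSVs however you like and pass them with --train_files, --dev_files and --test_files. Also note that those recordings are 8 kHz while DeepSpeech expects 16 kHz audio by default, so you will probably need to resample them (or adjust the training sample-rate setting).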

Please look at the content of data/lm.
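For a digits-only model, the text corpus behind the language model can be tiny. Just as a sketch (the filename is whatever you then feed to KenLM):

```python
from pathlib import Path

# One "sentence" per line; for a digits task each sentence is a single word.
words = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
Path("vocabulary.txt").write_text("\n".join(words) + "\n")
```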

This question makes no sense to me; there’s no intersection between the dataset and the vocabulary file.

This is at the KenLM level; you just have to build it as part of the process, but you won’t need it afterwards.
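As a rough sketch of that step, assuming you have built KenLM and that lmplz and build_binary are on your PATH (the order and flags here are only illustrative for such a tiny corpus):

```python
import subprocess

# Estimate an ARPA-format n-gram model from the text corpus.
# --discount_fallback helps lmplz cope with a corpus this small.
with open("vocabulary.txt") as text, open("lm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "3", "--discount_fallback"],
                   stdin=text, stdout=arpa, check=True)

# Convert the text ARPA file into KenLM's binary format.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```

The .arpa file is just the intermediate text format; once you have the binary model (or the scorer package in newer releases) you can throw it away, which is what I meant by not needing it afterwards.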

Not sure I get the point here:

  • do you want help getting the GPU working in the VM?
  • do you want help getting it working on your base system, where the GPU is available?

Define your hardware, define your dataset. We can’t tell you without more context …

Hard to write when you don’t know who the “dummies” might be. Training a model is non-trivial. Which dummies do you target? People who know nothing about machine learning? People who are keen on machine learning but just new to DeepSpeech?

As for the second kind, I’m experimenting with this: https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/ so it’s easily forkable, hackable and reproducible for people who want to ease the pain.

I agree with raghupathyv4 here; I also use Linux in VMs only. I’m guessing he’s asking for the same reason as I am: since we don’t have dedicated Linux computers and have a limited budget, we use virtual machines instead.

So I guess both his question and mine is: how do we make it work in a Linux VM, where there is no GPU?

Also, what does “KenLM-level” mean?

I 100% agree that this is the worst-documented tech I’ve seen in many years, and I’m also trying to:

  • make it work.
  • create my own recognizer in a different language; I can easily make voice files from different voices.
  • include new words (local street names, etc.).

Thanks for taking the time to share how you feel. Writing documentation is hard, especially when the topic is complex; however, when we get actionable feedback on what to improve, we can do something about it.

There’s now a Playbook available: https://mozilla.github.io/deepspeech-playbook/

Unfortunately, if you need to do actual training, you will need a GPU.
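If you want to check quickly from inside the VM whether TensorFlow sees a GPU at all, something like this will tell you (a sketch against the TF 1.x API the training code uses):

```python
import tensorflow as tf

# Returns False when only the CPU is visible, which is the usual
# situation inside a plain Linux VM without GPU passthrough.
print(tf.test.is_gpu_available())
```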

This is documented. If all you can say is that it is poorly documented, I’m afraid we can’t help you.

This is documented and also covered in the playbook.

This is documented and also covered in the playbook.

Hi lissyx, thank you very much for the info and the link to the Playbook, it will be studied right away :grinning: