Training your own DeepSpeech model [Tips]

I am creating this post just to list out the steps I took to train my model.

  1. dataset
    I worked with data we built for TTS. With slight modifications, I could use the same dataset for ASR. The script is available at this repo: the format is described there, and as long as you mold your dataset to that format, you're good.
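As a minimal sketch of molding a dataset into shape: DeepSpeech's importers produce a CSV with `wav_filename`, `wav_filesize`, and `transcript` columns, so converting a TTS-style (audio path, transcript) list looks roughly like this. The function name and sample layout are placeholders for your own data, not anything from the repo above.

```python
import csv
import os

def write_import_csv(samples, out_path):
    """Write a DeepSpeech-style import CSV.

    samples: iterable of (wav_path, transcript) pairs.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # The three columns DeepSpeech's import CSVs use.
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in samples:
            # Lowercasing here is a choice, not a requirement; the transcript
            # just has to be consistent with your alphabet.txt.
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript.lower()])
```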

  2. alphabet.txt
    Build an alphabet.txt with your target orthography. You can find a couple of examples of these files here: you need to reference this file while generating the trie and while training. MAKE SURE YOU USE THE SAME FILE (SAME CHARACTER ORDER) FOR ALL OF THESE PROCESSES. If you change the order of the characters, regenerate the trie.
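One way to make sure the character set is fixed once and reused everywhere is to derive it from your transcripts. A minimal sketch (the function names are mine; DeepSpeech just expects one character per line in the file):

```python
def build_alphabet(transcripts):
    """Return the sorted list of unique characters across all transcripts."""
    return sorted(set("".join(transcripts)))

def write_alphabet(chars, path="alphabet.txt"):
    # One character per line; a line containing only a space is valid
    # and is how the word separator is represented.
    with open(path, "w") as f:
        for c in chars:
            f.write(c + "\n")
```

Because the list is sorted, regenerating it from the same transcripts always yields the same order, which is exactly the property the trie depends on.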

  3. util/
    For reference:
    - larger dataset (the largest I've used): ~30 GB (777 hours of training data)
    - smaller datasets: 30 minutes to 8 hours of training data
    - dropout_rate: 20% to 25% worked well for me
    - n_hidden: reduce it depending on how large your dataset is. I used 2048 for the Common Voice English dataset (around 770 hours of training data; I'll post the English results below). For smaller datasets, ranging from 30 minutes to 8 hours of audio, I got good results with 512, but YMMV.
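The rule of thumb above can be sketched as a tiny helper. The two sizes come from the post; the 100-hour cutoff is my own arbitrary midpoint between the ~770-hour and sub-8-hour regimes, not a tested threshold:

```python
def pick_n_hidden(hours_of_audio):
    """Rough heuristic from the post: 2048 for large datasets
    (~770 h Common Voice English), 512 for small ones (0.5-8 h).
    The 100 h cutoff is an assumption; tune for your data."""
    return 2048 if hours_of_audio > 100 else 512
```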

Learning rate: 0.0001 is high enough to converge quickly. As soon as you see overfitting, manually drop it to 75% of the current learning rate. [A good model, I've noticed, has a loss in the double digits or lower for larger datasets, and under 1 for smaller datasets.]
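The manual "drop to 75% on overfitting" rule can be automated if you track validation loss. A minimal sketch (the `patience` parameter and the stop-improving criterion are my assumptions, not from the post):

```python
def adjust_lr(lr, val_losses, patience=1):
    """Drop the learning rate to 75% of its current value once the
    validation loss has stopped improving for `patience` epochs."""
    if len(val_losses) > patience and \
       all(val_losses[-1] >= v for v in val_losses[-1 - patience:-1]):
        return lr * 0.75
    return lr
```

Called once per epoch with the running list of validation losses, this leaves the rate alone while loss is still falling and shrinks it geometrically once it plateaus.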

  4. language model
    The language model influences your results a lot. With the smaller datasets, I used around 8 GB of monolingual text to build the language model that supplements the network. Also, if you just want to check whether your DeepSpeech model is working as expected without the language model interfering (for debugging purposes), you can set lm_alpha and lm_beta in util/ to 0 (these are the weighting parameters for the language model).
    You do not necessarily need a lot of data to build a good language model, but make sure the data you use is comprehensive enough for your use case.
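To see why zeroing lm_alpha and lm_beta removes the language model's influence: CTC beam-search decoders typically combine the scores linearly. This is a simplified sketch of that shape, not DeepSpeech's actual decoder code:

```python
def decoder_score(acoustic_logprob, lm_logprob, word_count, lm_alpha, lm_beta):
    """Combined beam score: log P_acoustic + alpha * log P_lm + beta * |words|.

    With lm_alpha = lm_beta = 0, only the acoustic model's score remains,
    so the decoder output reflects the network alone.
    """
    return acoustic_logprob + lm_alpha * lm_logprob + lm_beta * word_count
```

With both weights at 0, candidate hypotheses are ranked purely by acoustic log-probability, which is exactly what you want when debugging the network in isolation.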

English results (in progress):

WER: 2.000000, CER: 0.090909, loss: 6.698823
 - src: "kettledrums"
 - res: "kettle drums"
WER: 1.500000, CER: 0.187500, loss: 20.059196
 - src: "workingday world"
 - res: "working day words"
WER: 1.500000, CER: 0.833333, loss: 28.203138
 - src: "common voice"
 - res: "the common voice as to"
WER: 1.500000, CER: 0.687500, loss: 37.482273
 - src: "eta eleventhirty"
 - res: "i venter in"
WER: 1.000000, CER: 0.050000, loss: 2.671792
 - src: "topsyturvy steamboat"
 - res: "topsy turvy steamboat"
WER: 1.000000, CER: 0.250000, loss: 3.145613
 - src: "amen"
 - res: "men"
WER: 1.000000, CER: 0.120000, loss: 4.288054
 - src: "ideas are uncopyrightable"
 - res: "ideas are an copyright able"
WER: 1.000000, CER: 0.100000, loss: 5.725993
 - src: "a steakandkidney pie"
 - res: "a steak and kidney pie"
WER: 1.000000, CER: 0.312500, loss: 11.599809
 - src: "dear mr thompson"
 - res: "damer thomson"
WER: 1.000000, CER: 0.312500, loss: 12.408183
 - src: "you alright mate"
 - res: "your right man"
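For anyone puzzled by WER values above 1.0 in the report: WER is edit distance divided by the reference word count, so insertions can push it past 1. A quick sketch that reproduces the first result line above ("kettledrums" → "kettle drums" is one substitution plus one insertion over a one-word reference, hence WER 2.0):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)
```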

Thank you for sharing this.

@alchemi5t, Why is that? Could you explain?

The trie built for decoding uses those characters in that exact order. To avoid mis-mapping, you need to keep them in the same order. @SamahZaro


Does training_batch_size impact accuracy? Ideally it should only impact the speed of training…

I’ve left a decent answer here.

Please let me know if you have any new information or if you find this inconsistent.

Sounds interesting… I'll share if I find it changes with batch_size.
I am training with the Common Voice dataset, and I think I need to generate the LM and trie for it. For that I need vocab.txt… any idea how to generate it for the CV dataset?