Training a custom model

To begin with, thank you for creating this amazing effort and initiative. I’ve spent a few weeks reading about and experimenting with DeepSpeech and I’m thrilled!

I’m building an offline voice assistant and I think I’ve got all the parts worked out. However, I have a few questions whose answers would really help me move forward. If everything works out, I’m going to publish an updated tutorial for 0.7.4.


  • I’m creating a voice assistant for a specific domain with ~120 words, typically spread over 50 phrases with variations.

  • It’s English for now, but I will move on to Swedish, my mother tongue, once I’ve got something working.

  • I’m aiming to build an offline Raspberry Pi 4 experience, meaning that I’m using the TensorFlow Lite model.

  • I’m collecting my own training data, crowdsourced in a similar way to how Common Voice does it.


Q1. Does it matter if I collect recordings of words or phrases for the training?

Q2. Does it matter if I trim silence in my recording at the end and beginning?

Q3. Does it matter if I normalize the sound files and “polish them up” a bit?

Q4. How much training data do I need? Would e.g. 100 people saying all phrases 3 times be enough for a non-fine-tuning approach?

Q5. If using the fine-tuning approach, how many epochs does it make sense to run? Should I use negative epochs as stated in some post here?

Q6. If using the fine-tuning approach, can I update the scorer or use my own?

Q7. What are some reasonable hyperparameters? My best bet right now for a training set of 4500 phrases is 400 hidden layers, 0.3 dropout, and a 0.0001 learning rate.

Q8. How do you guys avoid overfitting? I’m looking at TensorBoard, but that doesn’t show a test curve; I just see that train and dev are converging, and that’s it.

Q9. Does early stopping prevent overfitting?

Q10. How do I show WER (the same output as after training)?

So many questions :slight_smile: I would be enormously grateful to get just a few answered by anyone who knows.

warm regards,

@lissyx might have different answers. I guess some stuff is debatable, meaning you have to try.

Use phrases to get n-grams

Trimming is fine, but leave at least 50 ms.


300 input files are very few, my guess is 3000. But try.

Depends on learning rate :slight_smile: Try a small rate (1e-5 or 1e-6) and 20 epochs.

Use your own, as you have specific sentences.

Could work, learning rate as above, dropout is fine, hidden layers didn’t matter much in my experiments. Use lm_optimizer afterwards.

Check numbers in log and have lots of checkpoints to test earlier epochs.

If you set it to 5 maybe. Not meant for small runs.

Run just the test set, not train/dev and check flags, there is one for more reporting.
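The trimming advice above (trim, but keep at least 50 ms of silence) could look roughly like this in Python. This is only a minimal sketch and not part of DeepSpeech: the function name `trim_silence`, the amplitude `threshold` of 500, and the simple per-sample silence test are assumptions for illustration. It expects 16-bit PCM samples that are already loaded as a list of ints.

```python
def trim_silence(samples, sample_rate, threshold=500, margin_ms=50):
    """Trim leading/trailing quiet samples, keeping a margin of silence.

    samples: sequence of signed 16-bit PCM sample values (ints).
    threshold: hypothetical amplitude below which a sample counts as silence.
    margin_ms: silence to keep on each side of the detected speech.
    An all-silent input is returned unchanged.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples
    margin = int(sample_rate * margin_ms / 1000)
    start = max(0, loud[0] - margin)
    end = min(len(samples), loud[-1] + 1 + margin)
    return samples[start:end]


# Usage: 16 kHz audio with long silence around a short burst of "speech".
audio = [0] * 2000 + [1000] * 100 + [0] * 2000
trimmed = trim_silence(audio, sample_rate=16000)
```

A real recording would probably need an energy-based detector over short frames instead of a per-sample threshold, so treat the threshold as something to tune on your own data.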

In case you haven’t seen it already (link), I’m building something similar, but I don’t fine-tune the DeepSpeech model with domain-specific audio recordings. This makes it much easier to change the domain automatically, depending on which skills are installed. This approach still has good accuracy; in my benchmark it performs better than Alexa. You can still fine-tune the network to achieve even better results. Jaco doesn’t run on the Raspberry Pi yet, but I’m planning to support it in the near future.

Maybe you can find some inspiration for your project there. You can find it here:


Thank you so much for the answers Olaf and Dan :pray: Very helpful! Some additional questions:

When I build my vocabulary.txt (to be used for the scorer), it includes all the phrases. Should I also list the individual words on lines of their own, or is the file only used to calculate probabilities for what word sequences look like?

Regarding silence trimming of the sound files, my question is actually whether I should bother making the effort or whether it doesn’t matter.

It really depends on your use case / datasets; we can’t know in advance.

Hard to tell as well

Have you tried just building a dedicated external scorer? Re-training for that specific purpose is a lot of work, and you might already achieve your goal this way.

What is the question here? Training data should match real-life usage as closely as possible.

Same. If you trim silence, then your model won’t learn about silence, and that might be problematic depending on your use case.

Same. You likely don’t want to normalize at training time; otherwise you’ll likely have to perform the same kind of preprocessing at inference.

We can’t know in advance

Read the docs; this is not valid code anymore.

That depends on your use case. We document how to rebuild the scorer, so you can create your own.

Test evaluation is only done on the last best-dev checkpoint. I don’t get your point here; it doesn’t really make sense to push that to TensorBoard.

Without a plot, we can’t comment on how well your training is fitting.

You might need to tune for it; early stopping by default might not work for you.

Just include all your phrases
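To illustrate why listing the phrases alone is enough: every word in a phrase already shows up in the word and word-pair counts. The sketch below is not how the scorer is built (the documented scorer tooling does that); it only counts n-grams by hand. The example phrases and the helper names `normalize` and `ngram_counts` are made up for illustration, and lowercasing assumes a lowercase alphabet like the English one.

```python
from collections import Counter


def normalize(phrase):
    """Lowercase and collapse whitespace (assumes a lowercase alphabet)."""
    return " ".join(phrase.lower().split())


def ngram_counts(phrases, n):
    """Count word n-grams across all phrases."""
    counts = Counter()
    for phrase in phrases:
        words = normalize(phrase).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical domain phrases, one per line as they would appear in vocabulary.txt.
phrases = ["turn on the light", "turn off the light", "Turn on the radio"]
unigrams = ngram_counts(phrases, 1)  # single words are covered by the phrases
bigrams = ngram_counts(phrases, 2)   # word pairs carry the phrase structure
```

Adding each word again on a line of its own would mainly shift the single-word counts, so whether it helps is something to test rather than assume.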

I always train until early stopping (7 epochs without change, 3 epochs to reduce learning rate on plateau). It might not yield the optimal results, but I’m quite happy with it currently.

You should try this before running your own trainings. You can build a prototype and improve it later on. Or test it with my project, which does exactly this ;)
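The schedule mentioned above (stop after 7 epochs without improvement, reduce the learning rate after 3) can be sketched as a small simulation over dev losses. This is a hedged illustration of the general technique, not DeepSpeech’s own implementation: the function name, the parameter names, and the reduction factor of 0.1 are assumptions.

```python
def simulate_schedule(dev_losses, lr=1e-4, es_epochs=7,
                      plateau_epochs=3, reduction=0.1):
    """Walk through per-epoch dev losses.

    Returns (stop_epoch, final_lr, best_epoch); stop_epoch is None if
    early stopping never triggers. Epochs are 0-indexed.
    """
    best = float("inf")
    best_epoch = None
    since_best = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best:
            # New best dev loss: remember it and reset the patience counter.
            best, best_epoch, since_best = loss, epoch, 0
        else:
            since_best += 1
            # Every `plateau_epochs` epochs without improvement, shrink the rate.
            if since_best % plateau_epochs == 0:
                lr *= reduction
            # After `es_epochs` epochs without improvement, stop training.
            if since_best >= es_epochs:
                return epoch, lr, best_epoch
    return None, lr, best_epoch


# Usage: dev loss improves for three epochs, then stays flat.
stop_epoch, final_lr, best_epoch = simulate_schedule([5.0, 4.0, 3.0] + [3.5] * 7)
```

The checkpoint worth keeping is the one from `best_epoch`, not the one from the epoch where training stopped.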


Thanks again for the great answers :pray:

Hi, I’m wondering which WER should be reported, as the model returns Best, Median, and Worst WER.
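For context, a common convention is to report one aggregate WER over the whole test set rather than the per-sample figures. A minimal sketch of that calculation, assuming WER is the word-level edit distance divided by the number of reference words, summed over all samples (the function names and the example sentences are made up for illustration, and this is not the DeepSpeech evaluation code):

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """Aggregate WER over (reference, hypothesis) pairs."""
    total_edits = sum(edit_distance(ref.split(), hyp.split())
                      for ref, hyp in pairs)
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    return total_edits / total_words


# Usage with two hypothetical test samples: one perfect, one with one wrong word.
pairs = [("turn on the light", "turn on the light"),
         ("turn off the radio", "turn of the radio")]
wer = corpus_wer(pairs)
```

The per-sample best, median, and worst values are useful for spotting which kinds of utterances fail, while the aggregate value is the one that compares cleanly between runs.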