Improving accuracy by creating a specific model?


I have recently installed DeepSpeech and run it against a small WAV file: 55 words,
275 characters, length 19.968 s.

The accuracy is not good at all, with an error rate of 43.63%. If I use a small script to test the same audio with Google Speech Recognition, the error rate is only 20%. Still high, but better.
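For anyone wanting to reproduce these numbers: word error rate is usually computed as the word-level Levenshtein edit distance divided by the reference word count. A minimal sketch (not the exact script I used):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```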

I have even cut that WAV down to 9 seconds, yet there was no change in accuracy. Initially I used the command as per the docs:

deepspeech models/output_graph.pb my_audio_file.wav models/alphabet.txt

and have since added the files lm.binary and trie to the command, but tests reveal no change.
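For clarity, in the DeepSpeech releases of that era the language model and trie were passed as extra positional arguments after the alphabet (check the release notes for the exact argument order of your installed version):

```shell
deepspeech models/output_graph.pb my_audio_file.wav models/alphabet.txt models/lm.binary models/trie
```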

The audio is English and very clear. I followed the guidelines regarding specifications for the WAV, and it is the same as the sample WAVs for DeepSpeech.
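If anyone wants to double-check their file programmatically (the released model expects 16 kHz, mono, 16-bit PCM), Python's standard wave module will report the format. A quick sketch, using a generated file of silence as the example input:

```python
import wave

def wav_specs(path):
    """Return (sample_rate, channels, bits_per_sample) of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth() * 8

# Example: write one second of 16 kHz mono 16-bit silence, then inspect it.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes = 16 bits per sample
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(wav_specs("check.wav"))  # (16000, 1, 16)
```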

Regarding these sample model files: what bearing do they have on accuracy? Do I need to create a specific model to improve the accuracy, or 'add to' (train) the sample models? We have hundreds of audio files for just one speaker/person, and we wish to improve the accuracy substantially.

Since this is for a specific purpose, I am wondering whether it is better to create a model for that purpose only, rather than use the existing (sample) models?


This tutorial is very good and may be the solution: TUTORIAL : How I trained a specific french model to control my robot

(Vincent Foucault) #3

Hi. What is your "specific purpose"?
The answer is difficult:

1/ If you have a powerful PC and want, in the future, to use your own model for a complete STT system, why not keep the provided DeepSpeech model and improve it with your voice and sentences?!
Run a new training:

  • point to the existing model with --initialize_from_frozen_model
  • create a new empty checkpoint dir with --checkpoint_dir
  • pass all the params you need, and launch.

Example (feed in your own params!):

python -u DeepSpeech.py \
  --initialize_from_frozen_model /home/nvidia/DeepSpeech/data/alfred/results/output_graph0.095.pb \
  --checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/fine_tuning_checkpoints \
  --train_files /home/nvidia/DeepSpeech/data/alfred/xxx.csv \
  --dev_files /home/nvidia/DeepSpeech/data/alfred/xxx.csv \
  --test_files /home/nvidia/DeepSpeech/data/alfred/xxx.csv \
  --train_batch_size 20 \
  --dev_batch_size 20 \
  --test_batch_size 15 \
  --n_hidden 375 \
  --epoch 18 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 10 \
  --estop_mean_thresh 0.03 \
  --estop_std_thresh 0.03 \
  --dropout_rate 0.12 \
  --learning_rate 0.001 \
  --beam_width 500 \
  --report_count 50 \
  --lm_weight 5 \
  --valid_word_count_weight 3 \
  --decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/ \
  --alphabet_config_path /home/nvidia/DeepSpeech/data/alphabet.txt \
  --lm_binary_path /home/nvidia/DeepSpeech/data/lm.binary \
  --lm_trie_path /home/nvidia/DeepSpeech/data/trie

2/ Your PC is less powerful than you expected (they always are, LOL):
why not create a homemade model from scratch?


Option 2 seems more sensible, as the load of DeepSpeech running on that computer is unacceptable. I feel that adding to the current model will only put extra load on the computer. These audios run from 44 minutes to 1 hr 18 mins, so the current 5-to-10-second audio length limitation is simply not an option.
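If the length limitation is the blocker, one common workaround is to slice the long recordings into short clips before recognition. A rough sketch using only the standard wave module (fixed-length cuts with no silence detection; the file names are hypothetical):

```python
import wave

def split_wav(path, chunk_seconds=10):
    """Cut a PCM WAV into fixed-length chunks; returns the chunk file names."""
    names = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = "chunk_%03d.wav" % index
            with wave.open(name, "wb") as dst:
                dst.setparams(params)   # nframes is corrected on close
                dst.writeframes(frames)
            names.append(name)
            index += 1
    return names

# Example: a 25 s silent 16 kHz file splits into 10 s + 10 s + 5 s chunks.
with wave.open("long.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 25)

chunks = split_wav("long.wav")
print(len(chunks))  # 3
```

Cutting on fixed boundaries can split a word in half, so in practice you would want to cut on silences (e.g. with an energy threshold), but this shows the mechanics.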

Has anyone been able to use DeepSpeech on audio of an hour or more? If so, how long did it take, and what are the specifications of the computer?

I have even wondered if I should learn to touch type to create the transcriptions myself… seriously. :slight_smile: