Hi,
I’m wondering if there’s a way to use DeepSpeech to analyze an audio file and determine whether it contains a word of interest. This is similar to the problem in “How to classify unknown words, how to ignore words”. However, in this case I only want to check whether a word occurs in a sentence. For example:
file1.wav - “this wav file does not have the word of interest”
file2.wav - “this wav file has the word of interest which is gfuel”
The caveat is that I’m planning to train only on the word of interest (“gfuel”) and not on any other words in the audio file. The language model and the audio model will only include the trained word. I may be misunderstanding this, but I believe there is some kind of “threshold” above which a word is recognized; below it, DeepSpeech will not output anything at all:
(desired outcome)
file1.wav - outputs: “”
file2.wav - outputs: “gfuel”
Another way is to obtain the confidence score from the metadata field via the --json flag introduced in 0.5.1 and filter out sentences below a given threshold.
However, neither of these methods works. The first method produces incorrect inferences because of the limited language model: it prints “gfuel” for every file, because this is the only word I have trained on. The training data set looks like this:
wav_filename,wav_filesize,transcript
/root/speech/data/gfuel_custom/1.wav,241708,gfuel
/root/speech/data/gfuel_custom/2.wav,1139352,gfuel gfuel gfuel gfuel gfuel
/root/speech/data/gfuel_custom/3.wav,258092,gfuel
/root/speech/data/gfuel_custom/4.wav,462892,gfuel gfuel
/root/speech/data/gfuel_custom/5.wav,688172,gfuel gfuel gfuel
/root/speech/data/gfuel_custom/6.wav,227372,gfuel
/root/speech/data/gfuel_custom/7.wav,274476,gfuel
/root/speech/data/gfuel_custom/8.wav,376876,gfuel
/root/speech/data/gfuel_custom/9.wav,745516,gfuel gfuel gfuel gfuel
/root/speech/data/gfuel_custom/10.wav,274476,gfuel
/root/speech/data/gfuel_custom/11.wav,167980,gfuel
When running this, I get the following:
root@speech:~/speech/data/gfuel_custom# deepspeech --model output_graph.pb --alphabet alphabet.txt --lm lm.binary --trie trie --audio test1.wav
…
gfuel
Inference took 0.419s for 15.440s audio file.
I have generated the language model by following these steps:
…/…/kenlm/build/bin/lmplz --text transcript.txt --arpa words.arpa --o 3 --discount_fallback
…/…/kenlm/build/bin/build_binary -T -s words.arpa lm.binary
…/…/native_client/generate_trie alphabet.txt lm.binary trie
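As far as I understand, the language model can only score words that appear in the text it was built from, so it is worth checking what vocabulary it actually sees. The following is a minimal sketch, not part of DeepSpeech or KenLM: `vocabulary` is a hypothetical helper, and I am assuming transcript.txt holds the same transcripts as the training CSV above, one per line.

```python
from collections import Counter

# Transcript column copied from the training CSV above.
# Assumption: transcript.txt contains these same lines.
TRANSCRIPTS = [
    "gfuel",
    "gfuel gfuel gfuel gfuel gfuel",
    "gfuel",
    "gfuel gfuel",
    "gfuel gfuel gfuel",
    "gfuel",
    "gfuel",
    "gfuel",
    "gfuel gfuel gfuel gfuel",
    "gfuel",
    "gfuel",
]


def vocabulary(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


vocab = vocabulary(TRANSCRIPTS)
print(sorted(vocab))   # every distinct word the language model is built from
print(vocab["gfuel"])  # number of occurrences of the word of interest
```

With a single-word vocabulary there is no alternative word for the decoder to prefer, which would be consistent with “gfuel” being printed for every file.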
Here are the parameters that I have used to train the audio dataset:
python3 -u DeepSpeech.py --noshow_progressbar \
--train_files …/data/gfuel_custom/gfuel_custom.csv \
--test_files …/data/gfuel_custom/gfuel_custom.csv \
--train_batch_size 10 \
--dev_batch_size 10 \
--test_batch_size 5 \
--n_hidden 375 \
--epochs 33 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--checkpoint_dir …/data/gfuel_custom/checkpoint \
--alphabet_config_path …/data/gfuel_custom/alphabet.txt \
--lm_binary_path …/data/gfuel_custom/lm.binary \
--lm_trie_path …/data/gfuel_custom/trie \
--export_dir …/data/gfuel_custom/
I have played around with the n_hidden and epochs parameters, since I read somewhere that this behavior can be attributed to overfitting, but changing them has no effect.
I have also tried grabbing the confidence metadata (I had to download and install native_client). However, I’m getting inconsistent scores that bear no relation to what is actually in the audio.
file1.wav
{"metadata":{"confidence":27.3069},"words":[{"word":"gfuel","time":0.02,"duration":5.56}]}
file2.wav
{"metadata":{"confidence":27.6635},"words":[{"word":"gfuel","time":0.02,"duration":7.48}]}
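The threshold idea can be sketched roughly as follows, using the two outputs above as input. This is only an illustration under stated assumptions: `contains_word` is a hypothetical helper, not a DeepSpeech API, and the `THRESHOLD` value is made up for the example.

```python
import json

# JSON output copied from the two runs above.
RESULTS = {
    "file1.wav": '{"metadata":{"confidence":27.3069},'
                 '"words":[{"word":"gfuel","time":0.02,"duration":5.56}]}',
    "file2.wav": '{"metadata":{"confidence":27.6635},'
                 '"words":[{"word":"gfuel","time":0.02,"duration":7.48}]}',
}

THRESHOLD = 27.5  # hypothetical cut-off, chosen only for illustration


def contains_word(raw_json, word, threshold):
    """True if `word` was transcribed and the confidence reaches `threshold`."""
    result = json.loads(raw_json)
    words = [entry["word"] for entry in result["words"]]
    return word in words and result["metadata"]["confidence"] >= threshold


for name, raw in RESULTS.items():
    print(name, contains_word(raw, "gfuel", THRESHOLD))

# How far apart the two confidence values are.
scores = [json.loads(raw)["metadata"]["confidence"] for raw in RESULTS.values()]
print(round(max(scores) - min(scores), 4))
```

The two scores differ by less than 0.4, so any cut-off that separates these files has to fall inside that narrow window, which is why I don’t think a fixed threshold is reliable here.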
I have pretty much followed every step in “Tune MoziilaDeepSpeech to recognize specific sentences” and “TUTORIAL : How I trained a specific french model to control my robot”, but I am coming to the conclusion that DeepSpeech cannot indicate whether a word is present in an audio sentence and can only transcribe the best match from its available language model.
I would greatly appreciate it if someone could confirm this or point me in the right direction.
Coding session: https://www.twitch.tv/videos/451440201?t=01h01m43s