While working with DeepSpeech, we noticed that our overall recognition accuracy is poor. This doesn't match what we were expecting, especially after reading Baidu's Deep Speech research paper.
We are using the CPU build and run DeepSpeech with the Python client. (Switching to the GPU implementation would only increase inference speed, not accuracy, right?)
To measure accuracy, we used a Python implementation of WER (word error rate) to analyse the results. The overall performance on 20 samples (male and female speakers) is an error rate of 90%.
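For reference, this is roughly how we compute WER: word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of words in the reference. A minimal sketch (not our exact script, but equivalent in behaviour):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Our worst-case samples look like this: no word is recovered,
# so WER is 1.0 (i.e. 100% error) for that sample.
print(wer("from what series do you know rosie the robot", "utewilknorosi"))
```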
We used the pre-trained model as described in the README. The audio files were self-recorded: 16 kHz, mono, WAV format, with some background noise.
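To rule out a format mismatch (as far as we understand, the pre-trained model expects 16 kHz, mono, 16-bit PCM WAV), we sanity-checked our recordings with Python's standard-library `wave` module. A minimal sketch; `describe_wav` and the generated `demo.wav` tone are ours, purely for illustration:

```python
import math
import struct
import wave

def describe_wav(path: str) -> dict:
    """Report the properties DeepSpeech cares about for a WAV file."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),        # expect 1 (mono)
            "sample_rate": w.getframerate(),     # expect 16000
            "sample_width_bytes": w.getsampwidth(),  # expect 2 (16-bit PCM)
            "duration_s": w.getnframes() / w.getframerate(),
        }

# Write a synthetic 1-second 440 Hz tone in the expected format,
# just to demonstrate the check on a known-good file.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    frames = b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(16000)
    )
    w.writeframes(frames)

print(describe_wav("demo.wav"))
```

Running this check on our own recordings confirms they match the expected parameters.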
```
🐳 marvin models # deepspeech output_graph.pb /path/to/file/05_m_rosie_robot.wav alphabet.txt lm.binary trie
Loading model from file output_graph.pb
Loaded model in 0.514s.
Loading language model from files lm.binary trie
Loaded language model in 1.930s.
Running inference.
utewilknorosi
Inference took 12.948s for 4.250s audio file.
```
We would like to provide the audio file as well (transcript: "From what series do you know Rosie the robot?"), but .wav uploads are not supported here, so we have shared it via WeTransfer: https://we.tl/J4wpgkS94Y
Please let us know what further information you need to investigate and reproduce this issue.
We’re looking forward to hearing from y’all!