Hi everyone,
We are conducting tests to establish the WER between the official DeepSpeech pre-trained model (0.4.1 for now) and commercial tools (Watson, Google, Azure, AWS). Has anyone done something similar?
Results will vary widely (and wildly) depending on the test dataset used. We have focused on English but want to cover other languages soon.
It would be great to collaborate or chat about approaches if anyone is also already working on this.
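For anyone comparing numbers across engines, here is a minimal sketch of how WER is commonly computed: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions for illustration, not what DeepSpeech or any of the commercial tools use internally. Real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Hypothetical helper: tokenization is a plain whitespace split.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, results from different test sets or different text normalization are not directly comparable.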
I'd like to help. My main job over the last year or so has been technical marketplace intelligence in NLP, including speech to text. There doesn't seem to be anywhere near as much interest in the community in competing on such performance indicators in speech to text as there is in, say, text-based conversational AI (e.g. SQuAD, CoQA etc.), but maybe there will be in future. One issue we came across was that for arbitrary-length files, WER's cumulative error rate and apparently arbitrary sentence chunking methods made it impractical, in our research anyway. So we've been evaluating with difflib, which produces 100% reliable insert / delete / omission counts for any two texts, as long as you preprocess each word / token onto a separate line.
This isn't a problem right now for DeepSpeech with its "sentence-length" / "whole recording visibility" constraint (which in practice means you need to break up an audio recording into 5-7s chunks), but it will be an issue for dictation scenarios. Although with the streaming work maybe this has been eliminated. Please school me if I'm out of date.
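A rough sketch of the difflib approach, for illustration only. It assumes whitespace tokenization and lowercasing, and it uses `difflib.SequenceMatcher` on word lists rather than one-token-per-line text files. The function name `diff_counts` and the way unequal-length replacements are split are my own assumptions, not necessarily what we run internally.

```python
import difflib


def diff_counts(reference: str, hypothesis: str) -> dict:
    """Count word-level insertions, deletions and substitutions with difflib.

    Hypothetical helper: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    counts = {"insert": 0, "delete": 0, "substitute": 0}

    # autojunk=False stops difflib from treating frequent words as junk
    # on long transcripts.
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, hyp_len = i2 - i1, j2 - j1
        if tag == "insert":
            counts["insert"] += hyp_len
        elif tag == "delete":
            counts["delete"] += ref_len
        elif tag == "replace":
            # Pair words up as substitutions; any surplus is counted as
            # extra insertions or deletions.
            paired = min(ref_len, hyp_len)
            counts["substitute"] += paired
            counts["delete"] += ref_len - paired
            counts["insert"] += hyp_len - paired
    return counts


if __name__ == "__main__":
    print(diff_counts("the quick brown fox jumps",
                      "the quick brown dog jumps high"))
```

One caveat: difflib looks for long matching blocks and does not promise a minimal edit script, so these counts can differ from a Levenshtein-based WER on the same pair of texts.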
We published our WER result, 8.3%, for the 0.4.1 version of DeepSpeech on the LibriSpeech clean test data set (LS-c), which you can verify if you want.
Franck Dernoncourt benchmarked commercial engines on the same LibriSpeech clean test data set (LS-c) and found that the best commercial engine was Speechmatics with a WER of 7.3%, and the second best was IBM with 9.8%.
I'm surprised not to see Voicebase in there; our internal tests have shown good results. I guess it comes down to focusing on broadband acoustic models (vs those over traditional phone lines?)