Research comparisons of WER between official DeepSpeech model and commercial tools 🧐

Hi everyone,
We are conducting tests to compare the WER of the official DeepSpeech pre-trained model (0.4.1 for now) against commercial tools (Watson, Google, Azure, AWS). Has anyone done something similar?

Results will vary widely (and wildly) depending on the test dataset used. We have focused on English but want to cover other languages soon.

It would be great to collaborate or chat about approaches if anyone else is already working on this.

There is a company called Descript that published a Medium post on a similar comparison. Perhaps it will provide real-world guidance on what WER to expect.

I’d like to help. My main job over the last year or so has been technical marketplace intelligence in NLP, including speech-to-text. There doesn’t seem to be anywhere near as much interest in the community in competing on such performance indicators for speech-to-text as there is for text-based conversational AI (e.g. SQuAD, CoQA), but maybe there will be in future. One issue we came across was that for arbitrary-length files, WER’s cumulative error rate and the apparently arbitrary sentence-chunking methods made it impractical, at least in our research. So instead we’ve been evaluating with difflib, which gives 100% reliable insertion / deletion / omission counts for any two texts, as long as you preprocess them so that each word/token sits on its own line.
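
For what it’s worth, here is a rough sketch of that difflib approach. The file names `reference.txt` and `hypothesis.txt` are just placeholders, and it assumes each file has already been preprocessed to one lowercase token per line, as described above.

```python
# Sketch: count matches, omissions and insertions between a reference
# transcript and an ASR hypothesis using difflib, one token per line.
import difflib

def load_tokens(path):
    """One token per line, lowercased, empty lines dropped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

ref = load_tokens("reference.txt")   # placeholder file name
hyp = load_tokens("hypothesis.txt")  # placeholder file name

diff = list(difflib.ndiff(ref, hyp))
# "- " lines are in the reference but missing from the hypothesis (omissions),
# "+ " lines are in the hypothesis but not the reference (insertions),
# "  " lines are matches; "? " hint lines are ignored here.
omissions  = sum(1 for d in diff if d.startswith("- "))
insertions = sum(1 for d in diff if d.startswith("+ "))
matches    = sum(1 for d in diff if d.startswith("  "))

print(f"matches={matches} omissions={omissions} insertions={insertions}")
```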

This isn’t a problem right now for DeepSpeech, given its “sentence-length” / “whole recording visibility” constraint (in practice meaning you need to break an audio recording up into 5–7 s chunks), but it will be an issue for dictation scenarios. Although with the streaming work maybe that constraint has been removed; please school me if I’m out of date.
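
To illustrate the kind of pre-chunking I mean, here is a minimal sketch that splits a long mono WAV into fixed ~6 s pieces using only the standard-library wave module. The file names and the 6 s length are placeholders, and in practice you would probably want VAD-based splitting so you don’t cut words in half.

```python
# Sketch: naively split a long WAV into fixed-length chunks before transcription.
import wave

CHUNK_SECONDS = 6  # illustrative value within the 5-7 s range mentioned above

with wave.open("long_recording.wav", "rb") as src:  # placeholder file name
    params = src.getparams()
    frames_per_chunk = src.getframerate() * CHUNK_SECONDS
    index = 0
    while True:
        frames = src.readframes(frames_per_chunk)
        if not frames:
            break
        with wave.open(f"chunk_{index:03d}.wav", "wb") as dst:
            dst.setparams(params)   # same sample rate, width and channels
            dst.writeframes(frames)  # header frame count is patched on close
        index += 1
```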

We published our WER result, 8.3%, for the 0.4.1 release of DeepSpeech on the LibriSpeech clean test set (LS-c), which you can verify if you want.
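
For anyone who wants to re-check numbers like that on their own transcripts, here is a minimal, self-contained WER sketch: the standard word-level Levenshtein distance (substitutions, insertions, deletions) divided by the reference length. It is only an illustration, not the exact scoring script behind the 8.3% figure.

```python
# Sketch: standard word-level WER via dynamic-programming edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```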

Franck Dernoncourt benchmarked commercial engines on the LibriSpeech clean test set (LS-c) and found that the best commercial engine on LS-c was Speechmatics with a WER of 7.3%, and the second best was IBM with 9.8%:

All figures are WER (%):

| ASR API | Date | CV | F | IER | LS-c | LS-o |
| --- | --- | --- | --- | --- | --- | --- |
| Human | | | | | 5.8 | 12.7 |
| Google | 2018-03-30 | 23.2 | 24.2 | 16.6 | 12.1 | 28.8 |
| Google Cloud | 2018-03-30 | 23.3 | 26.3 | 18.3 | 12.3 | 27.3 |
| IBM | 2018-03-30 | 21.8 | 47.6 | 24.0 | 9.8 | 25.3 |
| Microsoft | 2018-03-30 | 29.1 | 28.1 | 23.1 | 18.8 | 35.9 |
| Speechmatics | 2018-03-30 | 19.1 | 38.4 | 21.4 | 7.3 | 19.4 |
| Wit.ai | 2018-03-30 | 35.6 | 54.2 | 37.4 | 19.2 | 41.7 |

Hi @tinok,

Were you able to conduct your tests? Do you have results yet?

Thanks

I’m surprised not to see Voicebase in there; our internal tests have shown good results. I guess it comes down to focusing on broadband acoustic models (vs. those trained for audio over traditional phone lines?).