Different methods for calculating WER for a test dataset

While running some tests with DeepSpeech for my bachelor thesis, I stumbled upon the following: the WER I calculated for my test dataset was off by a few percentage points compared to the WER reported by the DeepSpeech.py script on the same test dataset at the end of training.

Initially, I assumed the WER for a test dataset is reported by averaging the WER of every single audio file in the test dataset, so that is how I implemented it. But looking at the source, I realised DeepSpeech does it differently. From the DeepSpeech repo:

wer = sum(s.word_distance for s in samples) / sum(s.word_length for s in samples)

Here all the edits (substitutions, deletions, insertions) across all audio files are summed and then divided by the total number of words across all the ground truths. Consequently, as I understand it, the entire dataset is treated as one big transcription, which I guess intuitively also makes sense?
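To make the difference concrete, here is a small sketch comparing the two approaches on made-up per-file statistics (the `Sample` tuple and its numbers are hypothetical, but `word_distance` / `word_length` mirror the fields used in the DeepSpeech snippet above):

```python
from collections import namedtuple

# Hypothetical per-file stats: word_distance is the total edit count
# (substitutions + deletions + insertions) against the reference,
# word_length is the number of words in the reference transcript.
Sample = namedtuple("Sample", ["word_distance", "word_length"])

samples = [
    Sample(word_distance=1, word_length=2),   # short clip: per-file WER = 0.50
    Sample(word_distance=2, word_length=20),  # long clip:  per-file WER = 0.10
]

# Pooled ("corpus-level") WER, as in the DeepSpeech snippet:
pooled_wer = sum(s.word_distance for s in samples) / sum(s.word_length for s in samples)

# Average of per-file WERs (the "mean of means" approach):
mean_wer = sum(s.word_distance / s.word_length for s in samples) / len(samples)

print(pooled_wer)  # 3 / 22 ≈ 0.136
print(mean_wer)    # (0.5 + 0.1) / 2 = 0.3
```

The pooled version weights each file by its reference length, so short files with a few errors cannot dominate the score the way they do when every file contributes equally to a plain average.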

Now obviously these two approaches can yield very different results, so is there any reason it is done this way? Is this how WER is conventionally reported for a test dataset? I have not looked extensively, but having quickly looked through some of the SotA papers on Papers With Code, I cannot find a description of how the WER is computed over the whole test dataset.

As far as we can tell there is a de facto convergence on the way Kaldi does it, and that is what we do. We used to report a mean of means at some point, but someone complained it wasn't matching Kaldi and we changed it 🙂

Thank you for your answer. Looking at Kaldi's source code, I can see now that this is also how they do it.

I still wonder whether this is also the method researchers use when they report WER on, for example, LibriSpeech, but I see that is probably not a question that can be answered here. Once again, thanks.