So, having done some tests with DeepSpeech for my bachelor thesis, I stumbled upon the following: the WER I calculated for my test dataset was off by a few percentage points compared to the WER reported by the DeepSpeech.py script on the same test dataset at the end of training.
Initially, I assumed the WER for a test dataset is reported by averaging the WER of every single audio file in the test dataset, so that is how I implemented it. But looking at the source, I realised DeepSpeech does not do it that way. From the DeepSpeech repo:
wer = sum(s.word_distance for s in samples) / sum(s.word_length for s in samples)
Here, the edits (substitutions, deletions, insertions) of all audio files are summed and then divided by the total number of words across all the ground truths. Consequently, as I understand it, the entire dataset is treated as one big transcription, which I guess intuitively also makes sense?
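To make the difference concrete, here is a minimal, self-contained sketch of both computations (the word_edit_distance helper and the sample transcripts are just made up for illustration; this is not DeepSpeech's actual evaluation code):

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    dp = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp_words, 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,             # deletion
                dp[j - 1] + 1,         # insertion
                prev_diag + (r != h),  # substitution (or match)
            )
    return dp[-1]

# Made-up (reference, hypothesis) pairs, only to show that the two numbers diverge.
samples = [
    ("hello world", "hello world"),  # 0 edits over 2 reference words
    ("yes", "no"),                   # 1 edit  over 1 reference word
]

# 1) What I had implemented: average the per-file WERs (macro average).
per_file = [
    word_edit_distance(ref.split(), hyp.split()) / len(ref.split())
    for ref, hyp in samples
]
macro_wer = sum(per_file) / len(per_file)  # (0.0 + 1.0) / 2 = 0.5

# 2) What the DeepSpeech line does: total edits over total reference words (micro average).
micro_wer = sum(
    word_edit_distance(ref.split(), hyp.split()) for ref, hyp in samples
) / sum(len(ref.split()) for ref, _ in samples)  # 1 / 3 = 0.333...

print(macro_wer, micro_wer)  # -> 0.5 0.3333...
```

When every utterance has the same number of reference words the two coincide; otherwise the micro average weights long utterances more heavily, while the per-file average weights every file equally.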
Now, obviously these two approaches can yield very different results, so is there any reason it is done this way? Is this how WER is conventionally reported for a test dataset? I have not looked extensively, but having quickly skimmed some of the SotA papers on Papers With Code, I cannot seem to find a description of how the WER is computed over the whole test dataset.