V0.9.3 - Improving accuracy of numbers, dollar values and calendar dates

MarcS · January 8, 2021, 9:31pm

Environment: Ubuntu 18.04 on x64 platform
Release: Deepspeech v0.9.3
Python Version: 3.6.x
Language: US English
Mic: eMeet

I was able to install the entire 0.9.3 release correctly, including all of the tools/scripts, nvidia-tensorflow-gpu, KenLM, and the pre-built model and scorer. I can successfully train on the Mozilla Common Voice 6.1 corpus (I just wanted to test that the install and configuration were correct), and I can go through the process to rebuild/duplicate the included LibriSpeech-based scorer. Neither process throws any errors or exceptions.

Out of the box I have an accuracy rate approaching 100% for standard word-based sentence constructs (e.g. "The quick brown fox …", "Now is the time for …", etc.), and I can also speak a sequence of standard digits (e.g. one, two, three, four, five …) in quick succession with near-perfect recognition. But the accuracy for ‘real world’ numbers (e.g. one hundred thousand) and dates (e.g. January seventh two thousand and twenty one) is far lower. I would like to improve the recognition accuracy of real-world numbers, dollar values, and dates.

My first question is: has someone already developed an English language model/scorer with improved accuracy in these areas that has a non-restrictive license (Apache, Creative Commons, MIT, etc.)? If not, what steps would I need to go through to improve the accuracy of the existing model/scorer in these areas?

Prior to posting this note, based upon “Newb Theory”, I did try to create a custom scorer from a modified version of the ‘librispeech-lm-norm.txt’ file, which I downloaded from the OpenSLR.org website and extended with additional sentences for numbers and dates. I followed the exact instructions provided on deepspeech.readthedocs.io for re-creating the pre-built scorer, but there was no improvement in recognition accuracy. I also tried the same process using a brand new text file containing just a few sentences, but recognition accuracy did not change either.

Any help or guidance would be greatly appreciated.

othiele · January 8, 2021, 10:31pm

First, run your audio without a scorer argument to see what the acoustic model detects in your test numbers.
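A minimal sketch of that check using the DeepSpeech 0.9.x Python package, assuming a mono 16-bit WAV recording at the model's sample rate. The helper names `load_pcm16` and `transcribe` and the file names in the usage comment are made up for illustration; leaving `scorer_path` as `None` means no external scorer is enabled, so the output is the acoustic model's raw guess.

```python
import wave


def load_pcm16(path, expected_rate=16000):
    """Read a mono 16-bit WAV file and return its raw PCM bytes."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        if wav.getframerate() != expected_rate:
            raise ValueError("expected %d Hz audio" % expected_rate)
        return wav.readframes(wav.getnframes())


def transcribe(model_path, wav_path, scorer_path=None):
    """Transcribe one file, with or without an external scorer."""
    # Imported lazily so load_pcm16 can be used without DeepSpeech installed.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    if scorer_path is not None:
        model.enableExternalScorer(scorer_path)
    pcm = load_pcm16(wav_path, model.sampleRate())
    audio = np.frombuffer(pcm, dtype=np.int16)
    return model.stt(audio)


# Hypothetical usage; the paths below are placeholders:
# print(transcribe("deepspeech-0.9.3-models.pbmm", "numbers_test.wav"))
# print(transcribe("deepspeech-0.9.3-models.pbmm", "numbers_test.wav",
#                  scorer_path="custom.scorer"))
```

Comparing the two outputs for the same recording shows whether the errors come from the acoustic model itself or from the scorer.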

Then change the custom scorer's source text to contain a lot of combinations of the numbers you want to detect. Simplified, the language model gives the probability of an exact combination of spoken numbers relative to all of the text it was built from. So, ideally you have some occurrences of each exact combination, not just the individual number words.
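One way to get many such combinations is to generate them. The sketch below spells out random numbers, dollar amounts, and dates as lowercase words, which can then be appended to the text used to build the scorer. Everything in it is an assumption for illustration: the function names, the sample size, the year range, and the spoken style (for example "two thousand and twenty one"), which should be adjusted to match how you actually say these phrases.

```python
import random

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
# Ordinals for days 1..31; index 0 is "first".
ORDINALS = ["first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth", "twentieth"]
ORDINALS += ["twenty " + ORDINALS[i] for i in range(9)]
ORDINALS += ["thirtieth", "thirty first"]


def number_to_words(n):
    """Spell out 0 <= n < 1,000,000 as lowercase English words."""
    if not 0 <= n < 1000000:
        raise ValueError("n out of range")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + tail


def date_to_words(month, day, year):
    """Spell out a date with a year in 2000..2099, e.g. 'january seventh ...'."""
    if not 2000 <= year <= 2099:
        raise ValueError("year out of range for this sketch")
    year_words = "two thousand"
    if year > 2000:
        year_words += " and " + number_to_words(year - 2000)
    return "%s %s %s" % (month, ORDINALS[day - 1], year_words)


def build_corpus_lines(sample_size=1000, seed=0):
    """Return number, dollar, and date phrases, one per list entry."""
    rng = random.Random(seed)
    lines = []
    for n in rng.sample(range(1000000), sample_size):
        words = number_to_words(n)
        lines.append(words)
        lines.append(words + (" dollar" if n == 1 else " dollars"))
    for month in MONTHS:
        # Days 1..28 are valid in every month, which keeps the sketch simple.
        for day in range(1, 29):
            for year in (2020, 2021):
                lines.append(date_to_words(month, day, year))
    return lines
```

Writing the result of `build_corpus_lines()` to a text file, one phrase per line, and adding it to the input text for the scorer build gives the language model concrete examples of each combination. Embedding the phrases in carrier sentences that resemble real usage is likely to help more than bare phrases.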