How to deal with numbers and abbreviations?

So when creating datasets, how do I deal with numbers and abbreviations? What is the best way?

  • When the speaker said One, should I write ‘1’ or ‘one’?
  • When the speaker said Doctor, should I write ‘dr’ or ‘doctor’?

Hi @thegame,

Firstly a huge thank you for contributing to the Common Voice project!

I am just a community member, not a staff member, but would like to provide some information here.

The Common Voice Playbook has a lot of excellent information on how to approach collecting a text corpus, but it does not provide guidance on your specific questions above.

In general, abbreviations should be written in full - such as “Saint” for “St”, “street” for “st”, “avenue” for “ave”, “doctor” for “Dr” and so on. This is due to how speech recognition algorithms work, and specifically how they predict characters from audio recordings (for example via connectionist temporal classification, CTC). If abbreviations are left as abbreviations in the text corpus, the model learns to map the spoken word “doctor” onto the characters “dr”, and its accuracy suffers.
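To make that concrete, here is a minimal Python sketch of the kind of expansion pass you could run over candidate sentences before adding them to a corpus. The table and the capitalisation rule are purely illustrative - a real pass would need a much larger, language-specific table and some context to disambiguate cases like “St” (Saint versus street):

```python
import re

# Hypothetical expansion table - a real one would be much larger
# and language-specific.
EXPANSIONS = {
    "dr": "doctor",
    "st": "saint",   # ambiguous: could equally be "street"; a real
                     # pass needs context to choose
    "ave": "avenue",
    "mr": "mister",
}

def expand_abbreviations(sentence: str) -> str:
    """Replace known abbreviations with their full written form."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        expansion = EXPANSIONS[word.lower().rstrip(".")]
        # Preserve the capitalisation of the original token.
        return expansion.capitalize() if word[0].isupper() else expansion

    # Match whole words, with an optional trailing period ("Dr." or "Dr").
    # Note: this naively drops a sentence-final period too.
    pattern = r"\b(" + "|".join(EXPANSIONS) + r")\.?(?=\s|$)"
    return re.sub(pattern, replace, sentence, flags=re.IGNORECASE)

print(expand_abbreviations("Dr Smith lives on Fifth Ave"))
# -> "Doctor Smith lives on Fifth Avenue"
```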

Numbers should be written in cardinal form: “one hundred and twenty three”, not “123”. There are two reasons for this. Firstly, most speech algorithms use audio-to-character prediction, and we want the model to learn what the numerals sound like so that it can generalise from, say, “one hundred and twenty three” to “two hundred and thirty one” - it needs to learn words like “thirty” and “twenty”. Secondly, different dialects of English phrase the same written numbers differently. I’m Australian, so I say “one hundred and ninety nine”. Americans are more likely to say “one hundred ninety nine” (no “and”). So, by writing the numbers in full, every person speaking the sentence will say the same words, rather than interpreting “199” differently. Again, this has implications for model accuracy.
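If you are preparing sentences programmatically, the digit-to-words step can be handled with a library such as num2words - though note that its default English output uses the British-style “and” and hyphenates tens, so the dialect point above still applies and the output is worth reviewing. A minimal sketch:

```python
import re
from num2words import num2words  # pip install num2words

def spell_out_numbers(sentence: str, lang: str = "en") -> str:
    """Replace each run of digits with its cardinal spelling."""
    return re.sub(
        r"\d+",
        lambda m: num2words(int(m.group()), lang=lang),
        sentence,
    )

print(spell_out_numbers("The house at 199 Oak Street sold twice in 2 years"))
# -> "The house at one hundred and ninety-nine Oak Street sold twice
#     in two years"
```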

Many speech recognition models then predict these sorts of words in full - so if I say “seven hundred and sixty five”, the output will be “seven hundred and sixty five” not “765”. A separate step, called inverse normalisation (often its own model), is then applied to predict “765” from “seven hundred and sixty five”. This is an open problem in natural language processing, because there is so much variation in natural language. For example, if I say “twelve fifty”, do I mean:

  • 1250hrs
  • 12.50pm
  • 12.50am
  • $12.50
  • £12.50
  • 1250
  • 12-50
  • TwelveFifty Nightclub

and all of these could be correct, depending on the context - the toy sketch below makes the same point in code.
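Here is a deliberately crude, hypothetical chooser - the cue words and the outputs are made up, and it only handles this one phrase, but it illustrates why context has to decide the written form:

```python
# Toy illustration of why inverse normalisation is ambiguous: the same
# spoken phrase resolves differently depending on the surrounding words.
def inverse_normalise(spoken: str, context: str) -> str:
    """Pick a written form for "twelve fifty" from crude context cues."""
    if "costs" in context or "paid" in context:
        return "$12.50"     # a price
    if " at " in context or "until" in context:
        return "12:50"      # a clock time
    if "in the year" in context:
        return "1250"       # a year
    return spoken           # no cue: leave the words alone

print(inverse_normalise("twelve fifty", "the ticket costs twelve fifty"))
# -> "$12.50"
print(inverse_normalise("twelve fifty", "meet me at twelve fifty"))
# -> "12:50"
```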

In summary, you should spell out abbreviations and numbers.

Many speech recognition models then predict these sorts of words in full - so if I say “seven hundred and sixty five”, the output will be “seven hundred and sixty five” not “765”.

Unfortunately, this is not the case for OpenAI Whisper: the pre-trained models do output digits. The answer is correct, in a way, but when you run automated tests to get CER/WER against the original Common Voice sentences, you get worse results.

One way to get “a more correct” CER/WER value is to post-process the transcription and convert the numeric values back to cardinal form, but this is not easy, as it is language- and dialect-dependent, as @kathyreid emphasized.
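As a rough illustration of that post-processing, here is a sketch that spells out digit runs in the hypothesis before scoring, using the num2words and jiwer libraries. The sentences are made up, and a real pass would need language-aware handling of hyphens, currencies, ordinals and so on:

```python
import re
from jiwer import wer            # pip install jiwer
from num2words import num2words  # pip install num2words

def digits_to_words(text: str, lang: str = "en") -> str:
    """Spell out digit runs so the hypothesis matches a spelled-out reference."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

reference = "the meeting starts at twelve"   # the Common Voice sentence
hypothesis = "the meeting starts at 12"      # what Whisper might output

print(wer(reference, hypothesis))                   # 0.2 - "12" counts as an error
print(wer(reference, digits_to_words(hypothesis)))  # 0.0 after post-processing
```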

It is arguably not good practice for a model to behave like this, because it loses information from the speech. In some applications, where the user is entering numbers, you can of course benefit from it.

Fortunately, there is a workaround: pre-handle them at the tokenizer stage, as described in one of the discussions on the Whisper repo.
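As I understand it, the idea there boils down to suppressing every token that contains a digit during decoding, which forces the model to emit numbers as words. A rough sketch against the openai-whisper package - the model size and audio path are placeholders, and this assumes a recent release where the tokenizer exposes a tiktoken encoding:

```python
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")            # model size is a placeholder
tokenizer = get_tokenizer(multilingual=True)

# Collect the ID of every token whose decoded text contains a digit.
digit_tokens = [
    i
    for i in range(tokenizer.encoding.n_vocab)
    if any(ch.isdigit() for ch in tokenizer.encoding.decode([i]))
]

# The -1 entry keeps Whisper's default suppression list; the digit
# tokens are added on top, so the decoder can never emit "765".
result = model.transcribe("audio.wav", suppress_tokens=[-1] + digit_tokens)
print(result["text"])   # numbers should now come out spelled as words
```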

I strongly agree with @bozden here - Whisper should output cardinal form and then use a separate layer for inverse normalisation, so that the inverse normalisation layer can be swapped out and fine-tuned for different dialects. Excellent explanation, thank you!
