Converting numbers in textual form to numerical values in STT output

Hi all,

Deep Speech outputs everything as words, with no digits. So let’s say I want the output “I want 2000$ now” instead of “I want two thousand dollars now”, which is what Deep Speech produces. What approaches have people tried for this? Do people use rule-based heuristics to post-process the output after speech recognition? Or do they include digits and the dollar sign in the alphabet during training (not a favourable approach in my case, as I want to use a pretrained model rather than train from scratch)? Or do people apply some language model at the end for this purpose?
This question of mine is not specific to Deep Speech, so I apologise if this is the wrong place to post, but since I am working with Deep Speech itself and want to address this issue, I have posted it here. I’d really appreciate any help I get :slight_smile:

Usually people train on a corpus of paired audio and transcripts in which all numerical expressions are written out as words. So a transcript will contain “twenty ten” instead of “2010”. Thus, the model learns to map audio to written-out numerical expressions.

There are a few reasons why this is done:

  • If the corpus is created from people reading sentences, then “2010” is ambiguous. Is it read “twenty ten”, “two zero one zero”, “two thousand and ten”, or something else? To resolve this ambiguity, numerical expressions are written out as words.
  • The mapping from audio to text that the system has to learn is easier if it transcribes numerical expressions as written-out words. For example, if I say “two zero one zero”, should that be transcribed as “2 0 1 0”, “20 10”, “2010”, or some other way? Usually the answer is context-specific. For the year of a date it’s likely “2010”; for a lock combination, “2 0 1 0” or maybe “20 10”. Without more context you can’t know, which makes it a harder task.

Usually, speech-to-text engines that want numerical values such as “2010” in their transcripts have a post-processing step that looks at the context (Does this written-out number appear as a date? As a temperature? As a lock combination?) and then transforms the written-out number into digits depending on that context.
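
As a rough illustration of what such a rule-based post-processing step could look like, here is a minimal Python sketch. It is not taken from Deep Speech or any other engine; the names `words_to_int` and `normalize` are hypothetical, and it only handles plain English cardinals with no context rules at all.

```python
# Hypothetical sketch: replace runs of English number words with digits.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_int(words):
    """Read a list of number words as a single cardinal number."""
    total, current = 0, 0
    for w in words:
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand" or "million" closes the current group
            total += max(current, 1) * SCALES[w]
            current = 0
    return total + current


def normalize(text):
    """Replace each run of number words in a lowercase transcript with digits."""
    out, run = [], []
    for tok in text.split() + [None]:  # the None sentinel flushes the last run
        if tok is not None and tok in NUMBER_WORDS:
            run.append(tok)
            continue
        if run:
            out.append(str(words_to_int(run)))
            run = []
        if tok is not None:
            out.append(tok)
    return " ".join(out)


print(normalize("i want two thousand dollars now"))  # i want 2000 dollars now
```

Note that this naive reading turns “twenty ten” into 30 rather than 2010, and it does not handle digit-by-digit readings or “and”. That is exactly the context problem described above, so a real system would need extra rules for dates, currencies, codes and so on.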

Thanks for the quick reply, kdavis. So this post-processing step is usually just simple rule-based heuristics, right? Or is it also common to train a model that looks at the context and decides how to convert to numerical values?

In the cases I’ve seen, it’s rule-based heuristics.

I had the same question in the past with cmusphinx,
and yes, I needed to use letters only instead of numbers.
Moreover, if you want to use the model in a bot (like a conversational one), it will be difficult to integrate numbers (or it will need pre-processing).

Hi,

I’m looking for libraries that convert number words into digits. Could you please tell me where you have seen these rule-based heuristics? I would be thankful!

Best to try Google. Depending on your needs, something like this might do the trick: https://pypi.org/project/word2number/

Hi @nmstoker and thank you for the answer!

I tried word2number, but it works only for English sequences. I need a solution for German/French/English.

What do you mean by Google? Do they have a library for this? I would be glad for your answer.

Have a good day!

I see. I didn’t realise you’d tried that and also didn’t realise that you needed it for those three languages, as it wasn’t clear in your post.

Sorry I wasn’t clear by mentioning Google - I meant to try googling for an answer. Depending on your needs there are likely to be people who’ve tried this either practically (e.g. discussing on sites like StackOverflow) or academically (e.g. with research which might be published on arXiv).

There’s an article covering some of the concepts here: https://machinelearning.apple.com/research/inverse-text-normal

That has some references to research and crucially it lists terms commonly used to describe this which may help with further googling for solutions. Reverse or inverse text normalisation is the key one.

In my voice assistant project I’m using Duckling for number and date conversions. It does support many different languages.

Another option is the extract_number function of Lingua Franca.

I’m also playing with Duckling now, but I run into a problem pretty often. Maybe you can help me?
Response JSON: {'message': 'An invalid response was received from the upstream server'}

The error always happens when the input contains a German umlaut such as ü/ö/ä.
Do you apply any encoding?
A sentence such as “fünf sechs sieben fünf” leads to an error.
input_transcript (in the payload below) is a string.

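import requests

# url (the Duckling parse endpoint) and input_transcript are defined elsewhere
payload = f"locale=de_DE&text={input_transcript}&dims=number"
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}
try:
    response = requests.post(
        url,
        data=payload,
        headers=headers,
    )
    if response.status_code == 200:
        json_response = response.json()
except requests.RequestException as err:
    # the original snippet was cut off here; minimal handler so the try block is complete
    print(f"Request failed: {err}")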

I don’t have problems with umlauts.
Currently I’m using duckling with the container from rasa, you can get it with docker pull rasa/duckling, but I don’t think that should make a difference.

My request:
curl -XPOST http://0.0.0.0:8000/parse --data 'locale=de_DE&text=fünf sechs sieben fünf'

Returns:
[{"body":"fünf","start":0,"value":{"value":5,"type":"value"},"end":4,"dim":"number","latent":false},{"body":"sechs","start":5,"value":{"value":6,"type":"value"},"end":10,"dim":"number","latent":false},{"body":"sieben","start":11,"value":{"value":7,"type":"value"},"end":17,"dim":"number","latent":false},{"body":"fünf","start":18,"value":{"value":5,"type":"value"},"end":22,"dim":"number","latent":false}]
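
In case it helps, here is a small sketch of how a response of this shape could be written back into the transcript. The helper name `apply_entities` is my own and not part of Duckling, and it assumes that each entity carries character offsets in `start`/`end` and the number under `value.value`, as in the output above.

```python
import json


def apply_entities(text, entities):
    """Replace each recognised number span in text with its numeric value."""
    # Work from right to left so the offsets of earlier spans stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent.get("dim") != "number":
            continue
        digits = str(ent["value"]["value"])
        text = text[:ent["start"]] + digits + text[ent["end"]:]
    return text


# Shortened copy of a response body in the shape shown above.
response_body = (
    '[{"body":"fünf","start":0,"value":{"value":5,"type":"value"},'
    '"end":4,"dim":"number","latent":false},'
    '{"body":"sechs","start":5,"value":{"value":6,"type":"value"},'
    '"end":10,"dim":"number","latent":false}]'
)
entities = json.loads(response_body)
print(apply_entities("fünf sechs", entities))  # 5 6
```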

Thanks @dan.bmh!

I solved this issue by using URL encoding and URL decoding.
I used urllib.parse.quote() and urllib.parse.unquote().
Worked!
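
For anyone landing here later, a minimal sketch of that fix using only the standard library. The helper name `build_payload` is just for illustration, and the payload fields are the ones used earlier in this thread.

```python
from urllib.parse import quote, unquote


def build_payload(text, locale="de_DE"):
    """Build a form-encoded body with the transcript percent-encoded as UTF-8."""
    return f"locale={locale}&text={quote(text)}"


payload = build_payload("fünf sechs sieben fünf")
print(payload)
# locale=de_DE&text=f%C3%BCnf%20sechs%20sieben%20f%C3%BCnf

# unquote() reverses the encoding, e.g. to inspect what was sent.
print(unquote(payload))
# locale=de_DE&text=fünf sechs sieben fünf
```

The encoded string contains only ASCII characters, so it can be passed as the request body without the umlauts being mangled on the way.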