Deep Speech outputs only letters, no digits. So let’s say I want to output “I want 2000$ now” instead of “I want two thousand dollars now”, which is what Deep Speech will output. I was wondering what approaches people have tried for this. Do people use rule-based heuristics to post-process the output after speech recognition? Or, while training, do they make the alphabet include digits and the dollar sign as well (not a favourable approach in my case, as I want to use a pretrained model and not train from scratch)? Or do people use some language model for this purpose at the end?
This question of mine is not specific to Deep Speech, so I apologise if this is the wrong place to post, but since I am working with Deep Speech itself and want to address this issue, I have posted it here. I’d really appreciate any help I get.
Usually people train on a corpus, i.e. pairs of audio and associated transcripts, in which all numerical expressions are explicitly written out. So a transcript will contain “twenty ten” instead of “2010”. Thus, the model learns to map audio to numerical expressions explicitly written out.
There are a few reasons why this is done:
If the corpus is created from people reading sentences, then “2010” is ambiguous. Is it read “twenty ten”, “two zero one zero”, or “two thousand and ten”? To resolve this ambiguity, numerical expressions are explicitly written out.
The mapping from audio to text the system has to learn is easier if it transcribes numerical expressions explicitly. For example, if I say “two zero one zero”, should that be transcribed as “2 0 1 0”, “20 10”, “2010”, or some other way? Usually the answer is context specific. For the year of a date it’s likely “2010”, for a lock combination “2 0 1 0” or maybe “20 10”. Without more context, you can’t know, which makes it a harder task.
Usually what speech-to-text engines do, if they want to have explicit numerical values in their transcript, e.g. “2010”, is that they have a post-processing step that looks at the context (Does this explicitly written out number appear as a date? As a temperature? As a lock combination?) and then transforms the explicitly written number to digits depending on that context.
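To make the post-processing idea concrete, here is a minimal rule-based sketch in Python. It is not taken from Deep Speech or any particular library; the names `words_to_int` and `normalize` are made up for illustration. It assumes a lowercase English transcript, reads every run of number words as a plain cardinal, and applies a single context rule (a following “dollars” becomes a trailing “$”, to match the example in the question).

```python
# Hypothetical sketch of a rule-based post-processing step (English only).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_int(tokens):
    """Read a run of number words as one cardinal number."""
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100
        elif tok in SCALES:
            total += max(current, 1) * SCALES[tok]
            current = 0
        # "and" inside a run ("two thousand and ten") adds nothing
    return total + current


def normalize(transcript):
    """Replace runs of number words with digits; attach '$' for dollars."""
    tokens = transcript.split()
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens):
            if tokens[j] in NUMBER_WORDS:
                j += 1
            elif (tokens[j] == "and" and j > i and j + 1 < len(tokens)
                  and tokens[j + 1] in NUMBER_WORDS):
                j += 1  # "and" between two number words stays in the run
            else:
                break
        if j == i:  # not a number word, copy it through unchanged
            out.append(tokens[i])
            i += 1
            continue
        value = words_to_int(tokens[i:j])
        if j < len(tokens) and tokens[j] in ("dollar", "dollars"):
            out.append(f"{value}$")  # context rule: currency
            j += 1
        else:
            out.append(str(value))
        i = j
    return " ".join(out)


print(normalize("i want two thousand dollars now"))  # i want 2000$ now
```

Note that this sketch only knows the cardinal reading, so “twenty ten” comes out as 30 rather than 2010. Handling years, digit-by-digit sequences such as lock combinations, or other languages needs extra context rules of the kind described above.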
Thanks for the quick reply kdavis. So this post processing step is usually just simple rule-based heuristics right? Or is it also common to train a model and use it to look at the context and decide how to make the conversion to numerical values?
I had the same question in the past with cmusphinx,
and yes, I needed to use letters only, instead of numbers.
Moreover, if you want to use the model in a bot (like a conversational one), it will be difficult to integrate numbers (or it will need pre-processing).
I’m looking for some libraries for converting number words into digits. Could you please tell me where you have seen these rule-based heuristics? I would be thankful!
I see. I didn’t realise you’d tried that and also didn’t realise that you needed it for those three languages, as it wasn’t clear in your post.
Sorry I wasn’t clear by mentioning Google - I meant to try googling for an answer. Depending on your needs there are likely to be people who’ve tried this either practically (eg discussing on sites like StackOverflow etc) or academically (eg with research which might be published on Arxiv etc).
That has some references to research and crucially it lists terms commonly used to describe this which may help with further googling for solutions. Reverse or inverse text normalisation is the key one.
I’m also playing with duckling now, but I’m running into a problem pretty often. Maybe you can help me, sir?
response json: {'message': 'An invalid response was received from the upstream server'}
The error always happens when I include a German letter with an umlaut, such as ü/ö/ä.
Do you apply any encoding?
A sentence such as “fünf sechs sieben fünf” leads to an error…
input_transcript (in the payload below) is a string
I don’t have problems with umlauts.
Currently I’m using duckling with the container from rasa, you can get it with docker pull rasa/duckling, but I don’t think that should make a difference.
My request: curl -XPOST http://0.0.0.0:8000/parse --data 'locale=de_DE&text=fünf sechs sieben fünf'
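For anyone sending the same request from Python instead of curl, here is a minimal sketch using only the standard library. It assumes the Duckling server is reachable at the URL from the curl command above; the function names `build_request` and `parse` are made up for illustration. One possible cause of umlaut errors, not confirmed in this thread, is a request body that is not UTF-8 encoded, so the sketch percent-encodes the form fields as UTF-8 explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same host and port as in the curl command above.
DUCKLING_URL = "http://0.0.0.0:8000/parse"


def build_request(text, locale="de_DE", url=DUCKLING_URL):
    """Build a form-encoded POST request; umlauts are percent-encoded as UTF-8."""
    body = urlencode({"locale": locale, "text": text}).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    }
    return Request(url, data=body, headers=headers)


def parse(text, locale="de_DE"):
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(build_request(text, locale)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the encoded body without contacting the server.
print(build_request("fünf sechs sieben fünf").data)
```

Printing the body first makes it easy to check that “ü” really goes out as `%C3%BC` before looking for the problem on the server side.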