Converting numbers in textual form to numerical values in STT output

(Teamprobable) #1

Hi all,

Deep Speech outputs only letters, no digits. So let’s say I want to output “I want 2000$ now” instead of “I want two thousand dollars now”, which is what Deep Speech will output. What approaches have people tried for this? Do people use rule-based heuristics to post-process the output after speech recognition? Or, during training, do they extend the alphabet to include digits and the dollar sign as well (not a favourable approach in my case, as I want to use a pretrained model and not train from scratch)? Or do people use some language model for this purpose at the end?
This question is not specific to Deep Speech, so I apologise if this is the wrong place to post, but since I am working with Deep Speech and want to address this issue, I have posted it here. I’d really appreciate any help I get :)

(kdavis) #2

Usually people train on a corpus (pairs of audio and associated transcripts) in which all numerical expressions are explicitly written out. So a transcript will contain “twenty ten” instead of “2010”. Thus, the model learns to map audio to numerical expressions that are explicitly written out.

There are a few reasons why this is done:

  • If the corpus is created from people reading sentences, then “2010” is ambiguous. Is it read “twenty ten”, “two zero one zero”, “two thousand and ten”…? To resolve this ambiguity, numerical expressions are explicitly written out.
  • The mapping from audio to text that the system has to learn is easier if it transcribes numerical expressions explicitly. For example, if I say “two zero one zero”, should that be transcribed as “2 0 1 0”, “20 10”, “2010”, or some other way? Usually the answer is context specific. For the year of a date it’s likely “2010”, for a lock combination “2 0 1 0” or maybe “20 10”. Without more context you can’t know, which makes it a harder task.

Usually, if speech-to-text engines want explicit numerical values in their transcript, e.g. “2010”, they have a post-processing step that looks at the context (Does this written-out number appear as a date? As a temperature? As a lock combination?) and then transforms the written-out number to digits depending on that context.
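
As a rough illustration of such a post-processing step, here is a minimal Python sketch that rewrites runs of English number words in a transcript as digits. It is not part of Deep Speech; the names `words_to_number` and `normalize_numbers` are made up for this example. It assumes lowercase, space-separated output and treats every run as a single cardinal number, so it deliberately does not handle “twenty ten”, digit-by-digit readings, or “and” inside a number. Those cases need the context rules described above.

```python
# Hypothetical helper tables for this sketch (English cardinals only).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1000000}


def words_to_number(words):
    """Convert a run of number words, e.g. ['two', 'thousand'], to an int."""
    total = 0    # value of the groups already closed by thousand/million
    current = 0  # value of the group being built (below one thousand)
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # "thousand" or "million" closes the current group
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current


def normalize_numbers(transcript):
    """Replace each run of number words in the transcript with digits."""
    out, run = [], []
    for token in transcript.split() + [None]:  # None flushes the last run
        if token in UNITS or token in TENS or token in SCALES:
            run.append(token)
            continue
        if run:
            out.append(str(words_to_number(run)))
            run = []
        if token is not None:
            out.append(token)
    return " ".join(out)


print(normalize_numbers("i want two thousand dollars now"))
# prints: i want 2000 dollars now
```

Turning “2000 dollars” into “2000$” would be one more rule of the same kind, applied after this step.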

(Teamprobable) #3

Thanks for the quick reply, kdavis. So this post-processing step is usually just simple rule-based heuristics, right? Or is it also common to train a model and use it to look at the context and decide how to make the conversion to numerical values?

(kdavis) #4

In the cases I’ve seen, it’s rule-based heuristics.
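
To give an idea of what a rule-based heuristic of this kind might look like, here is a small Python sketch, not taken from any particular engine. It looks at the few words before a run of spoken digits and picks a rendering from them. The cue-word lists, the three-word window, and the function names are all assumptions made up for the example.

```python
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Hypothetical cue words; a real system would need far more rules.
YEAR_CUES = {"year", "since"}
CODE_CUES = {"code", "combination", "pin"}


def render_digit_run(context, digit_words):
    """Choose a rendering for a run of digit words from nearby cue words."""
    digits = [DIGITS[w] for w in digit_words]
    if CODE_CUES & context:
        return " ".join(digits)      # lock combination style: "2 0 1 0"
    if YEAR_CUES & context and len(digits) == 4:
        return "".join(digits)       # year style: "2010"
    return " ".join(digit_words)     # no cue found: leave the words as-is


def apply_context_rules(transcript):
    """Rewrite runs of spoken digits using the three preceding words."""
    tokens = transcript.split()
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in DIGITS:
            j = i
            while j < len(tokens) and tokens[j] in DIGITS:
                j += 1
            context = set(out[-3:])  # up to three words before the run
            out.append(render_digit_run(context, tokens[i:j]))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(apply_context_rules("the combination is two zero one zero"))
# prints: the combination is 2 0 1 0
print(apply_context_rules("in the year two zero one zero"))
# prints: in the year 2010
```

When no cue matches, the sketch leaves the words untouched rather than guessing.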

(Vincent Foucault) #5

I had the same question in the past with cmusphinx,
and yes, I needed to use letters only instead of numbers.
Moreover, if you want to use the model in a bot (like a conversational one), it will be difficult to integrate numbers (or it will need pre-processing).