Predicting mobile number

Hello @lissyx @reuben, I have created a language model for my specific problem and trained the model on Common Voice English data; my model had 1.3 loss on the training set.
The prediction is working quite well, but I am confused about how to deal with mobile numbers in the language model. I have written every single number from zero to nine (in words) line by line, but I don't think single words are helping the model predict better, so what can I do?
I thought the other method could be writing 10-digit numbers in 9! arrangements. Can you suggest what I can do?

Adding `--discount_fallback` during the build of the language model is a start.

Also, check the KenLM GitHub for others with the same problem.

Thank you for the reply, but I have already added `--discount_fallback`.
Actually, I want to convert speech to text for anyone saying their mobile number.
The problem is I am confused whether writing numbers like, for example:
double
triple
zero
one
two

nine
is the right format, or whether I should include zero to nine in a single sentence with different combinations?
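Whichever convention you choose, it helps to apply it consistently when generating the training corpus. As a sketch (the “double”/“triple” convention for repeated digits is an assumption; adapt it to however your speakers actually read numbers out), here is one way to normalize a digit string into its spoken form:

```python
# Sketch: normalize a digit string into spoken words, collapsing
# repeated digits into "double"/"triple". The repeat convention is an
# assumption -- pick one convention and use it consistently for both
# the LM training corpus and any post-processing of the transcript.

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_number(digits: str) -> str:
    words = []
    i = 0
    while i < len(digits):
        # find the length of the run of identical digits starting at i
        run = 1
        while i + run < len(digits) and digits[i + run] == digits[i]:
            run += 1
        word = DIGIT_WORDS[digits[i]]
        if run == 2:
            words.append("double " + word)
        elif run == 3:
            words.append("triple " + word)
        else:
            words.extend([word] * run)  # single digits, or runs longer than 3
        i += run
    return " ".join(words)

print(spell_number("9800122333"))
# nine eight double zero one double two triple three
```

Running this over a list of sampled (not exhaustively enumerated) phone numbers gives you corpus sentences in one consistent format.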

I would not train on every possible combination of phone numbers. The KenLM approach assumes the corpus hints at which words generally appear next to each other, and it builds a probability score from that.

You can tell which of these is more likely:
“I would like to help”
“I to like would help”
The language model would figure out that the more likely ordering of this n-gram is “I would like to help” given its chain probability score.
If you train on all combinatorial numbers, you're not really helping the language model learn which number sequences are correct, since every possible sequence has been fed in during training. I think.
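The chain-probability idea above can be shown with a toy bigram model. This is only an illustration (real KenLM adds smoothing and backoff, and works in log space); the tiny corpus here is made up:

```python
# Toy bigram model: count adjacent word pairs in a tiny corpus, then
# score a sentence by multiplying conditional probabilities P(w2 | w1).
# Natural orderings reuse seen bigrams and score higher; scrambled
# orderings hit unseen bigrams and collapse to zero (no smoothing here).
from collections import Counter

corpus = [
    "i would like to help",
    "i would like to call",
    "would you like to help",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence: str) -> float:
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        # unsmoothed conditional probability P(w2 | w1)
        p *= bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p

print(score("i would like to help"))  # positive: all bigrams were seen
print(score("i to like would help"))  # 0.0: contains unseen bigrams
```

If every digit permutation appears in the corpus, every digit bigram gets a similar count, so the model has no ordering preference left to learn — which is the point made above.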

I’m curious what your experimentation shows. Phone numbers and addresses may benefit from separate models. I remember reading on this Discourse about a user trying to transcribe voicemails. Perhaps they have some options as well.