Help: how to generate a custom scorer?

Hi all,

I’m a beginner with DeepSpeech. I installed the latest version as specified here: https://deepspeech.readthedocs.io/en/v0.9.3/index.html

And I’m now able to transcribe using the CLI command and the native client (BTW, I’m working on a micro open-source project showing how to use the DS server from Node.js: https://github.com/solyarisoftware/DeepSpeechJs).

Question 1:
Considering that I would like to use DS as a short-sentence ASR for a closed-domain chatbot, where there are specific kinds of user utterances such as:

  • spelled alphanumeric codes (e.g. M N Q U one two four six)
  • specific named entities such as person names (e.g. Giorgio Robino, Giuditta Del Buono)
  • etc.

If I understood correctly, I can improve the transcription accuracy of the pre-trained model “just” by building a custom scorer file (customApp.scorer) to be used at run time (avoiding re-training the pre-trained acoustic model with custom audio files):

deepspeech \
  --model deepspeech-0.9.3-models.pbmm \
  --scorer customApp.scorer \
  --audio sample.wav

Is that true?
BTW, is there any data/report showing quantitatively how much accuracy rises when using a custom scorer for specific closed-domain inputs?


Question 2:
I read documentation about how to create my own scorer file:
https://deepspeech.readthedocs.io/en/v0.9.3/Scorer.html#external-scorer-scripts

But I’m confused. Is there any step-by-step tutorial showing how I can proceed?

:pray::pray: A step-by-step example would help a lot! Does it exist?

Where are data/lm/generate_lm.py and generate_scorer_package located?

What’s the format of the original text file containing custom sentences?

If, for example, I want the ASR to better understand 4-digit numeric codes:

one zero zero zero 
one zero zero one 
one zero zero two 
one zero zero three 
...
...
nine nine nine nine 

is the text a collection of all the possible sentences, so in this case all the numbers spelled out in words between 0000 and 9999?
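
Just to make my idea concrete, something like this (a rough sketch of mine, not from the docs; the file name and word list are only illustrative) is how I imagine generating such a text:

#!/usr/bin/env bash
# Sketch: emit every 4-word spelled-out digit code (10^4 = 10,000 lines,
# one sentence per line) into a text file that could later be appended
# to the language-model input text. File name is illustrative.
WORDS=(zero one two three four five six seven eight nine)
for a in "${WORDS[@]}"; do
  for b in "${WORDS[@]}"; do
    for c in "${WORDS[@]}"; do
      for d in "${WORDS[@]}"; do
        echo "$a $b $c $d"
      done
    done
  done
done > digit_codes.txt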


Question 3:
One last point is not clear to me. For the best result in the general case, I would extend the pre-trained model’s scorer with my custom data. In this case, do I need to add my custom sentences at the end of the original pre-trained scorer’s source text, or is building a custom scorer from scratch the way to go?


BTW, my configuration:

(deepspeech-venv) uname -a 
linux itd-giorgio-laptop 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(deepspeech-venv) $ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

(deepspeech-venv) $ python --version
Python 3.8.5

(deepspeech-venv) $ deepspeech --version
DeepSpeech  0.9.3

(deepspeech-venv) $ sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: WhiskeyLake-U GT2 [UHD Graphics 620]
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:129 memory:a1000000-a1ffffff memory:b0000000-bfffffff ioport:6000(size=64) memory:c0000-dffff

Thanks!
giorgio

Your link is already step-by-step; what is unclear, please?

data/lm/generate_lm.py is in the repo.

generate_scorer_package is in the native_client release tar.

For the text format, please look at how vocab-500000.txt is produced.

I’m not sure I follow you; both suggestions end up being the same: take the “reproducing our external scorer” steps, add your sentences to the generated vocab-500000.txt file, and use generate_scorer_package.
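
Roughly, that flow looks like this (a sketch only; paths, file names and the KenLM location are placeholders you have to adapt, and the alpha/beta values are just the ones from the documented example, to be re-tuned for your own scorer, e.g. with lm_optimizer.py in the repo):

# 1) produce lm.binary and vocab-500000.txt from the text corpus
#    (this is the step where your own sentences come in)
python3 data/lm/generate_lm.py \
  --input_txt your-corpus.txt.gz \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

# 2) package lm.binary + vocab into a .scorer
#    (generate_scorer_package ships in the native_client tarball)
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package customApp.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284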


Unfortunately, it’s quite specific to each application, so we don’t have generic data. I guess you can find feedback from various contributors, and our own experience shows that it’s quite efficient, but I made no thorough analysis.

Thanks @lissyx

I’m not sure I follow you; both suggestions end up being the same: take the “reproducing our external scorer” steps, add your sentences to the generated vocab-500000.txt file, and use generate_scorer_package.

So you are suggesting to ADD the custom sentences at the end of the vocab-500000.txt file, extending the original file, rather than building from scratch a vocab.txt containing JUST the custom sentences.

Right?

Again, it really depends on your use case, and as I said, I’m unsure I properly understood yours.

Ultimately, have you understood the role of the scorer? It helps “proofread” what the acoustic model has decoded, so:

  • if you want to constrain recognition to a very specific domain, build a new scorer from scratch using sentences from that domain
  • if you want to add domain-specific data but keep it generic, add your data to the default one

Since a few releases ago, we also have boosting of specific words accessible in the API; please check this feature, it might help in your case as well.


Well, it depends on the application context of course, and I’d avoid overfitting on too closed a domain; I want to add domain-specific data but keep something generic, as you say, so the union of the two sets seems to be the solution. Thx

Since a few releases ago, we also have boosting of specific words accessible in the API; please check this feature, it might help in your case as well.

I don’t know much about it; I’ll dig deeper.

Search for AddHotWord in the API docs, it’s exposed in all our bindings.
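
If I remember correctly, the 0.9.3 Python CLI client also exposes it as a --hot_words flag taking comma-separated word:boost pairs, something like this (the words and boost values below are made up, tune them for your data):

deepspeech \
  --model deepspeech-0.9.3-models.pbmm \
  --scorer deepspeech-0.9.3-models.scorer \
  --hot_words "zero:7.5,one:7.5" \
  --audio sample.wav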


Thanks @lissyx for previous answers, but

I realized vocab-500000.txt is an output (not an input) of data/lm/generate_lm.py, just enumerating all the words contained in the “original text file”:

http://www.openslr.org/resources/11/librispeech-lm-norm.txt

So, with the goal of creating a custom scorer (building an enhanced language model), I understand I have to ADD my own custom sentences to librispeech-lm-norm.txt.

Q1: Is that correct?

Q2: the format of the original text file seems to be one sentence per line; here is a small chunk:

A A BAD WOMAN
A A BANK ROB
A A BEAR
A A BEAUTIFUL EPISODE FOR WHICH RECEIVE MY BEST THANKS
A A BEGAN THE ACCUSED ONE
A A BETTER MAN
A A BIG SPIDER
A A BLACK DEVIL
A A BLANK IMPERSONAL VACANT SET OF ROOMS
A A BLIND MAN PREACHES TO THREE MILLION PEOPLE A BOY'S MISTAKE A SAD RECONCILIATION A BUSINESS MAN CONFESSING CHRIST A CHILD AT ITS MOTHER'S GRAVE A CHILD LOOKING FOR ITS LOST MOTHER A CHILD'S PRAYER ANSWERED A CHILD VISITS ABRAHAM LINCOLN AND SAVES THE LIFE OF A CONDEMNED SOLDIER A COMMERCIAL TRAVELER A DAY OF DECISION A DEFAULTER'S CONFESSION A DISTILLER INTERROGATES MOODY A DREAM A DYING INFIDEL'S CONFESSION A FATHER'S LOVE FOR HIS BOY A FATHER'S LOVE TRAMPLED UNDER FOOT A FATHER'S MISTAKE AFFECTION AFFLICTION A GOOD EXCUSE A HEAVY DRAW ON ALEXANDER THE GREAT A LITTLE BOY CONVERTS HIS MOTHER A LITTLE BOY'S EXPERIENCE A LITTLE CHILD CONVERTS AN INFIDEL ALL RIGHT OR ALL WRONG A LONDON DOCTOR SAVED AFTER FIFTY YEARS OF PRAYER A LONG LADDER TUMBLES TO THE GROUND ALWAYS HAPPY A MAN DRINKS UP A FARM A MAN WHO WOULD NOT SPEAK TO HIS WIFE A MOTHER DIES THAT HER BOY MAY LIVE A MOTHER'S MISTAKE AN EMPEROR SETS FORTY MILLION SLAVES FREE ANGRY AT FIRST SAVED AT LAST AN INFIDEL WHO WOULD NOT TALK INFIDELITY BEFORE HIS DAUGHTER AN IRISHMAN LEAPS INTO THE LIFE BOAT A REMARKABLE CASE A RICH FATHER VISITS HIS DYING PRODIGAL SON IN A GARRET AND FORGIVES HIM ARTHUR P OXLEY
A A BOARD ON BOARD OF
A A BOAT
A A BOOK
A A BOY THAT AIN'T GOT A START YET
A A BRANCH
A A BREVE E E BREVE BREVE O O BREVE A A MACRON E E MACRON MACRON O O MACRON U U MACRON Y Y MACRON OE O E ARE RECORDED AS OE IN THE LATIN ONE AND ASCII TEXTS
A A BRIDLE HIS VOICE BARELY AUDIBLE
A A BROUGHAM

So if I want to integrate the above data with, for example, spelled-out sequences of digits (e.g. up to 7 digits), I have to append to the previously mentioned file something like (just a small chunk here):

ZERO 
ZERO ZERO 
ZERO ZERO ZERO 
ZERO ZERO ZERO ZERO 
ZERO ZERO ZERO ZERO ZERO 
ZERO ZERO ZERO ZERO ZERO ZERO 
ZERO ZERO ZERO ZERO ZERO ZERO ZERO 
ONE 
ZERO ONE 
ZERO ZERO ONE 
ZERO ZERO ZERO ONE 
ZERO ZERO ZERO ZERO ONE 
ZERO ZERO ZERO ZERO ZERO ONE 
ZERO ZERO ZERO ZERO ZERO ZERO ONE 
TWO 
ZERO TWO 
ZERO ZERO TWO 
ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO ZERO ZERO TWO 
THREE 
ZERO THREE 
ZERO ZERO THREE 
ZERO ZERO ZERO THREE 
ZERO ZERO ZERO ZERO THREE 
ZERO ZERO ZERO ZERO ZERO THREE 
ZERO ZERO ZERO ZERO ZERO ZERO THREE 

Is that correct?


Q3: The generate_lm example in the documentation is:

python3 generate_lm.py --input_txt librispeech-lm-norm.txt.gz ...

So, do I have to gzip librispeech-lm-norm.txt, creating a new librispeech-lm-norm.txt.gz?
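
If so, I guess the concatenation step is just something like this (file names are mine, only illustrative):

# append the custom sentences to the LibriSpeech text and recompress,
# to match the --input_txt example from the docs
cat librispeech-lm-norm.txt digit_codes.txt | gzip -c > extended-lm-corpus.txt.gz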


Q4:
Consider again the generate_lm example in the documentation:

What exactly does python3 generate_lm.py --top_k 500000 ... mean?

Why is top_k set to 500000?
I mean: I understand it’s the vocabulary size, and it has maybe been set in relation to the number of distinct words in the original text, but how can I tune this parameter, especially if I want to introduce new words into the vocabulary?

BTW, I guess this is not an issue in my example, where the words are spelled-out numbers (one, two, etc.) already present in the vocabulary, but what if I want to introduce special jargon (e.g. healthcare/medical terms) with the goal of letting DS recognize these new words?
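
My naive idea for checking this would be to count the distinct words in the (extended) corpus, verify that top_k is at least that large, and check that the new jargon words appear often enough to survive the top-k cut. Something like (file names are only illustrative):

# rough word-frequency check on the extended corpus
zcat extended-lm-corpus.txt.gz | tr -s ' ' '\n' | sort | uniq -c | sort -rn > word_counts.txt
wc -l < word_counts.txt           # number of distinct words
grep -iw "zero" word_counts.txt   # how often a word of interest occurs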


Q5: Maybe I’m missing something basic, but why does the documentation suggest using http://www.openslr.org/resources/11/librispeech-lm-norm.txt as the language-model text (for the pre-trained English DS model)?

I mean, is the LibriSpeech list of sentences the standard reference? The source (https://www.openslr.org/11) is absolutely respectable, but the corpus appears to be a bit obsolete (2014). Am I missing something?


Thanks everyone in advance for feedback.
At the end of my experiment, I’ll try to contribute by proposing some documentation updates.

giorgio

yes

I don’t think you need all those repetitions

I think we can handle both?

Why not? The idea is really just to filter garbage from the LibriSpeech text.

Giorgio, from your questions I get the impression you’re rushing to ask questions and possibly overthinking things. It might be better to read the code a little more, look at the docs with care, maybe try the new Playbook, and most of all: try a few things out to see what you figure out empirically.

As for Q5 - the way people speak English has not changed significantly since 2014, besides the odd neologism, so that source doesn’t seem to me like it would be obsolete. It’s effectively acting as an easily available massive set of text from which a language model can develop a good understanding of the relative probabilities of words following other words. In the spirit of “walk before you run” I’d stick with it for now if I were you and just append to it as per the other discussed points.

Neil,
yes, as a researcher, in general I tend to overthink :wink:

Digging deeper into DS behavior, I confirm the official docs are imprecise here and there. For example, some program parameters are not explained in detail. The empirical way (code digging) is not the best approach for me if the point is just to understand how to set a command-line parameter. I’ll probably propose a pull request for the above-mentioned documentation.

Anyway, thank you for the https://mozilla.github.io/deepspeech-playbook/ doc, which I didn’t know about!

About the original corpus, sure, 2014 is fine from the linguistic point of view. I asked because it seems weird to me that we are referencing a 6-year-old corpus. Reading the corpus sentences, at first glance I’m a bit perplexed that it gives a “good understanding of the relative probabilities of words following other words”. But it’s just an impression, I admit.

BTW, I thought that the sentence corpus (used to produce the scorer for the pre-trained English model) was extracted from the Common Voice transcription corpus. But that’s not the case, as far as I can see.

So my question is: is there any documented analysis showing that the LibriSpeech corpus is a good/right choice? This is not a criticism, just a simple question.

As always in science and engineering, we have to know the hypotheses in order to progress and “run” :slight_smile:

@lissyx I don’t think you need all those repetitions

Why?

As I said, my goal is to try to improve the accuracy of sequences of digits / alphanumeric codes.

E.g. consider sequences of up to 7 digits:

2
02
002
0002
00002
000002
0000002

=> the custom corpus would contain:

TWO 
ZERO TWO 
ZERO ZERO TWO 
ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO ZERO TWO 
ZERO ZERO ZERO ZERO ZERO ZERO TWO

So my attempt is to insert into the corpus all the combinations of input word sequences. Doesn’t that make sense to you? If not, why?

So I don’t know if you’re aware, but people still use the Brown Corpus for various tasks and it’s from 1961.

Regarding the metrics/comparisons to establish whether it’s the best choice, I think that would need to wait for a comment from the main contributors, although given that the results of DeepSpeech are very good on various accuracy measures against known datasets, that alone gives a sense that the LM aspect must be good enough. If you wish to look into improving it then I’m sure they would be delighted to have a contribution / PR :slightly_smiling_face: There are various measures of LM effectiveness (e.g. perplexity, entropy, etc.) with details available by googling, but again I’d suggest that with something like this it’s best to experiment - compiling the LM part is surprisingly quick (it takes much less time than asking a question, and then you won’t need to tie someone up answering it :wink:)
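
For instance, if you just want a quick number to compare two LM builds, KenLM’s query tool will report perplexity over a held-out text file, roughly like this (paths and file names are illustrative):

# perplexity of an LM over held-out sentences, via KenLM's query tool;
# the summary at the end includes perplexity including/excluding OOVs
/path/to/kenlm/build/bin/query lm.binary < heldout_sentences.txt | tail -n 4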

Well, I don’t know your specific problem; I just assumed you were working on just digits, not combinations like that.

If this reflects what you need / expect to catch, I guess it’s right?

I’m not really sure about that analysis; it’s just the release date of the corpus. If you dig into it, you can see it also features very old English (which has been somewhat problematic).

The first versions were based on a somewhat cleaned-up dump of English Wikipedia; @reuben and @kdavis reworked it from LibriSpeech and it improved test WER as well as the in-the-wild tests we could do, so that’s how we switched. But there’s always room for improvement if you have enough time.

No, if you do that you bias your test step, since your LM already knows about your train set.

@kathyreid just published this, so early feedback to improve it is more than welcome.


It is really easy to understand and follow! Thank you @kathyreid
If you could find the time, could you please include a section on language model training as well?

Hi @Ex.L.R.8.Reign, there is a section in the PlayBook on how to create a custom scorer.

Oh yes! I must have missed it. Thank you!


Hi Guys,

I have a question: I have a custom dataset that I’m using for training the acoustic model. I also want to build a custom LM. Should I use the transcriptions used in training the acoustic model for creating the language model as well? Will that lead to bias? What if I use the transcriptions from the validation set?

Thanks!