Tune Mozilla DeepSpeech to recognize specific sentences

@sujithvoona2 I think the issue is that the process has changed since the above instructions were written, so what is above won’t work with 0.7.x. As @othiele suggests, it’s best to look over the forum for how to build the scorer. You could also refer to the documentation for your corresponding version here: https://deepspeech.readthedocs.io/en/v0.7.3/

And more specifically, https://deepspeech.readthedocs.io/en/v0.7.3/Scorer.html

Again, you are hijacking older threads. You have to build most of the system yourself: DeepSpeech only does speech to text. The rest is up to you.

Thanks.

https://stackoverflow.com/questions/64183342/commant-to-ddep-speech-by-deep-speech-to-farsi-data-set/64183407#64183407

Hello @nmstoker,

I want to create my own scorer file. When I execute the generate_lm.py script, I get this output:

Building lm.binary ...
./data/lm/lm.binary
Usage: ./kenlm/build/bin/build_binary [-u log10_unknown_probability] [-s] [-i] [-v] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]

-u sets the log10 probability for if the ARPA file does not have one.
Default is -100. The ARPA file will always take precedence.
-s allows models to be built even if they do not have &lt;s&gt; and &lt;/s&gt;.
-i allows buggy models from IRSTLM by mapping positive log probability to 0.
-v disables inclusion of the vocabulary in the binary file.
-w mmap|after determines how writing is done.
mmap maps the binary file and writes to it. Default for trie.
after allocates anonymous memory, builds, and writes. Default for probing.
-r "order1.arpa order2 order3 order4" adds lower-order rest costs from these
model files. order1.arpa must be an ARPA file. All others may be ARPA or
the same data structure as being built. All files must have the same
vocabulary. For probing, the unigrams must be in the same order.

type is either probing or trie. Default is probing.

probing uses a probing hash table. It is the fastest but uses the most memory.
-p sets the space multiplier and must be >1.0. The default is 1.5.

trie is a straightforward trie with bit-level packing. It uses the least
memory and is still faster than SRI or IRST. Building the trie format uses an
on-disk sort to save memory.
-T is the temporary directory prefix. Default is the output file name.
-S determines memory use for sorting. Default is 80%. This is compatible
with GNU sort. The number is followed by a unit: % for percent of physical
memory, b for bytes, K for Kilobytes, M for megabytes, then G,T,P,E,Z,Y.
Default unit is K for Kilobytes.
-q turns quantization on and sets the number of bits (e.g. -q 8).
-b sets backoff quantization bits. Requires -q and defaults to that value.
-a compresses pointers using an array of offsets. The parameter is the
maximum number of bits encoded by the array. Memory is minimized subject
to the maximum, so pick 255 to minimize memory.

-h print this help message.

Get a memory estimate by passing an ARPA file without an output file name.
Traceback (most recent call last):
  File "./data/lm/generate_lm.py", line 211, in &lt;module&gt;
    main()
  File "./data/lm/generate_lm.py", line 202, in main
    build_lm(args, data_lower, vocab_str)
  File "./data/lm/generate_lm.py", line 127, in build_lm
    "./data/lm/lm.binary",
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['./kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'tree', './data/lm/lm_filtered.arpa', './data/lm/lm.binary']' returned non-zero exit status 1.

My lm.arpa is created successfully, but it crashes when creating lm.binary. Do you have any idea about this?

Thanks for your help.

I’d suggest you start by trying to get more details about the error by running the command that creates the lm.binary directly (the script will guide you as to what it’s doing, so take a look in there to figure out what you need to try running in the terminal).
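
For example, the lm.binary step boils down to a direct KenLM build_binary call, roughly like the one below (the flag values mirror the ones in your error output; the paths are whatever your setup uses, so treat this as a sketch rather than the exact command):

./kenlm/build/bin/build_binary -a 255 -q 8 -v trie ./data/lm/lm_filtered.arpa ./data/lm/lm.binary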

With that info you may be able to figure it out right away, or it might be something you can work out by Googling how KenLM works.

Also, I would suggest that you confirm you can generate the official scorer first, before branching off to make your own, because if you know you can use the script to make the official one, you’ll have some confidence that your setup is workable (right now you don’t know for sure that it works and you’ve tried something new with it, so your ability to isolate the problem is reduced). I realise it’s tempting to try your own new thing, and people are often keen to run before they can walk :slightly_smiling_face:
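
For reference, the documented command for rebuilding the official scorer’s language model looks roughly like this (a sketch based on the 0.7.x Scorer docs; it assumes you have downloaded the LibriSpeech normalized LM corpus, librispeech-lm-norm.txt.gz, and built KenLM, so adjust paths to your checkout):

python3 data/lm/generate_lm.py --input_txt librispeech-lm-norm.txt.gz --output_dir . --top_k 500000 --kenlm_bins path/to/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie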

Anyway, I’m sure that by being methodical you can figure it out yourself. Best of luck!

Thanks @nmstoker for your fast reply. I found my error and everything works well now.

Thanks again!

Hey kamil_BENTOUNES, I am getting the same error as you… can you tell me what changes you made to solve it?

Thank you.

Hello @Ahmad_Ali1,

Sorry, I don’t really remember what I did to solve my error! But here are the commands I used to generate the .scorer file:

sudo python3 /path/to/generate_lm.py --input_txt /path/to/vocabulary.txt --output_dir ./path/to/output --top_k 500 --kenlm_bins /path/to/kenlm/build/bin --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

Then:

/path/to/native_client/generate_scorer_package --alphabet /path/to/alphabet.txt --lm /path/to/lm.binary --vocab /path/to/vocabulary.txt --package /path/to/output/use_case_eval.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

I hope it will help you!
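
If it helps as a quick sanity check, you can then point the deepspeech command-line client at the packaged scorer. The model and audio file names below are just placeholders, not from my setup:

deepspeech --model /path/to/output_graph.pbmm --scorer /path/to/output/use_case_eval.scorer --audio /path/to/test.wav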

Thanks for your helpful reply.

Hi everyone,
I am creating a custom language model. I am using DeepSpeech 0.7.4 and I have 1 hour of sound. I created my own scorer file and trained for 500 epochs.
Afterwards I used mic_vad_streaming.py, but my model works incorrectly.

%cd /content/DeepSpeech/

! python3 DeepSpeech.py \
  --train_files /content/drive/MyDrive/sound/train2.csv \
  --dev_files /content/drive/MyDrive/sound/dev.csv \
  --test_files /content/drive/MyDrive/sound/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 500 \
  --checkpoint_dir /content/drive/MyDrive/checkpoint3 \
  --export_dir /content/drive/MyDrive/model \
  --alphabet_config_path /content/drive/MyDrive/files/alphabet.txt \
  --scorer /content/drive/MyDrive/files/kenlm.scorer \
  --learning_rate 0.001

These are my hyperparameters.
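
For context, the streaming example from the DeepSpeech-examples repository is usually launched along these lines (a sketch only: it assumes the exported graph has been converted to the .pbmm format, the paths are placeholders, and you should check the script’s --help for the exact options in your version):

python3 mic_vad_streaming.py -m /path/to/output_graph.pbmm -s /path/to/kenlm.scorer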

The “here” link only leads to the generate_lm.py page on GitHub. There is no demo of how the training was done.

The language model training is covered in the DeepSpeech PlayBook.

Yes, thanks! I was just curious to see a practical demonstration of the procedure.
I will need to dig into this.