Tune Mozilla DeepSpeech to recognize specific sentences

nmstoker · September 15, 2020, 6:12pm

@Yugandhar_Gantala your post doesn’t seem to have enough information to investigate further. Can you give a bit more detail on what you’re actually doing, which versions you’re using, your environment, etc.? It looks like you’ve called some code without passing the required parameters.

Imagine I can’t see what you’re doing (because I cannot).

othiele · September 16, 2020, 9:18am

Please search the forum and, if you post, give us more to work with. There are several posts on building the scorer.

sujithvoona2 · September 16, 2020, 10:08am

Hey @nmstoker,
I have created my own vocabulary.txt file and I want to train the DeepSpeech 0.7.3 pretrained model on my own vocabulary. I followed the steps you mentioned above. Now I am trying to generate the output files (lm.binary, words.arpa, trie), but while generating them I get an error saying that these arguments are required: “--input_txt, --output_dir, --top_k, --kenlm_bins, --arpa_order, --max_arpa_memory, --arpa_prune, --binary_a_bits, --binary_q_bits, --binary_type”.
What is generate_trie for? We will be getting a trie output file, right? Isn’t that enough to train the model on vocabulary.txt?

nmstoker · September 16, 2020, 10:27am

@sujithvoona2 I think the issue is that since writing the above instructions the process has changed, so what is above won’t work with 0.7.x - as @othiele suggests, it’s best to look over the forum for how to build the scorer. You could also refer to the documentation for your corresponding version here: https://deepspeech.readthedocs.io/en/v0.7.3/

lissyx · September 16, 2020, 10:29am

And more specifically, https://deepspeech.readthedocs.io/en/v0.7.3/Scorer.html

reza · October 2, 2020, 8:51am

othiele · October 3, 2020, 12:07pm

Again, you hijack older threads. You have to build most of the system yourself. DeepSpeech only does speech to text. The rest is your part.

reza · October 3, 2020, 12:14pm

Thanks.

https://stackoverflow.com/questions/64183342/commant-to-ddep-speech-by-deep-speech-to-farsi-data-set/64183407#64183407

kamil_BENTOUNES · March 10, 2021, 3:35pm

Hello @nmstoker,

I want to create my own scorer file. When executing the generate_lm.py script, I have this output:

Building lm.binary ...
./data/lm/lm.binary
Usage: ./kenlm/build/bin/build_binary [-u log10_unknown_probability] [-s] [-i] [-v] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]

-u sets the log10 probability for if the ARPA file does not have one.
Default is -100. The ARPA file will always take precedence.
-s allows models to be built even if they do not have <s> and </s>.
-i allows buggy models from IRSTLM by mapping positive log probability to 0.
-v disables inclusion of the vocabulary in the binary file.
-w mmap|after determines how writing is done.
mmap maps the binary file and writes to it. Default for trie.
after allocates anonymous memory, builds, and writes. Default for probing.
-r “order1.arpa order2 order3 order4” adds lower-order rest costs from these
model files. order1.arpa must be an ARPA file. All others may be ARPA or
the same data structure as being built. All files must have the same
vocabulary. For probing, the unigrams must be in the same order.

type is either probing or trie. Default is probing.

probing uses a probing hash table. It is the fastest but uses the most memory.
-p sets the space multiplier and must be >1.0. The default is 1.5.

trie is a straightforward trie with bit-level packing. It uses the least
memory and is still faster than SRI or IRST. Building the trie format uses an
on-disk sort to save memory.
-T is the temporary directory prefix. Default is the output file name.
-S determines memory use for sorting. Default is 80%. This is compatible
with GNU sort. The number is followed by a unit: % for percent of physical
memory, b for bytes, K for Kilobytes, M for megabytes, then G,T,P,E,Z,Y.
Default unit is K for Kilobytes.
-q turns quantization on and sets the number of bits (e.g. -q 8).
-b sets backoff quantization bits. Requires -q and defaults to that value.
-a compresses pointers using an array of offsets. The parameter is the
maximum number of bits encoded by the array. Memory is minimized subject
to the maximum, so pick 255 to minimize memory.

-h print this help message.

Get a memory estimate by passing an ARPA file without an output file name.
Traceback (most recent call last):
  File "./data/lm/generate_lm.py", line 211, in <module>
    main()
  File "./data/lm/generate_lm.py", line 202, in main
    build_lm(args, data_lower, vocab_str)
  File "./data/lm/generate_lm.py", line 127, in build_lm
    "./data/lm/lm.binary",
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['./kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'tree', './data/lm/lm_filtered.arpa', './data/lm/lm.binary']' returned non-zero exit status 1.

My lm.arpa is created successfully, but the script crashes when creating lm.binary. Do you have any idea what could be causing this?

Thanks for your help.

nmstoker · March 10, 2021, 6:38pm

I’d suggest you start by getting more details about the error: run the command that creates lm.binary directly. The script shows what it is doing, so take a look in there to figure out what you need to run in the terminal.

With that info you may be able to figure it out right away, or it might be something you can solve by Googling how KenLM works.
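As a rough sketch of that advice, the snippet below rebuilds the build_binary argument list shown in the traceback above and checks the type argument before anything is run. The usage text quoted earlier says the type is either probing or trie, while the failing command passes 'tree'; that may be the cause here, although the thread never confirms it. The function name build_binary_command and its parameters are hypothetical and are not part of generate_lm.py.

```python
import shlex

# Types named in the build_binary usage text quoted earlier in the thread.
VALID_TYPES = ("probing", "trie")


def build_binary_command(kenlm_bins, arpa_path, binary_path,
                         binary_type, a_bits=255, q_bits=8):
    """Return the build_binary argument list (hypothetical helper).

    Mirrors the command in the traceback. Note that the usage text lists
    -q and -a under the trie type, so this layout is aimed at "trie".
    """
    if binary_type not in VALID_TYPES:
        raise ValueError(
            f"binary_type must be one of {VALID_TYPES}, got {binary_type!r}"
        )
    return [
        f"{kenlm_bins}/build_binary",
        "-a", str(a_bits),
        "-q", str(q_bits),
        "-v",
        binary_type,
        arpa_path,
        binary_path,
    ]


if __name__ == "__main__":
    cmd = build_binary_command(
        "./kenlm/build/bin",
        "./data/lm/lm_filtered.arpa",
        "./data/lm/lm.binary",
        "trie",
    )
    # Print a copy-pasteable command line to run by hand in a terminal.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running the printed command by hand shows KenLM's own error output without the Python traceback around it, which is usually easier to read.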

Also I would suggest that you confirm you can generate the official scorer first, before branching off to make your own one, because if you know you can use the script to make the official one you’ll have some confidence your setup is workable (right now you don’t know for sure it works and you’ve tried something new with it, so your ability to isolate your problem is reduced). I realise it’s tempting to try your own new thing, and people are often keen to run before they can walk.

Anyway, I’m sure that by being methodical you can figure it out yourself. Best of luck!

kamil_BENTOUNES · March 11, 2021, 8:18am

Thanks @nmstoker for your fast reply. I found my error and everything works well now.

Thanks again!

Ahmad_Ali1 · October 5, 2021, 7:20am

Hey @kamil_BENTOUNES, I am getting the same error as you. Can you tell me what changes you made to solve it?

Thank you.

kamil_BENTOUNES · October 5, 2021, 11:41am

Hello @Ahmad_Ali1,

Sorry, I don’t really remember what I did to solve my error! But here are the commands I used to generate the .scorer file:

sudo python3 /path/to/generate_lm.py --input_txt /path/to/vocabulary.txt --output_dir ./path/to/output --top_k 500 --kenlm_bins /path/to/kenlm/build/bin --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

Then:

/path/to/native_client/generate_scorer_package --alphabet /path/to/alphabet.txt --lm /path/to/lm.binary --vocab /path/to/vocabulary.txt --package /path/to/output/use_case_eval.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

I hope this helps!
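The two commands above could also be driven from Python, roughly as sketched below. Every flag and value is copied from those commands; the helper names generate_lm_args, scorer_package_args and build_scorer are hypothetical, and the paths are the same placeholders used above. Passing arguments as a list avoids shell quoting for values such as "0|0|1" and "85%".

```python
import subprocess


def generate_lm_args(script, input_txt, output_dir, kenlm_bins):
    """Argument list for step 1 (generate_lm.py), values as in the post."""
    return [
        "python3", script,
        "--input_txt", input_txt,
        "--output_dir", output_dir,
        "--top_k", "500",
        "--kenlm_bins", kenlm_bins,
        "--arpa_order", "5",
        "--max_arpa_memory", "85%",
        "--arpa_prune", "0|0|1",
        "--binary_a_bits", "255",
        "--binary_q_bits", "8",
        "--binary_type", "trie",
        "--discount_fallback",
    ]


def scorer_package_args(tool, alphabet, lm_binary, vocab, package):
    """Argument list for step 2 (generate_scorer_package)."""
    return [
        tool,
        "--alphabet", alphabet,
        "--lm", lm_binary,
        "--vocab", vocab,
        "--package", package,
        "--default_alpha", "0.931289039105002",
        "--default_beta", "1.1834137581510284",
    ]


def build_scorer(lm_args, package_args):
    """Run both steps in order; check_call raises if either step fails."""
    subprocess.check_call(lm_args)
    subprocess.check_call(package_args)


if __name__ == "__main__":
    # Placeholder paths, exactly as in the commands above.
    lm = generate_lm_args(
        "/path/to/generate_lm.py", "/path/to/vocabulary.txt",
        "./path/to/output", "/path/to/kenlm/build/bin",
    )
    pkg = scorer_package_args(
        "/path/to/native_client/generate_scorer_package",
        "/path/to/alphabet.txt", "/path/to/lm.binary",
        "/path/to/vocabulary.txt", "/path/to/output/use_case_eval.scorer",
    )
    print(lm)
    print(pkg)
```

With real paths filled in, build_scorer(lm, pkg) would run the two steps one after the other and stop at the first failure.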

Ahmad_Ali1 · October 6, 2021, 12:54am

Thanks for your helpful reply

Merdan_Bazarow · December 6, 2021, 10:25am

Hi everyone,
I am creating a custom language model. I am using DeepSpeech 0.7.4 and I have 1 hour of audio. I created my own scorer file and trained for 500 epochs.
Afterwards I used mic_vad_streaming.py, but my model did not work correctly.

Merdan_Bazarow · December 6, 2021, 10:26am

%cd /content/DeepSpeech/

! python3 DeepSpeech.py \
  --train_files /content/drive/MyDrive/sound/train2.csv \
  --dev_files /content/drive/MyDrive/sound/dev.csv \
  --test_files /content/drive/MyDrive/sound/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 500 \
  --checkpoint_dir /content/drive/MyDrive/checkpoint3 \
  --export_dir /content/drive/MyDrive/model \
  --alphabet_config_path /content/drive/MyDrive/files/alphabet.txt \
  --scorer /content/drive/MyDrive/files/kenlm.scorer \
  --learning_rate 0.001

Merdan_Bazarow · December 6, 2021, 10:26am

These are my hyperparameters.

joshoreefe · March 24, 2023, 5:21pm

The “here” link only leads to the generate_lm.py page on GitHub. There is no demo of how the training was done.

kathyreid · March 25, 2023, 6:47am

The language model training is covered in the DeepSpeech Playbook.

joshoreefe · March 25, 2023, 10:40am

Yes, thanks! I was just curious to see a practical demonstration of the procedure.
I will need to dig into this.