I wanted to generate my own scorer for Mandarin Chinese, but I could not find a link to the text corpus that was used. Can someone provide that?
As far as I remember, those sources are not under a license that allows redistribution, but @reuben knows better.
@reuben could you please provide the Chinese alphabet file that was used for training the model, so that I can generate my own scorer from some other source?
I don't think the Mandarin model used an alphabet file, since we relied on the UTF-8 bytes output mode, which works without an alphabet.
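To illustrate the idea behind bytes output mode (a minimal sketch, not the DeepSpeech implementation): instead of predicting entries from an `alphabet.txt`, the model predicts raw UTF-8 byte values (0–255), and the decoded byte sequence is reassembled into text. Each CJK character maps to three UTF-8 bytes:

```python
# Sketch of the UTF-8 bytes idea: labels are byte values, not alphabet entries.
text = "作用"
byte_labels = list(text.encode("utf-8"))
print(byte_labels)                          # [228, 189, 156, 231, 148, 168]
print(bytes(byte_labels).decode("utf-8"))   # 作用
```

This is why no alphabet file ships with the Mandarin model: the 256 possible byte values serve as a fixed, language-independent label set.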
I used the command for `generate_lm.py` that is mentioned in the documentation, but it gives me an error. Did you use different parameter values?
(venv) rsandhu@rsandhu-XPS-15-9570:~/deepspeech_v093/app_scorer/x$ python3 /home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py --input_txt /home/rsandhu/deepspeech_v093/app_scorer/x/text-corpus.txt --output_dir /home/rsandhu/deepspeech_v093/app_scorer/x \
> --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
> --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
> --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
Converting to lowercase and counting word occurrences ...
| | # | 450000 Elapsed Time: 0:00:02
Saving top 500000 words ...
Calculating word statistics ...
Your text file has 476495 words in total
It has 357646 unique words
Your top-500000 words are 100.0000 percent of all words
Your most common word "作用" occurred 641 times
The least common word in your top-k is "电子菜单的作用" with 1 times
The first word with 2 occurrences is "吃什么可以瘦腹部" at place 68708
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 476495 types 357649
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:4291788 2:2759303424 3:5173694464 4:8277910528 5:12071953408
Statistics:
1 357649 D1=0.992624 D2=1.00211 D3+=0.933904
2 722263 D1=0.866346 D2=1.3764 D3+=2.01893
3 70010/383869 D1=0.758056 D2=1.47425 D3+=2.20213
4 3618/21882 D1=0.776679 D2=1.55498 D3+=2.20043
5 273/1738 D1=0.778841 D2=1.47204 D3+=2.33716
Memory estimate for binary LM:
type MB
probing 27 assuming -p 1.5
probing 33 assuming -r models -p 1.5
trie 17 without quantization
trie 13 assuming -q 8 -b 8 quantization
trie 16 assuming -a 22 array pointer compression
trie 11 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:4291788 2:11556208 3:1400200 4:86832 5:7644
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#**************#####################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:4291788 2:11556208 3:1400200 4:86832 5:7644
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:27822896 kB VmRSS:31852 kB RSSMax:4921480 kB user:1.21332 sys:1.2535 CPU:2.46689 real:2.50022
Filtering ARPA file using vocabulary of top-k words ...
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******************************************************************************************Traceback (most recent call last):
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 210, in <module>
main()
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 201, in main
build_lm(args, data_lower, vocab_str)
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 126, in build_lm
binary_path,
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/rsandhu/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/home/rsandhu/deepspeech_v093/app_scorer/x/lm_filtered.arpa', '/home/rsandhu/deepspeech_v093/app_scorer/x/lm.binary']' died with <Signals.SIGSEGV: 11>.
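The `died with <Signals.SIGSEGV: 11>` line means KenLM's `build_binary` executable itself crashed with a segmentation fault; the Python script only reports the failure. `subprocess` encodes death-by-signal as a negative return code, which can be decoded like this (a small sketch; the error object here is constructed by hand for illustration):

```python
import signal
import subprocess

def describe_failure(err: subprocess.CalledProcessError) -> str:
    # A negative return code means the child process was killed by a signal;
    # a non-negative one is an ordinary exit status.
    if err.returncode < 0:
        return f"killed by {signal.Signals(-err.returncode).name}"
    return f"exited with status {err.returncode}"

# Simulate the failure above: build_binary killed by signal 11 (SIGSEGV).
err = subprocess.CalledProcessError(-11, ["build_binary"])
print(describe_failure(err))  # killed by SIGSEGV
```

So the place to debug is the `build_binary` invocation (and its `-a`/`-q` quantization parameters) rather than `generate_lm.py` itself.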
I have not worked on that. Reuben is busy.
audio file name | audio duration (sec) | inference time, 0.9.3 native client arm64 CPU Android (sec)
---|---|---
who_was_the_first_president_of_america.wav | 2 | 46.5 |
when_is_the_next_long_weekend.wav | 2 | 51.6 |
when_is_the_independence_day.wav | 1 | 34 |
./deepspeech --model /storage/emulated/10/Android/data/com.visteon.sns.app/files/sns/asr/zh-CN/output_graph.tflite \
--scorer /storage/emulated/10/Android/data/com.visteon.sns.app/files/sns/asr/zh-CN/kenlm.scorer \
--beam_width 1024 --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021 -t \
--audio /data/local/tmp/zh_CN_test_data/who_was_the_first_president_of_america.wav
I tried the new Mandarin model and scorer with the default parameters, but the inference time is very high. The hardware is an Android arm64 CPU.
But this is not the case for English:
audio file name (en-US) | audio duration (sec) | inference time, 0.9.3 native client arm64 CPU Android (sec)
---|---|---
2830-3980-0043.wav | 2 | 3.20409 |
4507-16021-0012.wav | 3 | 4.31 |
8455-210777-0068.wav | 3 | 4.177576667 |
./deepspeech --model /home/rsandhu/deepspeech_v093/native_client_093.arm64.cpu.android/deepspeech-0.9.3-models.tflite --scorer /home/rsandhu/deepspeech_v093/native_client_093.arm64.cpu.android/deepspeech-0.9.3-models.scorer --audio /home/rsandhu/Downloads/audio-0.6.1/audio/
This is why it is advertised as experimental …
Hello, I want to ask about the Chinese text format for the language model / scorer in DeepSpeech. Should I separate each character with a space? Sorry, I am not Chinese, but this is for research purposes. Thanks in advance!
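For reference, if you go with character-level modeling (which matches the character/byte-based Mandarin model), the preprocessing is just inserting spaces between characters so KenLM treats each one as a "word". A minimal sketch (word-level segmentation with a tool such as jieba would be the alternative approach):

```python
# Space-separate each non-whitespace character so KenLM sees
# one character per token, e.g. "电子菜单" -> "电 子 菜 单".
def segment_chars(line: str) -> str:
    return " ".join(ch for ch in line.strip() if not ch.isspace())

print(segment_chars("电子菜单的作用"))  # 电 子 菜 单 的 作 用
```

Run every line of the corpus through this before feeding it to `generate_lm.py`; otherwise whole phrases get counted as single "words", as in the word statistics shown earlier in this thread.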