I wanted to generate my own scorer for Mandarin Chinese, but I could not find a link to the text corpus that was used. Can someone provide that?
As far as I remember, those sources are not under a license that allows redistribution, but @reuben knows better.
@reuben could you please provide the Chinese alphabet file that was used for training the model, so that I can generate my own scorer from some other source?
I don't think the Mandarin model used an alphabet file, since we relied on the UTF-8 bytes output mode, which works without an alphabet.
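To illustrate the idea behind bytes output mode (a minimal sketch, not the DeepSpeech implementation): instead of predicting entries from an `alphabet.txt`, the model predicts raw UTF-8 byte values (0–255), and the decoded byte sequence is reassembled into text. Each CJK character maps to three UTF-8 bytes:

```python
# Sketch of the UTF-8 bytes idea: labels are byte values, not alphabet entries.
text = "作用"
byte_labels = list(text.encode("utf-8"))
print(byte_labels)                          # [228, 189, 156, 231, 148, 168]
print(bytes(byte_labels).decode("utf-8"))   # 作用
```

This is why no alphabet file ships with the Mandarin model: the 256 possible byte values serve as a fixed, language-independent label set.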
I used the command for `generate_lm.py` that is mentioned in the documentation, but it gives me an error. Did you use different parameter values?
(venv) rsandhu@rsandhu-XPS-15-9570:~/deepspeech_v093/app_scorer/x$ python3 /home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py --input_txt /home/rsandhu/deepspeech_v093/app_scorer/x/text-corpus.txt --output_dir /home/rsandhu/deepspeech_v093/app_scorer/x \
> --top_k 500000 --kenlm_bins /home/rsandhu/kenlm/build/bin/ \
> --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
> --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
Converting to lowercase and counting word occurrences ...
| | # | 450000 Elapsed Time: 0:00:02
Saving top 500000 words ...
Calculating word statistics ...
Your text file has 476495 words in total
It has 357646 unique words
Your top-500000 words are 100.0000 percent of all words
Your most common word "作用" occurred 641 times
The least common word in your top-k is "电子菜单的作用" with 1 times
The first word with 2 occurrences is "吃什么可以瘦腹部" at place 68708
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 476495 types 357649
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:4291788 2:2759303424 3:5173694464 4:8277910528 5:12071953408
Statistics:
1 357649 D1=0.992624 D2=1.00211 D3+=0.933904
2 722263 D1=0.866346 D2=1.3764 D3+=2.01893
3 70010/383869 D1=0.758056 D2=1.47425 D3+=2.20213
4 3618/21882 D1=0.776679 D2=1.55498 D3+=2.20043
5 273/1738 D1=0.778841 D2=1.47204 D3+=2.33716
Memory estimate for binary LM:
type MB
probing 27 assuming -p 1.5
probing 33 assuming -r models -p 1.5
trie 17 without quantization
trie 13 assuming -q 8 -b 8 quantization
trie 16 assuming -a 22 array pointer compression
trie 11 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:4291788 2:11556208 3:1400200 4:86832 5:7644
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#**************#####################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:4291788 2:11556208 3:1400200 4:86832 5:7644
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:27822896 kB VmRSS:31852 kB RSSMax:4921480 kB user:1.21332 sys:1.2535 CPU:2.46689 real:2.50022
Filtering ARPA file using vocabulary of top-k words ...
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading /home/rsandhu/deepspeech_v093/app_scorer/x/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******************************************************************************************Traceback (most recent call last):
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 210, in <module>
main()
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 201, in main
build_lm(args, data_lower, vocab_str)
File "/home/rsandhu/deepspeech_v093/DeepSpeech/data/lm/generate_lm.py", line 126, in build_lm
binary_path,
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/rsandhu/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/home/rsandhu/deepspeech_v093/app_scorer/x/lm_filtered.arpa', '/home/rsandhu/deepspeech_v093/app_scorer/x/lm.binary']' died with <Signals.SIGSEGV: 11>.
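The `died with <Signals.SIGSEGV: 11>` line means KenLM's `build_binary` executable itself crashed with a segmentation fault; the Python script only reports the failure. `subprocess` encodes death-by-signal as a negative return code, which can be decoded like this (a small sketch; the error object here is constructed by hand for illustration):

```python
import signal
import subprocess

def describe_failure(err: subprocess.CalledProcessError) -> str:
    # A negative return code means the child process was killed by a signal;
    # a non-negative one is an ordinary exit status.
    if err.returncode < 0:
        return f"killed by {signal.Signals(-err.returncode).name}"
    return f"exited with status {err.returncode}"

# Simulate the failure above: build_binary killed by signal 11 (SIGSEGV).
err = subprocess.CalledProcessError(-11, ["build_binary"])
print(describe_failure(err))  # killed by SIGSEGV
```

So the place to debug is the `build_binary` invocation (and its `-a`/`-q` quantization parameters) rather than `generate_lm.py` itself.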
I have not worked on that. Reuben is busy.
audio file name | audio duration (sec) | inference time, 0.9.3 native client arm64 CPU Android (sec)
---|---|---
who_was_the_first_president_of_america.wav | 2 | 46.5 |
when_is_the_next_long_weekend.wav | 2 | 51.6 |
when_is_the_independence_day.wav | 1 | 34 |
./deepspeech --model /storage/emulated/10/Android/data/com.visteon.sns.app/files/sns/asr/zh-CN/output_graph.tflite \
--scorer /storage/emulated/10/Android/data/com.visteon.sns.app/files/sns/asr/zh-CN/kenlm.scorer \
--beam_width 1024 --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021 -t \
--audio /data/local/tmp/zh_CN_test_data/who_was_the_first_president_of_america.wav
I tried the new Mandarin model and scorer with the default parameters, but the inference time is very high. The hardware is an Android arm64 CPU.
But this is not the case for English:
audio file name (en-US) | audio duration (sec) | inference time, 0.9.3 native client arm64 CPU Android (sec)
---|---|---
2830-3980-0043.wav | 2 | 3.20409 |
4507-16021-0012.wav | 3 | 4.31 |
8455-210777-0068.wav | 3 | 4.177576667 |
./deepspeech --model /home/rsandhu/deepspeech_v093/native_client_093.arm64.cpu.android/deepspeech-0.9.3-models.tflite --scorer /home/rsandhu/deepspeech_v093/native_client_093.arm64.cpu.android/deepspeech-0.9.3-models.scorer --audio /home/rsandhu/Downloads/audio-0.6.1/audio/
This is why it is advertised as experimental …
Hello, I want to ask about the Chinese text format for the language model / scorer in DeepSpeech. Should I separate each character with a space? Sorry, I am not Chinese, but this is for research purposes. Thanks in advance!
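For reference, if you go with character-level modeling (which matches the character/byte-based Mandarin model), the preprocessing is just inserting spaces between characters so KenLM treats each one as a "word". A minimal sketch (word-level segmentation with a tool such as jieba would be the alternative approach):

```python
# Space-separate each non-whitespace character so KenLM sees
# one character per token, e.g. "电子菜单" -> "电 子 菜 单".
def segment_chars(line: str) -> str:
    return " ".join(ch for ch in line.strip() if not ch.isspace())

print(segment_chars("电子菜单的作用"))  # 电 子 菜 单 的 作 用
```

Run every line of the corpus through this before feeding it to `generate_lm.py`; otherwise whole phrases get counted as single "words", as in the word statistics shown earlier in this thread.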