Problem training a Mandarin model

ganlantee · December 23, 2020, 11:39am

I want to train a Mandarin model. My steps are as follows:

I downloaded the Common Voice 6.1 dataset (Simplified Chinese), decompressed it, and made a soft link to it at data/Chinese.
I ran ../bin/import_cv2.py ./Chinese/clips/ to create the CSV files.
I ran python -m deepspeech_training.util.check_characters -csv dev.csv,train-all.csv,train.csv,test.csv,validated.csv,other.csv -unicode -alpha > alphabet.txt to create an alphabet.txt.
I ran python3 DeepSpeech.py --train_files ./data/Chinese/clips/train.csv --dev_files ./data/Chinese/clips/dev.csv --test_files ./data/Chinese/clips/test.csv -epochs 1 --use_allow_growth true --save_checkpoint_dir ./result --alphabet_config_path data/alphabet.txt to train the model. I had replaced the data/alphabet.txt file beforehand.
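For readers following along, the alphabet step above can be pictured with a minimal sketch. It is not the implementation of check_characters; it only assumes that the importer's CSV files have a transcript column and that the alphabet is the set of distinct characters found there. The function name collect_alphabet and the sample rows are made up for illustration.

```python
import csv
import io


def collect_alphabet(csv_file, column="transcript"):
    """Return the sorted list of distinct characters in one CSV column."""
    chars = set()
    for row in csv.DictReader(csv_file):
        chars.update(row[column])  # add every character of the transcript
    return sorted(chars)


# Tiny in-memory stand-in for train.csv (hypothetical rows, not real data).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,去热\n"
    "b.wav,200,热消\n"
)

alphabet = collect_alphabet(sample)
print(len(alphabet), alphabet)
```

With real data, each CSV would be opened with encoding="utf-8" and the per-file results merged into a single set before being written out one character per line.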

But I get an error like this:

ValueError: Cannot feed value of shape (29,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(4884,)'
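One possible reading of the two shapes, offered as an assumption rather than a confirmed diagnosis: if the output layer has one unit per alphabet label plus one extra unit for the CTC blank, then 29 matches a 28-label alphabet (for example space, a to z, and apostrophe) and 4884 matches a 4883-label alphabet. The thread does not establish where the (29,) value came from; a checkpoint created with a different alphabet would be one candidate. The helper output_layer_size below is hypothetical and only does the arithmetic.

```python
def output_layer_size(alphabet_lines):
    """Count labels in alphabet-file lines and add one for the CTC blank.

    Assumption: lines starting with '#' are comments, every other
    non-empty line is exactly one label.
    """
    labels = [
        line for line in alphabet_lines
        if line and not line.startswith("#")
    ]
    return len(labels) + 1


# 28 labels: space, a-z, apostrophe (an English-style alphabet).
english_like = ["# comment", " "] + list("abcdefghijklmnopqrstuvwxyz") + ["'"]
english_size = output_layer_size(english_like)

# 4883 distinct placeholder characters standing in for a Mandarin alphabet.
mandarin_like = [chr(0x4E00 + i) for i in range(4883)]
mandarin_size = output_layer_size(mandarin_like)

print(english_size, mandarin_size)
```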

I read the tutorial tutorial-how-i-trained-a-specific-french-model-to-control-my-robot and thought that maybe I had not set the --lm_binary_path parameter, but I can’t find this parameter when I run ./DeepSpeech.py --helpfull.
I know this is an alphabet.txt error, but my alphabet.txt looks like this:
…
丞
纱
去
热
屈
迄
挠
闵
菠
锹
眼
晨
肤
樽
杂
牟
消
…
It is stored in UTF-8 encoding. Is this alphabet.txt wrong?
I don’t know how to solve this problem.
My DeepSpeech version is v0.9.3.
Could someone help me solve this problem?

othiele · December 23, 2020, 1:13pm

Thanks for opening a new post, happy to help.

Check the docs and search the forum here for UTF-8 mode, which you probably should use for Mandarin.
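As a rough illustration of the idea behind a byte-based (UTF-8) mode, and not of how DeepSpeech implements it: instead of one label per Chinese character, the text is represented as its UTF-8 bytes, so the set of possible labels stays small no matter how many distinct characters the corpus contains. The function names below are hypothetical.

```python
def to_byte_labels(text):
    """Encode text as a list of UTF-8 byte values (integers 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_labels(labels):
    """Decode a list of UTF-8 byte values back into text."""
    return bytes(labels).decode("utf-8")


sentence = "去热消"
labels = to_byte_labels(sentence)

# Three characters become nine byte labels, each in the range 0..255.
print(len(sentence), len(labels))
print(from_byte_labels(labels) == sentence)
```

The trade-off is longer label sequences (three bytes per character here) in exchange for a fixed, small output layer and no alphabet file to maintain.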

ganlantee · December 24, 2020, 9:41am

Thanks, but there is another error when I create the .scorer file.
The command I run is: lm/generate_scorer_package --alphabet alphabet.txt --lm lm.binary --vocab vocab-4883.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

The error is as follows:
4882 unique words read from vocabulary file.
Looks like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: true
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Error loading language model file: Invalid magic in trie header.

I can’t install the built-in kenlm; cmake reports many errors when I build it. So I downloaded the latest version of kenlm from kenlm and installed it. I used this version to create lm.binary with the command build_binary -T -s lm_filtered.arpa lm.binary

What should I do next? How can I solve this problem?
I followed all the steps in the guide External scorer scripts.

othiele · December 24, 2020, 9:45am

Please read the error message, it tells you what to do. And start reading and understanding the docs.

Gone for the holidays, don’t expect an answer in the next couple of days from me.