Training a Chinese model

I have been using the DeepSpeech code to train a Chinese model for some time. The results are not yet great, but they are much better than before. My experience may be useful to others.

  1. The transcripts of the training data should be separated by spaces into single characters, like “我 的 名 字 是”. The corpus used to train the language model should also be split into characters.
  2. --n_hidden 512 --learning_rate 0.0001 may be a good choice. (After about 20 epochs, the loss on my training set drops below 6.)
  3. The size of my training data is about 700 hours.
  4. The WER on my test data is about 20%.
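The character-level segmentation described in point 1 can be sketched in a few lines of Python (a minimal helper of my own, not part of DeepSpeech itself):

```python
def segment_chars(text):
    """Split a Chinese sentence into space-separated characters,
    the transcript format used both for training and for the LM corpus."""
    return " ".join(ch for ch in text if not ch.isspace())

print(segment_chars("我的名字是"))  # 我 的 名 字 是
```

Running the same helper over every line of the corpus produces LM training data in the matching character-separated format.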

Thanks for taking the time to provide valuable feedback like that! I'm a bit surprised, though, about your geometry of 512: our experiments (on English, so maybe that explains the difference) suggest we need a bigger geometry over a bigger dataset to achieve < 7%.

Are you intending to contribute to Common Voice for Chinese? That might help you get more training data :)

Hello, @jackhuang!
Thank you for sharing.
Can you please describe your training data? Is it noisy or clean, and how many speakers does it have?
Which change gave you the most significant decrease in WER?

I use datasets that include:
“RASC863-G2――六大方言地方普通话语音语料库-朗读部分(粗标库)” ,
“CASIA南方口音语音库”,“CASIA北方口音语音库" and “THCHS30”.
And there are about 3,000 speakers in my training data.
You can find introductions to these datasets at:
2. When I changed the learning rate to 0.0001, the WER decreased a lot.


Hi @jackhuang, have you trained with the “” dataset?
I got this error:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/", line 916, in _bootstrap_inner
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/data/jugs/asr/DeepSpeech/util/", line 146, in _populate_batch_queue
    source = audiofile_to_input_vector(wav_file, self._model_feeder.numcep, self._model_feeder.numcontext)
  File "/data/jugs/asr/DeepSpeech/util/", line 66, in audiofile_to_input_vector
    fs, audio =
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/scipy/io/", line 233, in read
    fid = open(filename, 'rb')
TypeError: integer argument expected, got float

However, all wave files seem to be mono, 16-bit.
I checked with:
[ 20 -467 -825 … -81 -141 -233]
[-161 -151 -151 … 44 126 187]
[-292 -291 -261 … -123 -71 -51]
[-124 -120 -106 … 8 38 29]

and also with the sox command; everything seems to be OK.

Input File     : 'train/C8_749.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:07.88 = 126000 samples ~ 590.625 CDDA sectors
File Size      : 252k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

Is any preprocessing required for those wave files?

Hi @jackhuang, as you mentioned earlier, the transcripts are divided by spaces into characters. Can you tell me what alphabet you used to generate the trie file? I believe your vocabulary contains Chinese characters rather than pinyin for training your language model. Also, can you tell me what ‘n’ you used in your n-gram model?

  1. The alphabet contains about 6,800 characters. It was generated from a corpus of about 10.6 million sentences (about 322.54 million characters) collected from the Internet, so the characters in it seem to be the widely used ones.
  2. Yes, the vocabulary only contains Chinese characters.
  3. I have used both 3-gram and 4-gram models; 4-gram seems better.
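Collecting such a character alphabet from a corpus can be sketched like this (a minimal helper under the assumption of one plain-text sentence per line; the function name is my own):

```python
def collect_alphabet(corpus_lines):
    """Gather every distinct non-whitespace character seen in the
    corpus -- one entry per line of a DeepSpeech alphabet.txt."""
    chars = set()
    for line in corpus_lines:
        chars.update(ch for ch in line if not ch.isspace())
    return sorted(chars)

# a toy corpus of space-separated characters
print(collect_alphabet(["我 的 名 字", "名 字 是"]))
```

Writing the result out one character per line (plus a line containing a single space for the word separator) gives a file in the alphabet.txt format.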

I have used the “THCHS-30” dataset. The audio data can be used without any transformation.
I used version 0.1.1 (TensorFlow 1.4.0, Python 2.7). I guess you need to modify the code that is in charge of reading audio.
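For what it's worth, the earlier TypeError comes from scipy receiving a float where a filename string was expected, which often points at a bad row in the training CSV (for example an empty wav_filename, which pandas reads back as NaN) rather than at the audio itself. A quick sanity check, assuming the standard DeepSpeech CSV layout; the function name is my own:

```python
import csv
import os

def bad_csv_rows(csv_path):
    """Return (line_number, wav_filename) pairs for rows whose wav path
    is empty or points to a missing file. An empty wav_filename becomes
    float('nan') in the pandas-based loader, which is what ends up being
    passed to scipy.io.wavfile.read and raises the TypeError."""
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            path = row.get("wav_filename") or ""
            if not path or not os.path.isfile(path):
                bad.append((line_no, path))
    return bad
```

Running this over train/dev/test CSVs should surface any row that would crash the batch queue.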


Thanks @jackhuang. So the alphabet is going to be huge. Thanks again.

Recently I have used a geometry of 1024. Training is faster (convergence needs fewer epochs), but the model overfits (the WER decreases on some test sets and increases on others).
I wish I could do my job better and share my results sooner. Thanks for your kindness. I have got a lot of help from the DeepSpeech project, thank you very much!


Yes, the decoding process is slow. I set the beam width to 100, and the decoding speed is about 4x (using an Nvidia GTX 1080 Ti). It seems that warp-ctc can solve that problem, but I haven't used it successfully. Maybe you can try it.


I’d guess that a width of 1024 is insufficient for Chinese.

For intuition on widths, see the results here[1] for English, which is an easier case; in particular, the “issue1241 (LSTM BRNN Width)” tab.


A question: my loss stalls at around 230 or 207 and will not converge any further. How did you eventually solve this?

Hi @jackhuang, your alphabet is really huge. It may cost a lot of time during testing.
In my case, I'm using around 100 hours of training data, and it takes around 280 seconds to decode a 10-second audio clip.
Did you do any optimization? Otherwise this is a real problem for me.

Traceback (most recent call last):
  File "./DeepSpeech-0.4.1/util/", line 37, in label_from_string
    return self._str_to_label[string]
KeyError: '我'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./", line 941, in
  File "/home/xu/.local/lib/python3.5/site-packages/tensorflow/python/platform/", line 125, in run
  File "./", line 893, in main
  File "./", line 388, in train
  File "/home/xu/DeepSpeech-0.4.1/util/", line 69, in preprocess
    out_data = pmap(step_fn, source_data.iterrows())
  File "/home/xu/DeepSpeech-0.4.1/util/", line 13, in pmap
    results =, iterable)
  File "/usr/lib/python3.5/multiprocessing/", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/", line 44, in mapstar
    return list(map(*args))
  File "/home/xu/DeepSpeech-0.4.1/util/", line 23, in process_single_file
    transcript = text_to_char_array(file.transcript, alphabet)
  File "/home/xu/DeepSpeech-0.4.1/util/", line 68, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/xu/DeepSpeech-0.4.1/util/", line 68, in
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/xu/DeepSpeech-0.4.1/util/", line 48, in label_from_string
    ).with_traceback(e.__traceback__)
  File "/home/xu/DeepSpeech-0.4.1/util/", line 37, in label_from_string
    return self._str_to_label[string]
KeyError: '\n ERROR: You have characters in your transcripts\n which do not occur in your data/alphabet.txt\n file. Please verify that your alphabet.txt\n contains all neccessary characters. Use\n util/ to see what characters are in\n your train / dev / test transcripts.\n

The training csv is in Chinese. I have double-checked the words; they are all included. Both alphabet.txt and the csv are encoded in UTF-8. What happened?

I have checked the thread “KeyError in self._str_to_label[string] of DeepSpeech/util/ when training own model”, but it did not help.
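One way to narrow this down is to diff the transcript characters against alphabet.txt directly; a stray UTF-8 BOM (\ufeff) at the top of alphabet.txt is a common cause, since it silently turns the first entry into "\ufeff我" so the plain "我" is no longer found. A minimal sketch (the function name is my own):

```python
def missing_chars(transcripts, alphabet_lines):
    """Return the set of transcript characters (spaces excluded) that do
    not appear in the alphabet. Only the trailing newline is stripped,
    so a BOM on the first alphabet line stays visible and poisons that
    entry exactly as it does inside DeepSpeech."""
    alphabet = {line.rstrip("\n") for line in alphabet_lines}
    return {ch for t in transcripts for ch in t
            if ch != " " and ch not in alphabet}

# a BOM hides 我, and 字 is genuinely missing from this toy alphabet
print(sorted(missing_chars(["我 的 名 字"], ["\ufeff我", "的", "名"])))  # ['字', '我']
```

If this reports characters that look like they are in alphabet.txt, inspect the raw bytes of the file (e.g. re-save it as UTF-8 without BOM).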

@myrainbowandsky In future, could you please not double-, triple-, or quadruple-post the same problem?

Hi, I trained on “data_thchs30”. It ran and exported the model file correctly, but it outputs nothing when I use the model for inference. I used the following command:

deepspeech --model /tmp/DeepSpeech/data_thchs30/model_201903201911/output_graph.pbmm \
  --alphabet /tmp/dataset/data_thchs30/csv/alphabet.txt \
  --lm /tmp/data_thchs30/csv/lm.binary \
  --trie /tmp/data_thchs30/csv/trie \
  --audio /tmp/data_thchs30/test/D8_999.wav

Hi Jack,

I am building an RPG AR game. We need a special female voice for our main character, Miss U. Could you help us on the project?

Thank you,
Harry Chen
WeChat: chenyd00
Hp: +86 18050283030

deepspeech --model /tmp/DeepSpeech/data_thchs30/model_201903201911/output_graph.pbmm \
  --alphabet /tmp/dataset/data_thchs30/csv/alphabet.txt \
  --lm /tmp/data_thchs30/csv/lm.binary \
  --trie /tmp/data_thchs30/csv/trie \
  --audio /tmp/data_thchs30/test/D8_999.wav

@huangtianyu Hello, I would like to ask: did you not get any errors when running this command?
When I run it, I get:
Invalid argument: No OpKernel was registered to support Op ‘Slice’ with these attrs. Registered devices: [CPU,GPU], Registered kernels:

     [[{{node Slice}} = Slice[Index=DT_INT32, T=DT_INT32](Shape_1, Slice/begin, Slice/size)]]

I ran the model successfully but got this result:
Test - WER: 1.000000, CER: 5.000000, loss: 0.973997

WER: 1.000000, CER: 5.000000, loss: 0.973997

  • src: "您好在吗 "
  • res: “”

With a space separating each character, I got the same:

Test - WER: 1.000000, CER: 8.000000, loss: 0.967288

WER: 1.000000, CER: 8.000000, loss: 0.967288

  • src: "您 好 在 吗 "
  • res: “”

Is there anything I have missed?