Training Traditional Chinese for Common Voice using Deep Speech

Hi, I am trying to train a Taiwanese Chinese speech recognition model using the Common Voice dataset. I have finished training, and the loss is around 55 using only this Common Voice dataset. Testing, however, takes a really long time. I think I did something wrong when generating the alphabet for Chinese, which resulted in a very large alphabet. I need your help:

  1. Could anyone provide step-by-step instructions for generating the alphabet correctly for Chinese? I read about UTF-8 in the Deep Speech documentation but could not really understand it.

  2. Do we need to create a language model to train Chinese speech recognition? If yes, how do you generate the language model?

  3. I would prefer to use the Taiwanese datasets from Common Voice. If you have a pretrained Chinese model, it would really help; maybe I could use transfer learning to train on the Taiwanese dataset.

Thank you, and sorry for the newbie questions. I am really stuck at this point.
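
For question 1, a character-level alphabet is essentially the set of distinct characters found in the transcripts, written one per line. Here is a minimal sketch, assuming the usual DeepSpeech CSV layout with a `transcript` column. The helper name `build_alphabet` and the two sample rows are made up for illustration, so treat it as a starting point rather than the official procedure:

```python
import csv
import io


def build_alphabet(csv_files):
    """Collect the distinct characters of the `transcript` column, sorted."""
    chars = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            chars.update(row["transcript"])
    return sorted(chars)


# Hypothetical two-row sample standing in for the real train/dev/test CSVs.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,你好\n"
    "b.wav,1200,你好嗎\n"
)

alphabet = build_alphabet([sample])
alphabet_txt = "\n".join(alphabet) + "\n"  # one label per line
print(len(alphabet), "distinct characters")
```

On a real Chinese corpus this set easily reaches thousands of entries, and every entry becomes one output label of the acoustic model.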

That’s kind of on purpose: this is really experimental until @reuben finishes some things (which are in progress as we speak), so there is little documentation.

What you describe is expected if you use an alphabet with Mandarin and similar languages.

Yes. Please refer to the documentation; the external scorer is covered there.

We don’t have that yet.

I recommend waiting for the upcoming 0.9 release which should make things clearer/easier.

Thank you @lissyx and @reuben, I will wait for the upcoming 0.9 release then. I have already trained on the Taiwanese Common Voice dataset and got a loss of around 55 to 57 in 20 epochs (I know this dataset is too small). When I tried testing and inference, it took a really long time and output nothing. I don't think this is only because the dataset is too small; I believe it is also because the Chinese alphabet I generated is too large, at more than 2000 characters.
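
To put the alphabet size in perspective, here is a rough comparison of per-character labels and UTF-8 byte labels. It is only a sketch with a made-up sentence, not DeepSpeech code: character labels grow with the vocabulary, while byte labels can never need more than 256 distinct values, at the price of longer label sequences (these characters take three bytes each).

```python
text = "今天天氣很好"  # made-up example transcript

char_labels = list(text)                  # one label per character
byte_labels = list(text.encode("utf-8"))  # one label per UTF-8 byte

print("characters:", len(char_labels), "labels,", len(set(char_labels)), "distinct")
print("bytes:     ", len(byte_labels), "labels,", len(set(byte_labels)), "distinct")

# The byte sequence still round-trips to the original text.
restored = bytes(byte_labels).decode("utf-8")
```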

I am glad that work is continuing toward the 0.9 release. When is that version expected? By the way, thank you very much for all your kind help.

Hi, now that v0.9.1 has been released, I tried training with --bytes_output_mode on the Common Voice Taiwanese dataset. Training goes well, but the loss is still above 100. I also used the recommended settings for lm_alpha and lm_beta. Here is how I am training:


> python --train_files ./data/CV/zh-TW/clips/train-all.csv --dev_files ./data/CV/zh-TW/clips/dev.csv --test_files ./data/CV/zh-TW/clips/test.csv --epochs 1 --export_dir ./model_result --train_cudnn true --use_allow_growth true --save_checkpoint_dir ./checkpoint --load_checkpoint_dir ./checkpoint --bytes_output_mode --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021

When it got to testing, the following error occurred:

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: unexpected end of data

What step did I miss? We no longer need an alphabet if we use --bytes_output_mode, right? Thank you for your help.
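
The message itself can be reproduced in plain Python, which hints at what is going on: 0xe5 is the lead byte of a three-byte UTF-8 sequence (for example 台), and decoding fails when the continuation bytes are missing. The sketch below assumes the model emitted an incomplete byte sequence; it does not go through the DeepSpeech code path itself.

```python
full = "台".encode("utf-8")  # three bytes: e5 8f b0
truncated = full[:1]         # only the lead byte 0xe5 survives

message = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```

A poorly trained model in bytes output mode can plausibly produce such partial sequences, since nothing forces its predicted bytes to form complete characters.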


I get the same error.


Hi yang_jiao, have you found a way to handle this error? Thank you.

Use s.decode('utf-8', 'ignore')


How do you change the decoding in the training process? After the epochs end, it goes on to testing if a test set is specified, and that is where I got the error.

Just don’t give a test set, then it doesn’t test :) And you don’t get the error.

Change the ds_ctcdecoder Decode function:

def Decode(self, input):
    '''Decode a sequence of labels into a string.'''
    res = super(UTF8Alphabet, self).Decode(input)
    # 'ignore' drops bytes that do not form a complete UTF-8 sequence
    return res.decode('utf-8', 'ignore')
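
For anyone wondering what the 'ignore' handler actually does, here it is on plain bytes: undecodable bytes are silently dropped, so testing no longer crashes, but the bad output is hidden rather than fixed. The 'replace' handler is shown only as an alternative that I would assume is handier for debugging, since it leaves a visible U+FFFD marker. The sample string is made up.

```python
data = "台灣".encode("utf-8")[:-1]  # the last character is cut short

ignored = data.decode("utf-8", "ignore")    # incomplete bytes are dropped
replaced = data.decode("utf-8", "replace")  # incomplete bytes become U+FFFD

print(repr(ignored))
print(repr(replaced))
```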


Hi, I followed your suggestion and got my .pbmm model, but when I try to run inference, it takes a really long time and outputs nothing. I trained on the Taiwanese dataset from Common Voice.

Do you face this kind of problem too? I am sure the dataset may not be enough to produce a good result, but it should at least output something, right?

If you train for just one epoch, this is to be expected. Train for 15-20 epochs to get somewhat good results.

I am sorry, that was actually just an example. I have already tried training for around 15 epochs, and the loss still decreases a little (maybe that is acceptable, even though my model still outputs nothing). However, inference still takes a long time, whereas inference with deepspeech-0.9.1-models-zh-CN.pbmm does not take that long.

What about the inference time? Is it also because the dataset is still not large enough, or am I still missing something needed to use bytes output mode?

It is hard to help if you don’t give us information. We are not magicians.

As lissyx said above, this is still experimental and I guess your latency stems from there if a regular model runs a lot faster.

Supply us with much more info on what your current setup is and we can give you better answers.

This is my setup for training:
RTX 2080 Super with Max Q Design

I trained using Common Voice Taiwanese Dataset:
VERSION : zh-TW_73h_2020-06-22

Here is how I trained:

> python --train_files ./data/CV/zh-TW/clips/train-all.csv --dev_files ./data/CV/zh-TW/clips/dev.csv --test_files ./data/CV/zh-TW/clips/test.csv --epochs 15 --export_dir ./model_result --train_cudnn true --use_allow_growth true --save_checkpoint_dir ./checkpoint --load_checkpoint_dir ./checkpoint --bytes_output_mode --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021

The latest best validation loss I got was 104.5232.

Here is my inference result:

Loading model from file model_result/output_graph.pbmm
Running inference.
Inference took 323.130s for 3.552s audio file.

Here is the inference result using deepspeech-0.9.1-models-zh-CN.pbmm:

Loading model from file model_result/deepspeech-0.9.1-models-zh-CN.pbmm
Loaded model in 0.00576s.
Running inference.
Inference took 6.231s for 3.552s audio file.
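
To make the two logs easier to compare, the timings can be turned into a real-time factor (inference time divided by audio duration). This is just arithmetic on the numbers reported above; the helper name `real_time_factor` is made up.

```python
def real_time_factor(inference_s, audio_s):
    """Seconds of compute spent per second of audio."""
    return inference_s / audio_s


custom = real_time_factor(323.130, 3.552)  # my output_graph.pbmm
zh_cn = real_time_factor(6.231, 3.552)     # deepspeech-0.9.1-models-zh-CN.pbmm

print(f"custom model: {custom:.1f}x real time")
print(f"zh-CN model:  {zh_cn:.1f}x real time")
print(f"slowdown:     {custom / zh_cn:.0f}x")
```

So my model needs roughly 91 seconds per second of audio, against about 1.75 for the released zh-CN model.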


I am sorry for not providing much detail before. Thank you very much for your help.

This is indeed way too long for any use. Not sure why that happens; @lissyx, do you have an idea?

Without even sharing the command line used to run the inference, nor the model sizes, I have no idea. Not even to mention that @reuben worked on that part, not me.

Hi, this is how I am doing inference:

> deepspeech --model model_result/output_graph.pbmm --audio data/CV/zh-TW/clips/common_voice_zh-TW_20290083.wav

My .pbmm’s size is 209.5 MB.