Error when testing the model on test.csv while training a zh-CN (Chinese) model

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository) : No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04) : Linux Ubuntu 18.04.1 LTS
  • TensorFlow installed from (our builds, or upstream TensorFlow) : Using Docker
  • TensorFlow version (use command below) : Using Docker
  • Python version : Using Docker
  • Bazel version (if compiling from source) : Docker
  • GCC/Compiler version (if compiling from source) : Docker
  • CUDA/cuDNN version : Docker
  • GPU model and memory : GeForce GTX 1650/PCIe/SSE2
  • Exact command to reproduce : Provided below

This is my command:

root@e658b51810f6:/DeepSpeech# python3 DeepSpeech.py \
  --train_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/train.csv \
  --dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/dev.csv \
  --test_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/test.csv \
  --checkpoint_dir deepspeech-data/checkpoints \
  --export_dir deepspeech-data/exported-model \
  --n_hidden 256 \
  --reduce_lr_on_plateau true --plateau_epochs 8 --plateau_reduction 0.08 \
  --early_stop true --es_epochs 10 --es_min_delta 0.06 \
  --dropout_rate 0.4 --bytes_output_mode --automatic_mixed_precision \
  --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128 \
  --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021 \
  --epochs 1

The error log I received:

Testing model on deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
test()
File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
samples = evaluate(FLAGS.test_files.split(','), create_model)
File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
samples.extend(run_test(init_op, dataset=csv))
File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
for beam_results in batch_beam_results
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
for beam_results in batch_beam_results
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
[(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte

After searching, I found this solution from @yang_jiao:

Change the Decode function in ds_ctcdecoder's __init__.py:

def Decode(self, input):
    '''Decode a sequence of labels into a string.'''
    res = super(UTF8Alphabet, self).Decode(input)
    return res.decode('utf-8', 'ignore')
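For context, the failure and the effect of the 'ignore' error handler can be reproduced standalone; the byte values below are illustrative, not taken from the actual decoder output:

```python
# 0xe5 opens a 3-byte UTF-8 sequence; followed by ASCII it is invalid.
broken = b"\xe5abc"
try:
    broken.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte

# errors="ignore" silently drops undecodable bytes instead of raising,
# so the decoder's partial byte output becomes a (possibly truncated) string.
print(b"\xe5\xa5\xbd\xe5".decode("utf-8", "ignore"))
```

Note that the 'ignore' patch hides the symptom rather than fixing whatever makes the decoder emit invalid byte sequences in the first place.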

But when I tried to edit that file inside the Docker container, I found it was empty:

root@e658b51810f6:/DeepSpeech# cd /usr/local/lib/python3.6/dist-packages/ds_ctcdecoder
root@e658b51810f6:/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder# vim init.py

How can I solve this problem?

By the way, another suggested solution is to use a scorer, so I used the zh-CN scorer:

root@e658b51810f6:/DeepSpeech# python3 DeepSpeech.py --train_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/train.csv --dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/dev.csv --test_files deepspeech-data/cv-corpus-6.1-2020-12-11/zh-CN/clips/test.csv --checkpoint_dir deepspeech-data/checkpoints --export_dir deepspeech-data/exported-model --n_hidden 256 --reduce_lr_on_plateau true --plateau_epochs 8 --plateau_reduction 0.08 --early_stop true --es_epochs 10 --es_min_delta 0.06 --dropout_rate 0.4 --bytes_output_mode --automatic_mixed_precision --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128 --lm_alpha 0.6940122363709647 --lm_beta 4.777924224113021 --epochs 1
--scorer_path deepspeech-data
--scorer deepspeech-0.9.3-models-zh-CN.scorer

The error I received:

Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 949, in main
early_training_checks()
File "/DeepSpeech/training/deepspeech_training/train.py", line 934, in early_training_checks
FLAGS.scorer_path, Config.alphabet)
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 36, in __init__
raise ValueError('Scorer initialization failed with error code 0x{:X}'.format(err))
ValueError: Scorer initialization failed with error code 0x2005

Can you fix your console output and use proper formatting for easier reading?

Fine…
The scorer problem happened because of a wrong scorer path.
The right path is:

--scorer_path deepspeech-data/deepspeech-0.9.3-models-zh-CN.scorer
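In case it helps others hitting error 0x2005: a quick pre-flight check that the flag points at the .scorer file itself rather than its parent directory (paths taken from this thread) could look like this sketch; it is not part of DeepSpeech:

```python
from pathlib import Path

scorer = Path("deepspeech-data/deepspeech-0.9.3-models-zh-CN.scorer")
# --scorer_path must name the .scorer file itself; passing only the
# directory (as in the failing command above) cannot be opened as a scorer.
print("scorer file exists:", scorer.is_file())
```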

But I still don’t understand why the __init__.py file is empty.

This is likely the wrong solution.

I don’t know

From https://github.com/mozilla/DeepSpeech/blob/7450e5763b2af8f6804205c60b8fd9a0b4cec7db/native_client/ctcdecode/__init__.py#L91-L138: you need to explain how you produced the scorer, because it seems it’s just wrong.

Thank you for your answer.
I just succeeded in using the scorer, and the UnicodeDecodeError problem was solved.

I appreciate your time.

I’m not sure I get your point here: have you figured out the problem? Or are you still trying to fix it?

If it still needs a fix, then you need to document how you built your scorer.

I am still trying to fix it.
I tried using the deepspeech-0.9.3-models-zh-CN.scorer provided by the DeepSpeech GitHub releases.
But sometimes it works, and sometimes it causes other errors.
For example:

File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: unexpected end of data
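This "unexpected end of data" variant can also be reproduced standalone: it is what Python raises when a multi-byte UTF-8 sequence is cut off mid-character (the bytes below are illustrative, not the actual decoder output):

```python
text = "你好".encode("utf-8")  # 6 bytes: 3 per Chinese character
truncated = text[:5]           # slices the second character in half
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # unexpected end of data
```

This suggests the byte-mode decoder sometimes produces a candidate transcript that ends in an incomplete UTF-8 sequence.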

Above is the error I can’t fix.

I think the error is related to the scorer, which I should build myself.
So what dataset should I use to build my scorer?
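For reference, DeepSpeech 0.9.x documents a two-step scorer build: data/lm/generate_lm.py to produce a KenLM model, then the generate_scorer_package native tool to package it. The sketch below follows that flow; the corpus file name, output paths, and hyperparameter values are placeholders, and since the training command above uses --bytes_output_mode the scorer has to be packaged in bytes mode too:

```shell
# Step 1: build a KenLM language model from a plain-text corpus
# (one sentence per line, e.g. the zh-CN Common Voice transcripts
# plus any larger Chinese text corpus you have).
python3 data/lm/generate_lm.py \
  --input_txt zh_corpus.txt \
  --output_dir lm_out \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Step 2: package the LM into a .scorer; bytes mode must match training.
./generate_scorer_package \
  --lm lm_out/lm.binary \
  --vocab lm_out/vocab-500000.txt \
  --package zh.scorer \
  --default_alpha 0.69 \
  --default_beta 4.78 \
  --force_bytes_output_mode true
```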

As I said, please share your build steps.

It should help you.