Update on this -
Seems like these guys were having the same issue - Training Traditional Chinese for Common Voice using Deep Speech - #17 by othiele
I used their ‘ignore’ solution and added a few debug statements to /usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py:
    def Decode(self, input):
        '''Decode a sequence of labels into a string.'''
        res = super(UTF8Alphabet, self).Decode(input)
        print("utf8 Decode function called")
        print(res)
        return res.decode('utf-8','ignore')
My test CSV has only one record.
When I run the test, the following logs get printed:
    root@6e061f9543ba:/DeepSpeech# python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 1 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/
    I Loading best validating checkpoint from /home/anon/Downloads/jaSTTDatasets/checkpoint/best_dev-13477
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
    I Loading variable from checkpoint: global_step
    I Loading variable from checkpoint: layer_1/bias
    I Loading variable from checkpoint: layer_1/weights
    I Loading variable from checkpoint: layer_2/bias
    I Loading variable from checkpoint: layer_2/weights
    I Loading variable from checkpoint: layer_3/bias
    I Loading variable from checkpoint: layer_3/weights
    I Loading variable from checkpoint: layer_5/bias
    I Loading variable from checkpoint: layer_5/weights
    I Loading variable from checkpoint: layer_6/bias
    I Loading variable from checkpoint: layer_6/weights
    Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
    Test epoch | Steps: 0 | Elapsed Time: 0:00:00
    utf8 Decode function called
    b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
    utf8 Decode function called
    b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'
    Test epoch | Steps: 1 | Elapsed Time: 0:00:21
    Test on /home/anon/Downloads/jaSTTDatasets/final-test.csv - WER: 1.000000, CER: 0.928571, loss: 116.681183
    --------------------------------------------------------------------------------
    Best WER:
    --------------------------------------------------------------------------------
    WER: 1.000000, CER: 0.928571, loss: 116.681183
     - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
     - src: "この料理は卵を二個使います。"
     - res: "か。"
    --------------------------------------------------------------------------------
    Median WER:
    --------------------------------------------------------------------------------
    WER: 1.000000, CER: 0.928571, loss: 116.681183
     - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
     - src: "この料理は卵を二個使います。"
     - res: "か。"
    --------------------------------------------------------------------------------
    Worst WER:
    --------------------------------------------------------------------------------
    WER: 1.000000, CER: 0.928571, loss: 116.681183
     - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
     - src: "この料理は卵を二個使います。"
     - res: "か。"
    --------------------------------------------------------------------------------
When I run the bytes from the first call through a UTF-8 decoder, they are invalid: the two-byte prefix \xe3\x81 repeats several times without the third byte that should follow it. The bytes from the second call decode cleanly to この料理は卵を二個使います。, which matches the transcript in my CSV.
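For anyone who wants to reproduce that check without an online decoder, here is a minimal Python sketch. The two byte strings are copied from the debug prints in the log; the helper name `try_strict` is just something I made up for illustration and is not part of ds_ctcdecoder.

```python
# Byte strings copied verbatim from the two debug prints in the log.
first_call = b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
second_call = b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'


def try_strict(raw):
    """Hypothetical helper: return the strictly decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None


# Strict decoding fails on the first call because of the truncated sequences.
first_strict = try_strict(first_call)
# With errors='ignore' the truncated sequences are silently dropped.
first_lenient = first_call.decode('utf-8', 'ignore')
# The second call is well-formed UTF-8 and decodes without any error handling.
second_strict = try_strict(second_call)

print(first_strict)
print(first_lenient)
print(second_strict)
```

The lenient decode of the first call leaves only the characters whose byte sequences are complete, which is the same string the test report shows as res.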
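I am assuming the function is called once to decode the output predicted by the model and a second time to decode the transcript in the CSV. The log seems to support this: the second call matches the src transcript exactly, and the first call, once the invalid bytes are ignored, reduces to the "か。" reported as res.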
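As a sanity check on that reading, the reported CER can be recomputed from the src and res strings in the log. This is only a sketch: I am assuming CER here means character-level edit distance divided by the length of the reference transcript (the usual definition), and I have not checked how DeepSpeech computes it internally. The `levenshtein` function below is my own illustrative helper, not DeepSpeech code.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


src = 'この料理は卵を二個使います。'  # transcript from the CSV (src in the log)
res = 'か。'                          # decoded prediction (res in the log)

distance = levenshtein(src, res)
# Assumed definition: CER = character edit distance / reference length.
cer = distance / len(src)
print(distance, len(src), round(cer, 6))
```

Under that assumption the result agrees with the CER: 0.928571 in the test report, which is consistent with the first Decode call being the model prediction and the second being the CSV transcript.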