Help with Japanese model

Update on this -
Seems like these guys were having the same issue - Training Traditional Chinese for Common Voice using Deep Speech - #17 by othiele

I used their ‘ignore’ solution and added a few debug statements to /usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py:

def Decode(self, input):
    '''Decode a sequence of labels into a string.'''
    # Raw byte string returned by the native decoder
    res = super(UTF8Alphabet, self).Decode(input)
    print("utf8 Decode function called")
    print(res)
    # 'ignore' silently drops invalid UTF-8 byte sequences instead of raising
    return res.decode('utf-8', 'ignore')

My test CSV has only one record.
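
For reference, it follows the standard DeepSpeech wav_filename,wav_filesize,transcript layout, roughly like this (the file size value here is just a placeholder):

wav_filename,wav_filesize,transcript
/home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav,&lt;size in bytes&gt;,この料理は卵を二個使います。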

When I run the test, the following logs get printed:

root@6e061f9543ba:/DeepSpeech# python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 1 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/
I Loading best validating checkpoint from /home/anon/Downloads/jaSTTDatasets/checkpoint/best_dev-13477
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                       utf8 Decode function called
b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
utf8 Decode function called
b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'
Test epoch | Steps: 1 | Elapsed Time: 0:00:21                                                       
Test on /home/anon/Downloads/jaSTTDatasets/final-test.csv - WER: 1.000000, CER: 0.928571, loss: 116.681183
--------------------------------------------------------------------------------
Best WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Median WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Worst WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------

When you run the bytes from the first call through a UTF-8 decoder they are invalid; the bytes from the second call decode cleanly to この料理は卵を二個使います。, which matches the transcript in my CSV.
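
To sanity-check that outside of DeepSpeech, here is a quick standalone Python snippet (the byte literals are copied straight from the log above): a strict UTF-8 decode of the first byte string fails, decoding it with errors='ignore' collapses it to the "か。" shown as res in the report, and the second byte string decodes cleanly to the transcript.

# Byte strings copied from the two "utf8 Decode function called" prints above
model_output = b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
transcript = b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'

try:
    model_output.decode('utf-8')  # strict decode raises on the truncated sequences
except UnicodeDecodeError as e:
    print("first call is not valid UTF-8:", e)

print(model_output.decode('utf-8', 'ignore'))  # -> か。 (matches res in the test report)
print(transcript.decode('utf-8'))              # -> この料理は卵を二個使います。 (matches src)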

I am assuming the function is called once to decode the output predicted by the model and a second time to decode the reference transcript from the CSV, which supports my hypothesis.
