DeepSpeech with another language: UnicodeDecodeError

Hello,

I am trying to create a model for the Latvian language. Whenever I run import_cv2.py with --filter_alphabet, I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
My alphabet.txt and train/dev/test CSVs contain letters specific to the Latvian language, such as ā, ģ, ž, š. I have tried changing the encoding to 'cp1257' in DeepSpeech.py and import_cv2.py, but it still does not work. Any ideas how I could fix this issue?

You should not change the encoding; UTF-8 is fine. Please ensure your data is in correct UTF-8 format, and that you run everything under Python 3.
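
A quick way to check is to attempt a strict UTF-8 decode in Python. This is just a sketch; the file name and script name below are examples, not anything from your setup:

# check_utf8.py -- report whether a file decodes as strict UTF-8
# (the default path is an example; pass your own file as an argument)
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "train.tsv"
try:
    with open(path, encoding="utf-8", errors="strict") as f:
        f.read()
    print(path, "is valid UTF-8")
except UnicodeDecodeError as err:
    print(path, "is NOT valid UTF-8:", err)

If this prints an error, the file needs to be re-encoded before import_cv2.py will accept it.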

I am running everything on Python 3. What do you mean by data in correct UTF-8 format? As I said, there are some specific letters, ā, ģ, ž, š, in the alphabet.

I mean that you evidently have characters that are not properly UTF-8 encoded.

Yes, and it should work as long as it's proper UTF-8. We successfully train on a wide range of languages, including ones with non-Latin alphabets.

Now I understand. My original train/dev/test TSVs are encoded in Windows-1257: as soon as I change the encoding to UTF-8, ā, ģ, ž, š turn into ? symbols. At least that is what I see on Ubuntu 18.04.
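
If the letters turn into ? when you merely switch a viewer or editor to UTF-8, the bytes are being reinterpreted, not converted; a real conversion has to decode from the source code page first. If you are unsure what the source encoding actually is, the third-party chardet package can make a guess (an assumption on my part that you are willing to pip-install it; the path is an example):

# guess_encoding.py -- guess a file's encoding with chardet
# (requires: pip install chardet; the path below is an example)
import chardet

with open("train.tsv", "rb") as f:
    guess = chardet.detect(f.read())
print(guess)  # prints a dict with 'encoding' and 'confidence' keys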

Error message

$ python3 import_cv2.py --filter_alphabet /home/stass/latvian1/alphabet.txt  /home/stass/latvian1
Loading TSV file:  /home/stass/latvian1/train.tsv
Saving new DeepSpeech-formatted CSV file to:  /home/stass/latvian1/clips/train.csv
Traceback (most recent call last):
  File "import_cv2.py", line 163, in <module>
    _preprocess_data(PARAMS.tsv_dir, AUDIO_DIR, label_filter_fun, PARAMS.space_after_every_character)
  File "import_cv2.py", line 42, in _preprocess_data
    _maybe_convert_set(input_tsv, audio_dir, label_filter, space_after_every_character)
  File "import_cv2.py", line 53, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.6/csv.py", line 112, in __next__
    row = next(self.reader)
  File "/home/stass/train/deepspeech-venv1/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 7580: invalid continuation byte
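
The failing byte is telling: 0xe8 can never be followed by a plain ASCII byte in UTF-8 (it announces a three-byte sequence), but in Windows-1257 it is a printable letter, č, if I am reading the code-page table correctly. A quick interpreter check:

>>> b'\xe8s'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
>>> b'\xe8s'.decode('cp1257')
'čs'

So the traceback itself is consistent with the file still being Windows-1257 rather than UTF-8.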

Well, I don't know your setup, but you know what to do: produce properly UTF-8-encoded files.
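
Assuming the TSVs really are Windows-1257 (I am going only by your description; the file names below are examples), a minimal Python re-encoding sketch would be:

# reencode_tsv.py -- read as Windows-1257, write back out as UTF-8
# (file names are examples; point this at your actual train/dev/test TSVs)
for name in ("train.tsv", "dev.tsv", "test.tsv"):
    with open(name, encoding="cp1257") as src:
        text = src.read()
    with open(name, "w", encoding="utf-8") as dst:
        dst.write(text)

After that, ā, ģ, ž, š should display correctly in any UTF-8 viewer, and import_cv2.py should decode the files without complaint.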

Last question: are you sure that UTF-8 supports letters such as the ones I mentioned above?

When I look at https://en.wikipedia.org/wiki/Windows-1257, I do see them there.

I am not a pro, and maybe I am missing something.

I have just tried once again: I replaced all ā, ģ, ž, š with a, g, z, s, and it worked; I could start training the model with import_cv2.py and DeepSpeech.py. So maybe the problem is that UTF-8 encoding does not support those characters.

UTF-8 is designed to handle just about any language on Earth.

No, UTF-8 is fine. It’s your data that is not properly encoded in UTF-8.
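
For the record, ā, ģ, ž, š all round-trip cleanly through UTF-8; a quick interpreter check:

>>> "ā ģ ž š".encode("utf-8")
b'\xc4\x81 \xc4\xa3 \xc5\xbe \xc5\xa1'
>>> b'\xc4\x81 \xc4\xa3 \xc5\xbe \xc5\xa1'.decode("utf-8")
'ā ģ ž š'

Every Unicode character has a UTF-8 encoding; the bytes above are simply the two-byte sequences UTF-8 assigns to these letters.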

OK, thank you. I will continue to dig into why the encoding does not work.