I am trying to create a model for the Latvian language. Currently, whenever I run import_cv2.py with --filter_alphabet, I get an error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
My alphabet.txt and train.csv/test/dev contain letters that are specific to the Latvian language, such as ā, ģ, ž, š. I have tried changing the encoding to 'cp1257' in DeepSpeech.py and import_cv2.py, but it still does not work. Any ideas how I could fix this issue?
lissyx
You should not change the encoding; UTF-8 is fine. Please ensure your data is in correct UTF-8 format, and ensure you run everything under Python 3.
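For illustration, a minimal sketch of how one could check whether the data files actually decode as UTF-8 (the file names below are just examples, not the importer's actual defaults):

```python
from pathlib import Path

# Example file names; adjust to whatever your importer actually reads.
for name in ["alphabet.txt", "train.tsv", "dev.tsv", "test.tsv"]:
    path = Path(name)
    if not path.exists():
        continue
    try:
        # A file in "correct UTF-8 format" decodes without errors.
        path.read_text(encoding="utf-8")
        print(f"{name}: valid UTF-8")
    except UnicodeDecodeError as err:
        print(f"{name}: NOT valid UTF-8 ({err})")
```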
I am running everything on Python 3. What do you mean by data in correct UTF-8 format? As I said, there are some specific letters ā, ģ, ž, š in the alphabet.
lissyx
I mean that you obviously have characters improperly encoded in UTF-8.
Yes, and it should work as long as it's proper UTF-8. We successfully train on a wide range of languages, including ones with non-Latin alphabets.
Now I understand. My initial train.tsv/test/dev files are encoded in Windows-1257; as soon as I change them to UTF-8, ā, ģ, ž, š become ? symbols, at least that is what I see on Ubuntu 18.04.
I have just tried it once again: I replaced all ā, ģ, ž, š with a, g, z, t and it worked, I could start training the model with import_cv2.py and DeepSpeech.py, so maybe the problem is that the UTF-8 encoding does not support these characters.
lissyx
UTF-8 is made to handle virtually any language on Earth.
No, UTF-8 is fine. It’s your data that is not properly encoded in UTF-8.
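In other words, the practical fix is to convert the files from the legacy code page to UTF-8 rather than stripping the Latvian characters. A minimal sketch, assuming the originals really are Windows-1257 and using example file names (something like iconv -f CP1257 -t UTF-8 on the command line would work equally well):

```python
from pathlib import Path

SOURCE_ENCODING = "cp1257"  # assumption: the original files were saved as Windows-1257
FILES = ["alphabet.txt", "train.tsv", "dev.tsv", "test.tsv"]  # example names

for name in FILES:
    src = Path(name)
    if not src.exists():
        continue
    # Decode with the legacy code page, then write a UTF-8 copy.
    text = src.read_text(encoding=SOURCE_ENCODING)
    dst = Path(name + ".utf8")
    dst.write_text(text, encoding="utf-8")
    # Sanity check: the copy must decode cleanly as UTF-8.
    dst.read_text(encoding="utf-8")
    print(f"converted {src} -> {dst}")
```

After converting this way, ā, ģ, ž, š should survive in the UTF-8 copies, and import_cv2.py should no longer raise the UnicodeDecodeError.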