I want to import the German VoxForge dataset to do some testing.
I used the provided import script for the English dataset (DeepSpeech/bin/import_voxforge.py at master · mozilla/DeepSpeech · GitHub) and changed the link to point to the German dataset (VoxForge Repository). Downloading and unpacking work fine. However, when the script tries to write the CSV files, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 43: invalid start byte
I’ve been looking into this for the past few hours and tried changing the “utf-8” parameter in the codecs.open line to “latin-1” or “ISO-8859-1”, without success (see: Re: UTF-8 instead of ISO-8859-1 - voxforge.org).
In those cases I do get the transcript, but almost all umlauts are missing, leaving a gap in the word where the character should be.
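To narrow it down, here is a minimal sketch of what I believe is happening, run outside the import script. The sample word and the helper try_decode are made up for illustration, and I am assuming the prompt files are Latin-1 encoded, which I have not confirmed.

```python
# Hypothetical sample word; the real text comes from the VoxForge prompt files.
raw = "schön".encode("latin-1")  # b'sch\xf6n', contains the 0xf6 byte from the error


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xf6 is not a valid UTF-8 start byte, so this decode fails.
utf8_result = try_decode(raw, "utf-8")

# In Latin-1 every byte maps to a character, and 0xf6 is "ö".
latin1_result = try_decode(raw, "latin-1")

print(utf8_result)
print(latin1_result)
```

If the real files behave the same way, the umlauts should survive the decoding step, so I am not sure where they get lost afterwards.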
Does anyone know how I can solve this, or how I can otherwise use these files?
Thank you in advance
PS: I had to add “import sys” because line 223 uses it for the path input argument, if I’m right.