Voxforge import file wrong encoding

I want to import the German VoxForge dataset to do some testing.
I used the provided import script for the English dataset (DeepSpeech/bin/import_voxforge.py at master · mozilla/DeepSpeech · GitHub) and changed the link to point to the German dataset (VoxForge Repository). Downloading and unpacking work fine. However, when the script tries to write the CSV files, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 43: invalid start byte

I've been looking into this for the past few hours and tried changing the "utf-8" parameter in the codecs.open line to "latin-1" or "ISO-8859-1", with no success (see: Re: UTF-8 instead of ISO-8859-1 - voxforge.org).

In those cases I do get the transcript, but almost all umlauts are left blank (so there is a hole in the middle of a word, for example).
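For reference, byte 0xf6 is "ö" in ISO-8859-1, which is why the UTF-8 decoder rejects it. Below is a minimal sketch of a decode-with-fallback step, assuming the prompt files are a mix of UTF-8 and Latin-1. The helper name decode_prompt_line is hypothetical and not part of import_voxforge.py, and it only covers the decoding step, not whatever happens to the text afterwards.

```python
def decode_prompt_line(raw: bytes) -> str:
    """Decode one raw line, trying UTF-8 first and Latin-1 second.

    Hypothetical helper for illustration; not taken from import_voxforge.py.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every possible byte to a code point,
        # so this fallback cannot raise.
        return raw.decode("latin-1")


# The same word encoded both ways; the Latin-1 version contains byte 0xf6.
utf8_line = "schön".encode("utf-8")
latin1_line = "schön".encode("latin-1")

print(decode_prompt_line(utf8_line))
print(decode_prompt_line(latin1_line))
```

Reading the file in binary mode and decoding line by line like this avoids having to pick a single encoding for the whole dataset up front.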

Does anyone know what I should do to solve this, or how I can still use these files?

Thank you in advance!

PS: I had to add "import sys" because of line 223 (the path input argument, if I'm right).

Smells like Python2 vs Python3. Please use the latter.

I do use Python 3

python3 import_voxforge.py path/to/destination

edit: just some extra info
deepspeech 0.7.0
os :ubuntu 18.04

Look at this repo for how to import the German VoxForge data. It worked for me last year.

Seems to have worked! Thank you