I am trying to run the import_cv2.py script. I have downloaded the English Common Voice release, and my environment is a Docker container built from the repo.
The input
python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/ --normalize
(The alphabet file is also from the repo. The result is the same with or without --normalize.)
The error
Saving new DeepSpeech-formatted CSV file to: data/cv/en/clips/train.csv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 158, in <module>
    _preprocess_data(params.tsv_dir, audio_dir, label_filter)
  File "bin/import_cv2.py", line 43, in _preprocess_data
    _maybe_convert_set(input_tsv, audio_dir, label_filter)
  File "bin/import_cv2.py", line 56, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8085: ordinal not in range(128)
I’m guessing this is some mp3 prefix or buffer but I’m not really familiar with it. I searched the github repo as well as this forum and found no prior issues. Perhaps I’m not running it correctly.
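For what it is worth, the last frames of the traceback are in csv.py and encodings/ascii.py, so the failure happens while decoding the text of the TSV header row, not while touching any audio. Byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (the curly apostrophe U+2019 is one of them), which an ASCII decoder rejects. The following is only a minimal sketch of that failure mode, assuming the TSV is UTF-8 and the file was opened with an ASCII default encoding. The sample row and the `read_rows` helper are hypothetical and are not taken from import_cv2.py.

```python
import csv
import io

# Hypothetical two-line TSV standing in for a Common Voice file.
# The sentence contains U+2019, which UTF-8 encodes as bytes e2 80 99.
sample = "path\tsentence\nclip_0001.mp3\tdon\u2019t stop\n".encode("utf-8")


def read_rows(raw_bytes, encoding):
    """Decode raw TSV bytes with the given encoding and return the rows."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    reader = csv.DictReader(text, delimiter="\t")
    return list(reader)


# Decoding as ASCII fails on the first non-ASCII byte, as in the traceback.
try:
    read_rows(sample, "ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

# Decoding the same bytes as UTF-8 succeeds.
rows = read_rows(sample, "utf-8")

print(ascii_error)
print(rows[0]["sentence"])
```

If this is what is happening, passing an explicit `encoding="utf-8"` wherever the TSV is opened, or giving the container a UTF-8 locale, would be the usual ways around it. Neither has been verified against this setup.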
lissyx
Strange, we had no issue. Can you give more context on your system and on which release of Common Voice you are using? Also, can you try without --filter_alphabet?
I tried the process outside of Docker, and it works with --alphabet. The md5sum checks out as well. I can continue to troubleshoot and post if I find the root cause. It must be my container. I’m going to prune my system and rebuild the image and container. It looks like the Dockerfile was also updated a couple of days ago, so I’ll use the update and see what happens.
I’ve since resolved this by not running the pre-processing steps within a container but thought I’d just follow up.
Ubuntu 18.04
Docker version 18.09.5, build e8ff056
nvidia-docker 2.0
Common Voice 2.0 (whatever is currently on the Mozilla website as of my original post) Dockerfile
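Since the same command works on the host but fails in the container, one thing that may be worth comparing between the two is the default text encoding Python derives from the locale. On Python 3.6, `open()` without an `encoding` argument uses `locale.getpreferredencoding(False)`, and a container with no UTF-8 locale configured can report an ASCII variant there. The snippet below is a small diagnostic sketch to run in both environments. The `describe_text_defaults` helper is a hypothetical name, and that the locale is the cause here is an assumption, not something confirmed in this thread.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the encodings Python falls back to when none is given."""
    return {
        # Used by open() when no encoding argument is passed.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Used when printing to the terminal (may be None if redirected).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_text_defaults()
for name, value in info.items():
    print("{}: {}".format(name, value))
```

If the container reports something like ANSI_X3.4-1968 or ascii where the host reports UTF-8, setting a UTF-8 locale in the container (for example through the `LANG` environment variable) is a commonly used remedy.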