CommonVoice import_cv2.py error

Hello,

I am trying to run the import_cv2.py script. I have downloaded the English Common Voice release, and my environment is a Docker container built from the repo.

The input

python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/ --normalize
(The alphabet file is also from the repo; the result is the same with or without --normalize.)

The error

Saving new DeepSpeech-formatted CSV file to:  data/cv/en/clips/train.csv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 158, in <module>
    _preprocess_data(params.tsv_dir, audio_dir, label_filter)
  File "bin/import_cv2.py", line 43, in _preprocess_data
    _maybe_convert_set(input_tsv, audio_dir, label_filter)
  File "bin/import_cv2.py", line 56, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8085: ordinal not in range(128)

I’m guessing this is some mp3 prefix or buffer issue, though the traceback looks like the TSV itself is being decoded with the ASCII codec; I’m not really familiar with it. I searched the GitHub repo as well as this forum and found no prior issues. Perhaps I’m not running it correctly.
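
For reference, here is a minimal sketch of what I think the failing step boils down to, with an explicit UTF-8 encoding forced instead of the locale default. This is not the actual import_cv2.py code, just an illustration; the path and sentence column names are taken from the Common Voice TSV layout.

import csv

def read_tsv(input_tsv):
    # Force UTF-8 so the read does not depend on LANG/LC_ALL inside the container;
    # with a bare POSIX locale, open() falls back to ASCII and raises the
    # UnicodeDecodeError shown above on the first non-ASCII character.
    with open(input_tsv, encoding='utf-8', newline='') as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        for row in reader:
            yield row['path'], row['sentence']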

Strange, we had no issue. Can you give more context on your system and which release of Common Voice you are using? Also, can you try without --filter_alphabet?

Just imported CV2-en with latest master.

DeepSpeech$ python3 bin/import_cv2.py --normalize --filter_alphabet data/alphabet.txt /work/CommonVoice/en
Loading TSV file:  /work/CommonVoice/en/train.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/train.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/train.csv
Progress |####################################| 100% completed
Imported 12123 samples.
Skipped 72 samples that failed on transcript validation.
Skipped 12 samples that were longer than 10 seconds.
Loading TSV file:  /work/CommonVoice/en/test.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/test.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/test.csv
Progress |####################################| 100% completed
Imported 6804 samples.
Skipped 306 samples that failed on transcript validation.
Skipped 212 samples that were longer than 10 seconds.
Loading TSV file:  /work/CommonVoice/en/dev.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/dev.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/dev.csv
Progress |####################################| 100% completed
Imported 6940 samples.
Skipped 3 samples that failed upon conversion.
Skipped 268 samples that failed on transcript validation.
Skipped 73 samples that were longer than 10 seconds.
Progress |####################################| 100% completed
Progress |####################################| 100% completed
Progress |####################################| 100% completed

Maybe your archive is different - could you run md5sum on it?

DeepSpeech$ md5sum /work/CommonVoice/en.tar.gz 
a639b0e22b969d76abe1c40beb0d3439  /work/CommonVoice/en.tar.gz

I tried the process outside of Docker, and it works with --filter_alphabet; the md5sum checks out as well. I’ll continue to troubleshoot and post if I find the root cause. It must be my container, so I’m going to prune my system and rebuild the image/container. It looks like the Dockerfile was also updated a couple of days ago, so I’ll use the updated one and see what happens.

I’ve since resolved this by not running the pre-processing steps inside a container, but I thought I’d follow up.

Ubuntu 18.04
Docker version 18.09.5, build e8ff056
nvidia-docker 2.0
Common Voice 2.0 (whatever is currently on the Mozilla website as of my original post)
Dockerfile

brandon@daedalus:~/Documents/DeepSpeech$ docker build -f Dockerfile -t deepspeech .
brandon@daedalus:~/Documents/DeepSpeech$ nvidia-docker run -it -d \
  -p 8888:8888 -p 6006:6006 \
  -u $(id -u):$(id -g) \
  -e HOME=/home/$USER \
  -v /home/$USER:/home/$USER \
  -v /home/brandon/raid/share:/ncc1701 \
  deepspeech
brandon@daedalus:~/Documents/DeepSpeech$ docker exec -it 7f1423f7c6d1 bash

The process for getting the audio outside Docker, which worked:

brandon@daedalus:~/Documents/DeepSpeech/data$ wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-1/en.tar.gz
brandon@daedalus:~/Documents/DeepSpeech/data$ tar -xzf en.tar.gz
brandon@daedalus:~/Documents/DeepSpeech$ python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/

After handling the process outside of Docker, I haven’t had any issues training.


I fixed this by setting the Docker image’s locale to UTF-8.

apt update
apt install locales

locale-gen en_US.UTF-8

export LANG=en_US.UTF-8
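
A quick way to double-check this from a Python shell inside the container, before and after the locale change (just a sanity check on my assumption, not required for the import):

import locale
# In a bare container with no locale configured this prints 'ANSI_X3.4-1968'
# (i.e. ASCII), which is why open() without encoding= chokes on the TSV;
# after the locale-gen/LANG steps above it should print 'UTF-8'.
print(locale.getpreferredencoding())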

Thanks, do not hesitate to send a PR if needed!