CommonVoice import_cv2.py error

Hello,

I am trying to run the import_cv2.py script. I have downloaded the English Common Voice release, and my environment is a Docker container built from the repo.

The input

python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/ --normalize
(The alphabet file is also from the repo; the result is the same with or without --normalize.)

The error

Saving new DeepSpeech-formatted CSV file to:  data/cv/en/clips/train.csv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 158, in <module>
    _preprocess_data(params.tsv_dir, audio_dir, label_filter)
  File "bin/import_cv2.py", line 43, in _preprocess_data
    _maybe_convert_set(input_tsv, audio_dir, label_filter)
  File "bin/import_cv2.py", line 56, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8085: ordinal not in range(128)

I’m guessing this is some mp3 prefix or buffer issue, though the traceback looks like the TSV itself is being decoded with the ASCII codec; I’m not really familiar with it. I searched the GitHub repo as well as this forum and found no prior issues. Perhaps I’m not running it correctly.
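
For reference, here is a minimal sketch of what I think the failing step boils down to, with an explicit UTF-8 encoding forced instead of the locale default. This is not the actual import_cv2.py code, just an illustration; the path and sentence column names are taken from the Common Voice TSV layout.

import csv

def read_tsv(input_tsv):
    # Force UTF-8 so the read does not depend on LANG/LC_ALL inside the container;
    # with a bare POSIX locale, open() falls back to ASCII and raises the
    # UnicodeDecodeError shown above on the first non-ASCII character.
    with open(input_tsv, encoding='utf-8', newline='') as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        for row in reader:
            yield row['path'], row['sentence']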

Strange, we had no issue. Can you give more context on your system and which release of Common Voice you are using? Also, can you try without --filter_alphabet?

Just imported CV2-en with latest master.

DeepSpeech$ python3 bin/import_cv2.py --normalize --filter_alphabet data/alphabet.txt /work/CommonVoice/en
Loading TSV file:  /work/CommonVoice/en/train.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/train.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/train.csv
Progress |####################################| 100% completed
Imported 12123 samples.
Skipped 72 samples that failed on transcript validation.
Skipped 12 samples that were longer than 10 seconds.
Loading TSV file:  /work/CommonVoice/en/test.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/test.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/test.csv
Progress |####################################| 100% completed
Imported 6804 samples.
Skipped 306 samples that failed on transcript validation.
Skipped 212 samples that were longer than 10 seconds.
Loading TSV file:  /work/CommonVoice/en/dev.tsv
Saving new DeepSpeech-formatted CSV file to:  /work/CommonVoice/en/clips/dev.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as:  /work/CommonVoice/en/clips/dev.csv
Progress |####################################| 100% completed
Imported 6940 samples.
Skipped 3 samples that failed upon conversion.
Skipped 268 samples that failed on transcript validation.
Skipped 73 samples that were longer than 10 seconds.
Progress |####################################| 100% completed
Progress |####################################| 100% completed
Progress |####################################| 100% completed

Maybe your archive is different - could you run md5sum on it?

DeepSpeech$ md5sum /work/CommonVoice/en.tar.gz 
a639b0e22b969d76abe1c40beb0d3439  /work/CommonVoice/en.tar.gz

I tried the process outside of Docker, and it works with --filter_alphabet; the md5sum checks out as well. I’ll continue to troubleshoot and post if I find the root cause. It must be my container, so I’m going to prune my system and rebuild the image/container. It looks like the Dockerfile was also updated a couple of days ago, so I’ll use the updated one and see what happens.

I’ve since resolved this by not running the pre-processing steps inside a container, but I thought I’d follow up.

Ubuntu 18.04
Docker version 18.09.5, build e8ff056
nvidia-docker 2.0
Common Voice 2.0 (whatever is currently on the Mozilla website as of my original post)
Dockerfile

brandon@daedalus:~/Documents/DeepSpeech$ docker build -f Dockerfile -t deepspeech .
brandon@daedalus:~/Documents/DeepSpeech$ nvidia-docker run -it -d \
  -p 8888:8888 -p 6006:6006 \
  -u $(id -u):$(id -g) \
  -e HOME=/home/$USER \
  -v /home/$USER:/home/$USER \
  -v /home/brandon/raid/share:/ncc1701 \
  deepspeech
brandon@daedalus:~/Documents/DeepSpeech$ docker exec -it 7f1423f7c6d1 bash

The process for getting the audio outside Docker, which worked:

brandon@daedalus:~/Documents/DeepSpeech/data$ wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-1/en.tar.gz
brandon@daedalus:~/Documents/DeepSpeech/data$ tar -xzf en.tar.gz
brandon@daedalus:~/Documents/DeepSpeech$ python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/

After handling the process outside of Docker, I haven’t had any issues training.


I fixed this by setting the Docker image’s locale to UTF-8.

apt update
apt install locales

locale-gen en_US.UTF-8

export LANG=en_US.UTF-8
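
A quick way to double-check this from a Python shell inside the container, before and after the locale change (just a sanity check on my assumption, not required for the import):

import locale
# In a bare container with no locale configured this prints 'ANSI_X3.4-1968'
# (i.e. ASCII), which is why open() without encoding= chokes on the TSV;
# after the locale-gen/LANG steps above it should print 'UTF-8'.
print(locale.getpreferredencoding())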

Thanks, do not hesitate to send a PR if needed!