Running import_cv2.py: validated.tsv stops at 78%

LucieDevGirl · February 24, 2021, 2:47pm

Hi, I ran into a problem when running import_cv2.py.

Loading test.tsv, dev.tsv and train.tsv works fine and reaches 100%, but loading validated.tsv stops at 78% with the following error:

Progress |#########################################################################################################################################                                       |  78% completed
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "bin/import_cv2.py", line 71, in one_sample
    subprocess.check_output(
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['soxi', '-s', '../cv-corpus-6.1-2020-12-11/fr/clips/common_voice_fr_17892259.wav']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bin/import_cv2.py", line 221, in <module>
    main()
  File "bin/import_cv2.py", line 216, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 172, in _preprocess_data
    set_samples = _maybe_convert_set(dataset, tsv_dir, audio_dir, space_after_every_character)
  File "bin/import_cv2.py", line 127, in _maybe_convert_set
    for i, processed in enumerate(pool.imap_unordered(one_sample, samples), start=1):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
subprocess.CalledProcessError: Command '['soxi', '-s', '../cv-corpus-6.1-2020-12-11/fr/clips/common_voice_fr_17892259.wav']' returned non-zero exit status 1.

So I checked common_voice_fr_17892259.wav: it’s a 0-byte file, which explains the bug. BUT I had already hit this issue before and found 5 or 6 .wav files of 0 bytes, so I went to validated.tsv and removed the lines containing these files.
Then I ran import_cv2.py again and got the same error at 78% too (with another wav file this time)! I went back to my clips folder and can see that there are now more than 10 .wav files of 0 bytes!

What are these corrupted .wav files? Does anyone know how to solve this problem?

PS: I should mention that I don’t use the pretrained French model because I don’t know how to use it; I’m not yet familiar with Docker.

lissyx · February 24, 2021, 3:17pm

There’s documentation, which I can’t improve if people don’t share their feedback.

othiele · February 24, 2021, 3:20pm

If you want a French model for regular speech, just try lissyx’s Docker. It is a lot easier and faster than trying to train your own model.

As for your problem, check that the script is not downloading fresh material again and overwriting your changes.

LucieDevGirl · February 24, 2021, 3:25pm

Is this the documentation?

lissyx · February 24, 2021, 3:42pm

There’s a CONTRIBUTING.MD file next to the Dockerfile; it should give enough info on how to run it.

Anyway, your corruption issue is not something I have experienced, so I can only suggest re-downloading.

lissyx · February 24, 2021, 3:45pm

BTW, the pretrained model does not require anything related to Docker: just download the files, extract them, and use them as with the English …

lissyx · February 24, 2021, 3:45pm

This is the released model, ready to use.

lissyx · February 24, 2021, 4:13pm

This is weird: I don’t have this file in the Common Voice FR dataset (the version documented in the Dockerfile / release):

$ ll ~/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*
ls: cannot access '/home/alexandre/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*': No such file or directory

Can you cross-check your tar.gz against this sha256? We check for it in the code for running the docker training:

From DeepSpeech/fr/params.sh in github.com/common-voice/commonvoice-fr (at c201ff721):

export CV_RELEASE_FILENAME="cv-4-fr.tar.gz"
export CV_RELEASE_SHA256="ffda45f2006fb6092fb435c786cde422e38183f7837e9faa65cb273439cf369e"

You should have ffda45f2006fb6092fb435c786cde422e38183f7837e9faa65cb273439cf369e
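
For anyone who prefers to do this cross-check from Python rather than a shell tool, here is a minimal sketch. The function names `sha256_of` and `matches_release` are made up for illustration and are not part of the training scripts; only `hashlib` from the standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read 1 MiB at a time so a multi-gigabyte archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path, expected_hex):
    """Compare a file's digest with an expected one (case and whitespace ignored)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches_release("cv-4-fr.tar.gz", "ffda45f2006fb6092fb435c786cde422e38183f7837e9faa65cb273439cf369e")` would then return True only for the archive referenced in params.sh.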

LucieDevGirl · February 24, 2021, 4:44pm

Actually I have 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45

I didn’t get cv-4-fr.tar.gz from https://commonvoice.mozilla.org/fr/datasets; mine is fr.tar.gz.

lissyx · February 24, 2021, 4:49pm

Looking at the website, they advertise it as Common Voice 6.1 while mine is 4 :). Maybe they have a broken release. You should file a bug with references to the offending files.

lissyx · February 24, 2021, 4:50pm

And it does match the sha256 checksum advertised on the website: 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45

lissyx · February 24, 2021, 4:52pm

Wait wait wait. The wav files are produced by import_cv2.py. The released dataset should only refer to .mp3.

lissyx · February 24, 2021, 5:26pm

The 0-byte file is just a symptom that something failed during the MP3 → WAV conversion. Can you check the matching .mp3 files? Have you tried removing the 0-byte WAVs and re-running? You said you re-ran, but did not mention whether you kept the 0-byte files or not.
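
A minimal sketch of that cleanup in Python, assuming the WAVs sit next to their MP3 sources in the clips/ directory under the same base name (as the paths in the traceback suggest). The helper names are hypothetical and not part of import_cv2.py. The idea is that a leftover 0-byte WAV probably keeps the importer from converting that clip again, so it has to go before a re-run.

```python
from pathlib import Path


def find_empty_wavs(clips_dir):
    """Return the sorted paths of 0-byte .wav files directly inside clips_dir."""
    return sorted(p for p in Path(clips_dir).glob("*.wav") if p.stat().st_size == 0)


def remove_empty_wavs(clips_dir, dry_run=True):
    """Report each 0-byte WAV with its MP3 source; delete the WAV unless dry_run.

    Each report entry is (wav name, mp3 name, whether the mp3 exists and is non-empty).
    """
    report = []
    for wav in find_empty_wavs(clips_dir):
        mp3 = wav.with_suffix(".mp3")  # assumed naming: same stem, .mp3 extension
        mp3_ok = mp3.exists() and mp3.stat().st_size > 0
        report.append((wav.name, mp3.name, mp3_ok))
        if not dry_run:
            wav.unlink()
    return report
```

Running it first with `dry_run=True` lists the affected clips without touching anything; an entry whose last field is False points at an MP3 that is itself missing or empty.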

lissyx · February 24, 2021, 5:28pm

I find no reference to “17892256” within any of the .tsv files nor within the clips/ directory.

lissyx · February 24, 2021, 5:35pm

Maybe remove this try/except/pass (https://github.com/mozilla/DeepSpeech/blob/master/bin/import_cv2.py#L184-L187); that might surface the SoX failure.
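
As an illustration of the idea only, and not the actual code at those lines, here is a sketch of a conversion loop that records errors instead of discarding them. `convert` is a hypothetical stand-in for whatever call performs the MP3 → WAV conversion, and `convert_all` is a made-up name.

```python
def convert_all(convert, pairs):
    """Call convert(src, dst) for each pair and collect failures instead of hiding them.

    Returns a list of (src, "ExceptionType: message") for every conversion that raised.
    """
    failures = []
    for src, dst in pairs:
        try:
            convert(src, dst)
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            failures.append((src, f"{type(exc).__name__}: {exc}"))
    return failures
```

Printing the returned list after a run shows which source files failed and with what message, which is the information the bare `pass` throws away.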

LucieDevGirl · February 24, 2021, 7:35pm

No, I didn’t remove them; I only removed the lines in the tsv that reference the associated mp3 files. But you’re right, it was a mistake on my part. I will download your version 4 this time and try again.

LucieDevGirl · February 24, 2021, 7:40pm

OK, thanks a lot!

I will try again, and then try this if it doesn’t work. Thanks for your help!

lissyx · February 25, 2021, 8:45am

Perfect, just let us know, and please shed some light on this common_voice_fr_17892256.mp3, because I can’t find it at all in the 6.1 release I downloaded.

If you confirm the dataset release is indeed broken, then it’s an issue to report to Common Voice on GitHub, not to us.

However, please note that “Update on Common Voice: Mozilla Foundation” still applies, and it might be a few days or a week until people can resume working on the project and fix this kind of issue.

LucieDevGirl · February 25, 2021, 9:58am

My bad, it was not 17892256.mp3 but 17892259.mp3. I edited my question, sorry.

lissyx · February 25, 2021, 10:07am

OK, well, you have the info to investigate further now. Chances are the release itself is broken, and the best you can do is identify the broken mp3 files, remove them from the tsv, and report upstream.
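
A minimal sketch of the "remove them from the tsv" step, assuming the tsv has a header row and a `path` column holding the clip file name (check your own file; the column name is a parameter here). `drop_clips` is a hypothetical helper, not part of DeepSpeech. It writes a new file rather than editing in place, so the original tsv stays available for the upstream report.

```python
def drop_clips(tsv_in, tsv_out, broken_names, column="path"):
    """Copy tsv_in to tsv_out, skipping rows whose `column` value is in broken_names.

    Kept rows are written back unchanged. Returns the number of rows dropped.
    """
    broken = set(broken_names)
    dropped = 0
    with open(tsv_in, encoding="utf-8", newline="") as src, \
         open(tsv_out, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        # Locate the column by name so the column order does not matter.
        index = header.rstrip("\r\n").split("\t").index(column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > index and fields[index] in broken:
                dropped += 1
                continue
            dst.write(line)
    return dropped
```

For example, `drop_clips("validated.tsv", "validated.clean.tsv", {"common_voice_fr_17892259.mp3"})` would write a copy without that clip and return how many rows were removed.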