import_cv2.py stops at 78% while processing validated.tsv

Hi, I ran into a problem when running import_cv2.py.

Loading test.tsv, dev.tsv and train.tsv works fine and reaches 100%, but loading validated.tsv stops at 78% with the following error:

Progress |#########################################################################################################################################                                       |  78% completed
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "bin/import_cv2.py", line 71, in one_sample
    subprocess.check_output(
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['soxi', '-s', '../cv-corpus-6.1-2020-12-11/fr/clips/common_voice_fr_17892259.wav']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "bin/import_cv2.py", line 221, in <module>
    main()
  File "bin/import_cv2.py", line 216, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 172, in _preprocess_data
    set_samples = _maybe_convert_set(dataset, tsv_dir, audio_dir, space_after_every_character)
  File "bin/import_cv2.py", line 127, in _maybe_convert_set
    for i, processed in enumerate(pool.imap_unordered(one_sample, samples), start=1):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
subprocess.CalledProcessError: Command '['soxi', '-s', '../cv-corpus-6.1-2020-12-11/fr/clips/common_voice_fr_17892259.wav']' returned non-zero exit status 1.

So I checked common_voice_fr_17892259.wav: it’s a 0-byte file, which explains the error. BUT I had already hit this issue before and found 5 or 6 .wav files of 0 bytes, so I went into validated.tsv and removed the lines referencing those files.
Then I ran import_cv2.py again and got the same error, again at 78% (with another .wav file this time)! I went back to my clips folder and can see that there are now more than 10 .wav files of 0 bytes!

What are these corrupted .wav files? Does anyone know how to solve this problem?

PS: I should mention that I’m not using the pretrained French model because I don’t know how to use it; I’m not yet familiar with Docker.

There’s documentation, which I can’t improve if people don’t share their feedback.

If you want a French model for regular speech, just try lissyx’s Docker. It is a lot easier and faster than trying to train your own model.

As for your problem, check that the script is not downloading fresh material again and overwriting your changes.

Is this the documentation?

There’s a CONTRIBUTING.MD file next to the Dockerfile; it should give enough information on how to run it.

Anyway, your corruption issue is not something I have experienced, so I can only suggest re-downloading.

BTW, the pretrained model does not require anything related to Docker: just download the files, extract them and use them as with the English …

This is the released model, ready to use.

This is weird: I don’t have this file in the Common Voice FR dataset (the version documented in the Dockerfile / release):

$ ll ~/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*
ls: impossible d'accéder à '/home/alexandre/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*': Aucun fichier ou dossier de ce type
(That is, ls reports "No such file or directory".)

Can you cross-check your tar.gz against this sha256? We check for it in the code for running the docker training:

You should have ffda45f2006fb6092fb435c786cde422e38183f7837e9faa65cb273439cf369e
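A minimal sketch of that cross-check in Python, using only the standard library. The function names sha256_of_file and matches_expected are made up for illustration and are not part of the DeepSpeech tooling; the file is read in chunks so a multi-gigabyte archive does not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling matches_expected on the downloaded tar.gz with the checksum quoted above should return True if the archive is the one used for the Docker training.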

Actually I have 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45

I didn’t get cv-4-fr.tar.gz. I downloaded from https://commonvoice.mozilla.org/fr/datasets and my file is fr.tar.gz.

Looking at the website, they advertise Common Voice 6.1, while mine is 4 :). Maybe they have a broken release. You should file a bug referencing the offending files.

And your checksum does match the sha256 advertised on the website: 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45

Wait wait wait. The wav files are produced by import_cv2.py. The released dataset should only refer to .mp3.

The 0-byte files are just a symptom that something failed during the MP3 -> WAV conversion. Can you check the matching .mp3 files? Have you tried removing the 0-byte .wav files and re-running? You said you re-ran, but did not mention whether you kept the 0-byte files or not.
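A rough sketch of that check, assuming (as the traceback suggests) that each .wav sits in the clips directory next to an .mp3 with the same base name. The helper names find_empty_wavs and report_and_remove are hypothetical, not something import_cv2.py provides.

```python
from pathlib import Path


def find_empty_wavs(clips_dir):
    """Return the 0-byte .wav files found directly inside clips_dir, sorted by name."""
    return sorted(p for p in Path(clips_dir).glob("*.wav") if p.stat().st_size == 0)


def report_and_remove(clips_dir, remove=False):
    """Pair each empty .wav with its source .mp3 and optionally delete the .wav.

    Returns a list of (wav_name, mp3_name, mp3_size) tuples, where mp3_size is
    None when the matching .mp3 does not exist (assumption: same base name).
    """
    report = []
    for wav in find_empty_wavs(clips_dir):
        mp3 = wav.with_suffix(".mp3")
        mp3_size = mp3.stat().st_size if mp3.exists() else None
        report.append((wav.name, mp3.name, mp3_size))
        if remove:
            wav.unlink()  # only the empty .wav is deleted, never the .mp3
    return report
```

Running it once with remove=False lists the suspicious .mp3 files; running it with remove=True before re-launching the import makes sure no stale 0-byte .wav is left behind.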

I find no reference to “17892256” within any of the .tsv files nor within the clips/ directory.

Maybe remove this try/except/pass: https://github.com/mozilla/DeepSpeech/blob/master/bin/import_cv2.py#L184-L187. That might surface the SoX failure.
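To illustrate the general idea (this is not the code at the linked lines, which is not reproduced in this thread), here is a small sketch of running an external command and keeping its error output instead of silently passing over a failure. The helper name run_and_capture is hypothetical.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd (a list of arguments) and return (ok, stderr_text).

    Nothing is swallowed: a non-zero exit status or a missing executable
    is reported back to the caller together with the error message.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # Raised, for instance, when the executable itself cannot be found.
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()
```

Calling it with the command from the traceback, ['soxi', '-s', path_to_wav], should return False together with whatever message soxi printed, which is usually more informative than the bare "non-zero exit status 1".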

No, I didn’t remove them; I just removed the lines in the tsv that reference the associated mp3 files. But you’re right, that was a mistake on my part. I will download your version 4 this time and try again.

OK, thanks a lot!

I’ll try again first, then try this if it doesn’t work! Thanks for your help!

Perfect, just let us know, and please shed some light on this common_voice_fr_17892256.mp3, because I can’t find it at all in the 6.1 release I downloaded.

If you confirm the dataset release is indeed broken, then it’s an issue to report to Common Voice, not us: https://github.com/common-voice/common-voice/issues

However, please note that "Update on Common Voice: Mozilla Foundation" still applies, and it might be a few days or weeks until people can resume working on the project and fix this kind of issue.

My bad, it was not 17892256.mp3 but 17892259.mp3. I have edited my question, sorry.

OK, well, you have the information to investigate further now. Chances are it’s just broken in the release, and the best you can do is identify the broken mp3 files, remove them from the tsv and report upstream :blush:
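A possible way to do the tsv cleanup, sketched under the assumption that the tsv has a header row with a column named path holding the clip file name (for example common_voice_fr_17892259.mp3). The function name drop_clips_from_tsv is made up for illustration; it writes a new file rather than editing the original in place.

```python
def drop_clips_from_tsv(src_tsv, dst_tsv, broken_names, path_column="path"):
    """Copy src_tsv to dst_tsv, skipping rows whose clip name is in broken_names.

    Returns the number of rows dropped. Lines are copied verbatim, so quoting
    and line endings of the kept rows are left untouched.
    """
    broken = set(broken_names)
    dropped = 0
    with open(src_tsv, encoding="utf-8", newline="") as src, \
            open(dst_tsv, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        # Assumption: the first line is a tab-separated header containing path_column.
        column = header.rstrip("\r\n").split("\t").index(path_column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > column and fields[column] in broken:
                dropped += 1
                continue
            dst.write(line)
    return dropped
```

The returned count is a quick sanity check: it should equal the number of broken clips that actually appear in that tsv, and the same list of names can then be included in the upstream bug report.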