This is the released model, ready to use.
This is weird: I don’t have this file in the Common Voice FR dataset (the version documented in the Dockerfile / release):
$ ll ~/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*
ls: impossible d'accéder à '/home/alexandre/tmp/deepspeech-fr-docker/extracted/data/cv-fr/clips/common_voice_fr_17892256*': Aucun fichier ou dossier de ce type
Can you cross-check your tar.gz against this sha256? We check for it in the code for running the docker training:
You should have ffda45f2006fb6092fb435c786cde422e38183f7837e9faa65cb273439cf369e
Actually I have 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45
I didn’t get cv-4-fr.tar.gz from https://commonvoice.mozilla.org/fr/datasets; mine is fr.tar.gz
Looking at the website, they advertise it as Common Voice 6.1, while mine is 4 :). Maybe they have a broken release. You should file a bug that references the offending files.
And it does match the sha256 checksum advertised on the website: 719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45
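For anyone wanting to do the same cross-check without `sha256sum`, here is a minimal Python sketch. The function names `sha256_of_file` and `matches_checksum` are made up for illustration and are not part of DeepSpeech; the file is read in chunks because the archive is several GB.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Hypothetical usage, with the archive name and checksum quoted in this thread:
# matches_checksum(
#     "fr.tar.gz",
#     "719ef964b55d830a095a602aff311db39b77239e9d600b6af646ec2ed57e5e45",
# )
```

A mismatch only tells you the archive differs from the one the checksum was computed for; as seen here, that can simply mean a different dataset release.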
Wait wait wait. The wav files are produced by import_cv2.py. The released dataset should only refer to .mp3.
The 0-byte file is just a symptom that something failed during the MP3 -> WAV conversion. Can you please check the matching .mp3 files? Have you tried removing the 0-byte wav files and re-running? You said you re-ran, but you did not mention whether you kept the 0-byte files or not.
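To make that check concrete, here is a small sketch that lists (and optionally deletes) 0-byte .wav files in a clips directory. The helper names are hypothetical. Deleting them matters only if the importer skips clips whose .wav already exists, which is an assumption worth verifying in import_cv2.py.

```python
from pathlib import Path


def find_empty_wavs(clips_dir):
    """Return the sorted paths of .wav files in clips_dir that are 0 bytes long."""
    return sorted(
        path for path in Path(clips_dir).glob("*.wav") if path.stat().st_size == 0
    )


def remove_empty_wavs(clips_dir):
    """Delete 0-byte .wav files and return the paths that were removed."""
    empty = find_empty_wavs(clips_dir)
    for path in empty:
        path.unlink()
    return empty


# Hypothetical usage on the directory layout mentioned in this thread:
# for path in find_empty_wavs("extracted/data/cv-fr/clips"):
#     print(path.name, "->", path.with_suffix(".mp3").name)
```

The commented loop prints the matching .mp3 name for each empty .wav, which is the file to inspect next.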
I find no reference to “17892256” within any of the .tsv files nor within the clips/ directory.
Maybe remove this try/except/pass: https://github.com/mozilla/DeepSpeech/blob/master/bin/import_cv2.py#L184-L187. That might surface the SoX failure?
No, I didn’t remove them; I just removed the lines in the tsv that refer to the associated mp3. But you’re right, it was a mistake on my part. I will download your version 4 this time and try again.
OK, thanks a lot!
I will try again, then try this if it doesn’t work! Thanks for your help!
Perfect, just let us know, and please shed some light on this common_voice_fr_17892256.mp3, because I can’t find it at all in the 6.1 release I downloaded.
If you confirm the dataset release is indeed broken, then it’s an issue to report to Common Voice, not us: https://github.com/common-voice/common-voice/issues
However, please note that “Update on Common Voice: Mozilla Foundation” still applies, and it might be a few days or a week until people can resume working on the project and fix this kind of issue.
My bad, it was not 17892256.mp3 but 1782259.mp3. I edited my question, sorry.
OK, well, you have the info to investigate further now. Chances are the release is just broken, and the best you can do is identify the broken mp3 files, remove them from the tsv and report upstream.
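If it helps, here is a rough sketch of the "remove them from the tsv" step. It assumes the Common Voice .tsv files have a header row with a `path` column holding the clip file name; please check your own files. The function name and the file names in the usage comment are placeholders. Lines are copied through unchanged rather than re-serialised, so quoting in the sentences is left alone.

```python
def drop_clips_from_tsv(src_tsv, dst_tsv, broken_names, path_column="path"):
    """Copy src_tsv to dst_tsv, skipping rows whose path column is in broken_names.

    Returns a (kept, dropped) tuple of row counts, header excluded.
    """
    broken = set(broken_names)
    kept = dropped = 0
    # newline="" keeps the original line endings intact on both sides.
    with open(src_tsv, encoding="utf-8", newline="") as src, \
            open(dst_tsv, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        dst.write(header)
        # Assumption: the first line is a tab-separated header naming the columns.
        column = header.rstrip("\r\n").split("\t").index(path_column)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > column and fields[column] in broken:
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped


# Hypothetical usage (placeholder file names):
# drop_clips_from_tsv("validated.tsv", "validated.clean.tsv",
#                     ["common_voice_fr_XXXXXXXX.mp3"])
```

Writing to a new file instead of editing in place keeps the original around for the upstream bug report.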
OK, I think I discovered the reason for my problem: I realised that I am running out of space on root (even though I just installed Linux).
So I investigated, and I get this in my projects:
84G /home/lucie/PROJETS/cv-corpus-6.1-2020-12-11
18G /home/lucie/PROJETS/fr.tar.gz
78M /home/lucie/PROJETS/DeepSpeech
Did I do something wrong, or is it normal that cv-corpus takes so much space?
If so, the solution to my problem was simply freeing up disk space all along. I’ll leave this here in case other users hit the same error.
I guess for 19GB (uncompressed) of MP3 + the matching WAV files, it’s expected?
Yes, exactly, so I have my answer now: it stopped at 78% and created 0-byte wav files because of a lack of space on my disk. I cleaned it up and now import_cv2.py works successfully!
Thanks for your help!
Thanks for sharing the final feedback. If you want to send a PR to improve this, you’re welcome.
Sorry man, I would like to help, but I don’t know what a PR is.
A GitHub Pull Request: to improve the code and maybe help others avoid running into the same issue?
I’m not sure exactly what could be fixed here, maybe importers could expose a “likely amount of required space” and check for this to be available?
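As a starting point for such a PR, here is a rough sketch of what a pre-flight check could look like. None of this exists in the importers today. The helper names are invented, and the 16 kHz / mono / 16-bit figures are assumptions about the target WAV format that should be checked against the importer's own settings. `shutil.disk_usage` is the standard-library call that reports free space.

```python
import shutil


def estimate_wav_bytes(total_seconds, sample_rate=16000, channels=1, sample_width=2):
    """Estimate uncompressed PCM size for a given total audio duration.

    Ignores the small per-file WAV header, so it slightly underestimates.
    """
    return int(total_seconds * sample_rate * channels * sample_width)


def has_enough_space(target_dir, required_bytes):
    """Return True if the filesystem holding target_dir has required_bytes free."""
    return shutil.disk_usage(target_dir).free >= required_bytes


def check_space_or_fail(target_dir, required_bytes):
    """Raise early with a readable message instead of producing 0-byte files later."""
    free = shutil.disk_usage(target_dir).free
    if free < required_bytes:
        raise RuntimeError(
            f"Need about {required_bytes / 1e9:.1f} GB in {target_dir}, "
            f"only {free / 1e9:.1f} GB free"
        )
```

With those assumed defaults, one hour of audio comes to `estimate_wav_bytes(3600)` = 115,200,000 bytes, so an importer that knows the total clip duration could call `check_space_or_fail` before converting anything.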