Bin/import_cv2.py seems broken

So, now that I have shown you that the code works, can we get actionable items, and can you share the content of that file?

No, I haven’t used a Docker environment.
@lissyx
what does a

head cv-corpus-6.1-2020-12-11/eo/clips/dev.csv

output?

for me it returns

wav_filename,wav_filesize,transcript
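
A quick way to confirm the "header only" symptom across all generated files is to count the rows after the header. This is a minimal sketch, not part of the importer; the helper name `count_data_rows` is made up for illustration, and it assumes the CSV files are UTF-8 with a single header line, as in the output above.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path):
    """Return the number of rows after the header line in a CSV file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # None means the file is completely empty
        if header is None:
            return 0
        # Every remaining row is a data row.
        return sum(1 for _ in reader)
```

A result of 0 for dev.csv, train.csv and the other files would confirm that only the header was written.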

Thanks for clarifying. Are you on Windows, Linux etc? Python version?
These all have a bearing on the importer file paths.
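
To answer these questions in one go, something like the following can be pasted into the thread. It is only a sketch using the Python standard library; the function name `describe_environment` is made up for illustration, and the virtualenv check is a common heuristic rather than a guarantee.

```python
import platform
import sys


def describe_environment():
    """Collect the OS and Python details that affect importer file paths."""
    # Older virtualenv releases set sys.real_prefix; venv changes sys.prefix.
    in_virtualenv = (
        getattr(sys, "real_prefix", None) is not None
        or sys.prefix != sys.base_prefix
    )
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        "in_virtualenv": in_virtualenv,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that runs the importer, otherwise the report describes a different environment.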

As you can see in the ls -hal output, the files are hundreds of KB or even several MB.

That is why I have been insisting since the beginning that you share your complete setup. We can’t help you if we don’t know precisely what you are doing.

Ran the import again with your alphabet, it’s still working:

# ls -hal cv-corpus-6.1-2020-12-11/eo/clips/*.csv
-rw-r--r-- 1 root root 719K Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/dev.csv
-rw-r--r-- 1 root root 252K Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/other.csv
-rw-r--r-- 1 root root 719K Mar 30 08:36 cv-corpus-6.1-2020-12-11/eo/clips/test.csv
-rw-r--r-- 1 root root 1.9M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/train-all.csv
-rw-r--r-- 1 root root 1.6M Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/train.csv
-rw-r--r-- 1 root root 4.5M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/validated.csv
# head cv-corpus-6.1-2020-12-11/eo/clips/train.csv
wav_filename,wav_filesize,transcript
common_voice_eo_20690131.wav,205100,hiroŝimo estis la sepa urbo laŭ nombro da loĝantoj
common_voice_eo_20690133.wav,152876,ĝi estis iama regna burgo
common_voice_eo_20690129.wav,169004,kun la akvo ankaŭ venas la salo
common_voice_eo_20725920.wav,195884,temas pri malpliiĝanta birdospecio
common_voice_eo_20729065.wav,167468,la lasta speco estas propra al ĉinio
common_voice_eo_20690234.wav,221228,tio estas ankaŭ dank al aktiveco de entreprenistoj
common_voice_eo_20725924.wav,144428,stacioj aspektas relative simile
common_voice_eo_20711894.wav,207404,gravas ankaŭ la geometrio kaj konstruo de ĉirkaŭa medio
common_voice_eo_20690130.wav,318764,la unuiĝinta reĝlando estis la nura lando ankoraŭ milita kontraŭ francio dum alia jaro

OK, if it works for you it’s probably fine. For me it reproducibly fails to generate any file content (but I have a workaround, so it’s fine). I’m on Python 3.7 in a virtual environment set up as described on readthedocs, on 64-bit Arch Linux with python3.7 from the AUR.

Once again, we need the exact steps you followed.

No, it’s not fine: this code works. You should not need a workaround.

Either there is a bug in our docs, or somewhere else.

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?
What is the exact release of common voice you are using?
Can you share complete output (stdout, stderr) when running the importer?
Can you share exact setup steps from the beginning (no “I did as the docs” please)?
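
For the complete output, one option is to run the importer through a small wrapper that saves both streams to a file. This is a hedged sketch: `run_and_capture` is a hypothetical helper, not part of DeepSpeech, and the command list must be whatever you actually ran (it is not shown here because it has not been posted in this thread).

```python
import subprocess


def run_and_capture(command, log_path):
    """Run a command, save its stdout and stderr to log_path, return the exit code."""
    # capture_output and text require Python 3.7 or newer.
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "w", encoding="utf-8") as log:
        log.write("$ " + " ".join(command) + "\n")
        log.write("--- stdout ---\n" + result.stdout)
        log.write("--- stderr ---\n" + result.stderr)
        log.write(f"--- exit code: {result.returncode} ---\n")
    return result.returncode
```

Pass the importer invocation as a list of arguments, exactly as typed on the command line, and attach the resulting log file.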

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?
I downloaded it multiple times and curl didn’t report any issue while downloading. I didn’t verify it because I couldn’t find any checksums when searching Google for site:commonvoice.mozilla.org checksum or site:deepspeech.readthedocs.io checksum.

Can you share complete output (stdout, stderr) when running the importer?

Yes, OK, later (this evening).

Can you share exact setup steps from the beginning (no “I did as the docs” please)?

I’ll put my stuff in a git repository.

The fact that curl does not complain does not prove that the download is intact.

There’s a checksum on the bottom of the download page, after you click on the button: sha256 checksum: c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c
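
If you want to check the archive against that value without extra tools, a sketch along these lines should do. It uses only `hashlib` from the standard library; the helper names `sha256_of_file` and `matches_checksum` are made up for illustration, and the file name eo.tar.gz in the comment is an assumption about where you saved the download.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA-256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example call (assumes the archive was saved as eo.tar.gz):
# matches_checksum("eo.tar.gz", "c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c")
```

If this returns False, the archive differs from what the download page describes, and re-downloading is the first thing to try.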

I used the direct links https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/[...].tar.gz

There is no stable direct link like that; the link is generated when you click on the button.