bin/import_cv2.py seems broken

@54696d21 If you are not willing to share actionable items and proper steps to reproduce, how can you expect me to investigate and help you?

To give you the benefit of the doubt here: what do you think "yoyr" means?
(This seems to be a common interpretation: https://www.urbandictionary.com/define.php?term=YOYR)
I understand it as a rude, sexist slur (hence unprofessional).

What.

On my keyboard, Y is next to U: that’s called a typo.

Skipped 53 samples that were longer than 10 seconds.
Final amount of imported audio: 4:48:24 from 4:57:27.
Saving new DeepSpeech-formatted CSV file to:  /DeepSpeech/cv-corpus-6.1-2020-12-11/eo/clips/other.csv
Writing CSV file for DeepSpeech.py as:  /DeepSpeech/cv-corpus-6.1-2020-12-11/eo/clips/other.csv
$ docker run -it mozilla/deepspeech:v0.9.3
# apt update
# apt install sox libsox-fmt-mp3
# cd /DeepSpeech
# wget esperanto-release-URL
# tar xf eo.tar.gz
# python bin/import_cv2.py cv-corpus-6.1-2020-12-11/eo/
[...]
# ls -hal cv-corpus-6.1-2020-12-11/eo/clips/*.csv
-rw-r--r-- 1 root root 726K Mar 30 08:22 cv-corpus-6.1-2020-12-11/eo/clips/dev.csv
-rw-r--r-- 1 root root 253K Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/other.csv
-rw-r--r-- 1 root root 727K Mar 30 08:22 cv-corpus-6.1-2020-12-11/eo/clips/test.csv
-rw-r--r-- 1 root root 1.9M Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/train-all.csv
-rw-r--r-- 1 root root 1.6M Mar 30 08:23 cv-corpus-6.1-2020-12-11/eo/clips/train.csv
-rw-r--r-- 1 root root 4.5M Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/validated.csv

You made me discover that meaning. This was just a typo. And to be honest, my sentence would make no sense at all with that reading of "yoyr".

Hi Tim,

Are you able to clarify some information for me? I’m a colleague of @lissy’s and I wrote the recent PlayBook - if there’s an error in there I’d like to get to the bottom of it so that other DeepSpeech developers don’t experience the same frustration.

  • DeepSpeech version - I will assume 0.9.3
  • Setup - if you are using the PlayBook, I will assume Ubuntu Linux under Docker, but could you please confirm?
  • And are you using the import instructions in the data section of the PlayBook?
  • And if you use these instructions, the resulting csv file is headers only, no data? Are you able to provide terminal output or the error message that occurs? This will help us to resolve the error.

So, now that I have shown you the code is working, can we get actionable items, and can you share the content of that file?

No, I haven’t used a Docker environment.
@lissyx
what does a

head cv-corpus-6.1-2020-12-11/eo/clips/dev.csv

output?

for me it returns

wav_filename,wav_filesize,transcript

Thanks for clarifying. Are you on Windows, Linux etc? Python version?
These all have a bearing on the importer file paths.
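
The details asked for above can be gathered in one go. This is only a minimal sketch using the Python standard library; the function name `describe_environment` is made up for illustration and is not part of DeepSpeech. The virtualenv check assumes Python 3.3+ with `venv` or a recent `virtualenv`.

```python
import platform
import sys


def describe_environment():
    """Collect the basic facts a maintainer usually asks for first."""
    return {
        "os": platform.system(),            # e.g. "Linux" or "Windows"
        "os_release": platform.release(),
        "machine": platform.machine(),      # e.g. "x86_64"
        "python": platform.python_version(),
        "executable": sys.executable,       # path of the interpreter in use
        # Heuristic: inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Pasting this output into the thread, run from the same shell used for the importer, answers the OS and Python version questions at once.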

As you can see in the ls -hal output, the files are hundreds of KB or even MB.
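
File size is one signal; counting the rows after the header is another. The following is a rough sketch, assuming the CSVs are UTF-8 and start with a single header line as in the `head` output quoted in this thread. The helper names `count_data_rows` and `report` are hypothetical.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path):
    """Return the number of rows that follow the header line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, if the file has one
        return sum(1 for _ in reader)


def report(clips_dir):
    """Map each CSV file name in clips_dir to its number of data rows."""
    return {path.name: count_data_rows(path)
            for path in sorted(Path(clips_dir).glob("*.csv"))}


# Example, using the directory from this thread:
# print(report("cv-corpus-6.1-2020-12-11/eo/clips"))
```

A count of 0 for a file would match the "headers only" symptom described above.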

That is why, since the beginning, I have been insisting that you share your complete setup. We can’t help you if we don’t know precisely what you are doing.

Ran the import again with your alphabet, it’s still working:

# ls -hal cv-corpus-6.1-2020-12-11/eo/clips/*.csv
-rw-r--r-- 1 root root 719K Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/dev.csv
-rw-r--r-- 1 root root 252K Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/other.csv
-rw-r--r-- 1 root root 719K Mar 30 08:36 cv-corpus-6.1-2020-12-11/eo/clips/test.csv
-rw-r--r-- 1 root root 1.9M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/train-all.csv
-rw-r--r-- 1 root root 1.6M Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/train.csv
-rw-r--r-- 1 root root 4.5M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/validated.csv
# head cv-corpus-6.1-2020-12-11/eo/clips/train.csv
wav_filename,wav_filesize,transcript
common_voice_eo_20690131.wav,205100,hiroŝimo estis la sepa urbo laŭ nombro da loĝantoj
common_voice_eo_20690133.wav,152876,ĝi estis iama regna burgo
common_voice_eo_20690129.wav,169004,kun la akvo ankaŭ venas la salo
common_voice_eo_20725920.wav,195884,temas pri malpliiĝanta birdospecio
common_voice_eo_20729065.wav,167468,la lasta speco estas propra al ĉinio
common_voice_eo_20690234.wav,221228,tio estas ankaŭ dank al aktiveco de entreprenistoj
common_voice_eo_20725924.wav,144428,stacioj aspektas relative simile
common_voice_eo_20711894.wav,207404,gravas ankaŭ la geometrio kaj konstruo de ĉirkaŭa medio
common_voice_eo_20690130.wav,318764,la unuiĝinta reĝlando estis la nura lando ankoraŭ milita kontraŭ francio dum alia jaro

OK, if it works for you, it’s probably fine. For me it reproducibly doesn’t generate any file content (but I have a workaround, so it’s fine). (This is on Python 3.7, in a virtual environment set up as described on readthedocs, on 64-bit Arch Linux with python3.7 from the AUR.)

Once again, we need the exact steps you followed.

No it’s not fine: this code works. You should not need a workaround.

Either there is a bug in our docs, or somewhere else.

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?
What is the exact release of common voice you are using?
Can you share complete output (stdout, stderr) when running the importer?
Can you share exact setup steps from the beginning (no “I did as the docs” please)?
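
One way to capture the complete importer output is to merge stderr into stdout and write everything to a log file. This is a sketch only: `run_and_log` is a hypothetical helper, and `text=True` assumes Python 3.7 or newer.

```python
import subprocess


def run_and_log(command, log_path):
    """Run command, save stdout and stderr together to log_path, return the exit code."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream as stdout
        text=True,
    )
    with open(log_path, "w", encoding="utf-8") as log:
        log.write(result.stdout)
    return result.returncode


# Example, using the importer command from this thread:
# run_and_log(["python", "bin/import_cv2.py", "cv-corpus-6.1-2020-12-11/eo/"], "import.log")
```

The resulting log file can then be attached to the thread as-is, together with the exit code.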

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?
I downloaded it multiple times and curl didn’t report any issue while downloading. I didn’t verify the checksum because I couldn’t find one when searching Google for site:commonvoice.mozilla.org checksum or site:deepspeech.readthedocs.io checksum.

Can you share complete output (stdout, stderr) when running the importer?

yes ok later (evening)

Can you share exact setup steps from the beginning (no “I did as the docs” please)?

I’ll put my stuff in a git repo.


The fact that curl does not complain does not prove the download is intact.

There’s a checksum on the bottom of the download page, after you click on the button: sha256 checksum: c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c
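
Comparing a download against that value could look like the sketch below. It assumes the archive is saved as eo.tar.gz (the file name used earlier in this thread) and that the quoted checksum belongs to that archive. The helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large archive does not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example, with the checksum quoted above (assumed to be for eo.tar.gz):
# EXPECTED = "c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c"
# print(matches("eo.tar.gz", EXPECTED))
```

A result of False would point to a corrupted or truncated download rather than to the importer.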

I used the direct links https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/[...].tar.gz