bin/import_cv2.py seems broken

lissyx · March 30, 2021, 8:19am

@54696d21 If you are not willing to share actionable items and proper steps to reproduce, how can you expect me to investigate and help you?

54696d21 · March 30, 2021, 8:23am

To give you the benefit of the doubt here: what do you think “yoyr” means?
(this seems a common interpretation: https://www.urbandictionary.com/define.php?term=YOYR)
Which I understand as a rude sexist slur (hence unprofessional)

lissyx · March 30, 2021, 8:23am

What.

On my keyboard, Y is next to U: that’s called a typo.

lissyx · March 30, 2021, 8:26am

Skipped 53 samples that were longer than 10 seconds.
Final amount of imported audio: 4:48:24 from 4:57:27.
Saving new DeepSpeech-formatted CSV file to:  /DeepSpeech/cv-corpus-6.1-2020-12-11/eo/clips/other.csv
Writing CSV file for DeepSpeech.py as:  /DeepSpeech/cv-corpus-6.1-2020-12-11/eo/clips/other.csv

$ docker run -it mozilla/deepspeech:v0.9.3
# apt update
# apt install sox libsox-fmt-mp3
# cd /DeepSpeech
# wget esperanto-release-URL
# tar xf eo.tar.gz
# python bin/import_cv2.py cv-corpus-6.1-2020-12-11/eo/
[...]
# ls -hal cv-corpus-6.1-2020-12-11/eo/clips/*.csv
-rw-r--r-- 1 root root 726K Mar 30 08:22 cv-corpus-6.1-2020-12-11/eo/clips/dev.csv
-rw-r--r-- 1 root root 253K Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/other.csv
-rw-r--r-- 1 root root 727K Mar 30 08:22 cv-corpus-6.1-2020-12-11/eo/clips/test.csv
-rw-r--r-- 1 root root 1.9M Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/train-all.csv
-rw-r--r-- 1 root root 1.6M Mar 30 08:23 cv-corpus-6.1-2020-12-11/eo/clips/train.csv
-rw-r--r-- 1 root root 4.5M Mar 30 08:24 cv-corpus-6.1-2020-12-11/eo/clips/validated.csv

lissyx · March 30, 2021, 8:28am

You made me discover that. This was just a typo. And to be honest, my sentence with this yoyr would make no sense at all.

kreid · March 30, 2021, 8:28am

Hi Tim,

Are you able to clarify some information for me? I’m a colleague of @lissyx’s and I wrote the recent PlayBook - if there’s an error in there I’d like to get to the bottom of it so that other DeepSpeech developers don’t experience the same frustration.

DeepSpeech version - I will assume 0.9.3
Setup - if you are using the PlayBook, I will assume Ubuntu Linux under Docker, but could you please confirm?
And are you using the import instructions in the data section of the PlayBook?
And if you use these instructions, the resulting csv file is headers only, no data? Are you able to provide terminal output or the error message that occurs? This will help us to resolve the error.

lissyx · March 30, 2021, 8:30am

So, now that I have proved to you that the code is working, can we get actionable items, and can you share the content of that file?

54696d21 · March 30, 2021, 8:34am

https://gist.github.com/54696d21/ce879af04b8ae0cd4c723e103f88034e

gistfile1.txt

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e

(This file has been truncated.)

54696d21 · March 30, 2021, 8:37am

No, I haven’t used a Docker environment.
@lissyx
what does a

head cv-corpus-6.1-2020-12-11/eo/clips/dev.csv

output?

for me it returns

wav_filename,wav_filesize,transcript

kreid · March 30, 2021, 8:38am

Thanks for clarifying. Are you on Windows, Linux etc? Python version?
These all have a bearing on the importer file paths.
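
A minimal sketch for collecting those details in one go, so they can be pasted into a reply. It is not part of DeepSpeech; the function name `describe_environment` is made up for illustration, and it only uses the Python standard library.

```python
import platform
import sys


def describe_environment():
    """Collect the basics asked for above: OS, architecture, Python version."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        # True when running inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it with the same interpreter that runs the importer matters, since a virtual environment may use a different Python than the system one.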

lissyx · March 30, 2021, 8:38am

As you can see in the ls -hal the files are hundreds of KB or even MB.
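
File sizes are a quick check; counting the rows after the header makes the comparison between two setups explicit. A small sketch, assuming the CSV files use the header shown in this thread (`wav_filename,wav_filesize,transcript`). The helper names `count_data_rows` and `report` are hypothetical, not part of DeepSpeech.

```python
import csv
from pathlib import Path

# Header as printed by `head` earlier in this thread.
EXPECTED_HEADER = ["wav_filename", "wav_filesize", "transcript"]


def count_data_rows(csv_path):
    """Return the number of rows after the header in an importer CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            raise ValueError(f"unexpected header in {csv_path}: {header!r}")
        return sum(1 for _ in reader)


def report(clips_dir):
    """Map each CSV file name in clips_dir to its data-row count."""
    csv_files = sorted(Path(clips_dir).glob("*.csv"))
    return {path.name: count_data_rows(path) for path in csv_files}


# Example call, using the directory from this thread:
# print(report("cv-corpus-6.1-2020-12-11/eo/clips"))
```

A count of 0 for every file would match the "header only" symptom reported above.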

lissyx · March 30, 2021, 8:38am

That is why since the beginning I’m insisting on your sharing your setup completely. We can’t help you if we don’t know what you do precisely.

lissyx · March 30, 2021, 8:39am

Ran the import again with your alphabet, it’s still working:

# ls -hal cv-corpus-6.1-2020-12-11/eo/clips/*.csv
-rw-r--r-- 1 root root 719K Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/dev.csv
-rw-r--r-- 1 root root 252K Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/other.csv
-rw-r--r-- 1 root root 719K Mar 30 08:36 cv-corpus-6.1-2020-12-11/eo/clips/test.csv
-rw-r--r-- 1 root root 1.9M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/train-all.csv
-rw-r--r-- 1 root root 1.6M Mar 30 08:37 cv-corpus-6.1-2020-12-11/eo/clips/train.csv
-rw-r--r-- 1 root root 4.5M Mar 30 08:38 cv-corpus-6.1-2020-12-11/eo/clips/validated.csv

lissyx · March 30, 2021, 8:39am

# head cv-corpus-6.1-2020-12-11/eo/clips/train.csv
wav_filename,wav_filesize,transcript
common_voice_eo_20690131.wav,205100,hiroŝimo estis la sepa urbo laŭ nombro da loĝantoj
common_voice_eo_20690133.wav,152876,ĝi estis iama regna burgo
common_voice_eo_20690129.wav,169004,kun la akvo ankaŭ venas la salo
common_voice_eo_20725920.wav,195884,temas pri malpliiĝanta birdospecio
common_voice_eo_20729065.wav,167468,la lasta speco estas propra al ĉinio
common_voice_eo_20690234.wav,221228,tio estas ankaŭ dank al aktiveco de entreprenistoj
common_voice_eo_20725924.wav,144428,stacioj aspektas relative simile
common_voice_eo_20711894.wav,207404,gravas ankaŭ la geometrio kaj konstruo de ĉirkaŭa medio
common_voice_eo_20690130.wav,318764,la unuiĝinta reĝlando estis la nura lando ankoraŭ milita kontraŭ francio dum alia jaro

54696d21 · March 30, 2021, 8:42am

OK, if it works for you it’s probably fine. For me it reproducibly doesn’t generate any file content (but I have a workaround, so it’s fine). This is on Python 3.7, in a virtual environment set up as described on readthedocs, on 64-bit Arch Linux with python3.7 from the AUR.

lissyx · March 30, 2021, 8:45am

Once again, we need the exact steps you followed.

No it’s not fine: this code works. You should not need a workaround.

Either there is a bug in our docs, or somewhere else.

lissyx · March 30, 2021, 8:47am

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?
What is the exact release of common voice you are using?
Can you share complete output (stdout, stderr) when running the importer?
Can you share exact setup steps from the beginning (no “I did as the docs” please)?
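
One possible way to capture the complete output for sharing: a sketch that runs a command and writes its interleaved stdout and stderr to a log file. The name `run_and_log` and the log file name are assumptions made for illustration; the importer command in the usage comment is the one shown earlier in this thread.

```python
import subprocess


def run_and_log(command, log_path):
    """Run command, write combined stdout/stderr to log_path, return the exit code."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, in order
        text=True,                 # decode bytes to str (Python 3.7+)
    )
    with open(log_path, "w", encoding="utf-8") as log:
        log.write("$ " + " ".join(command) + "\n")
        log.write(result.stdout)
        log.write(f"\n[exit code: {result.returncode}]\n")
    return result.returncode


# Example call (paths as used earlier in this thread):
# run_and_log(
#     ["python", "bin/import_cv2.py", "cv-corpus-6.1-2020-12-11/eo/"],
#     "import_cv2.log",
# )
```

The log then contains the exact command line, everything the importer printed, and its exit code, which is usually enough for someone else to start investigating.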

54696d21 · March 30, 2021, 8:54am

Have you verified the checksum of the eo.tar.gz file to ensure it was downloaded properly?

I downloaded it multiple times and curl didn’t report any issue while downloading. I didn’t verify it because I couldn’t find checksums by searching Google for site:commonvoice.mozilla.org checksum or site:deepspeech.readthedocs.io checksum.

Can you share complete output (stdout, stderr) when running the importer?

yes ok later (evening)

Can you share exact setup steps from the beginning (no “I did as the docs” please)?

I’ll put my stuff in a git repo.

lissyx · March 30, 2021, 8:56am

The fact that curl does not complain does not prove the download is intact.

There’s a checksum on the bottom of the download page, after you click on the button: sha256 checksum: c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c
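
A minimal sketch of how that checksum could be compared against a local download, using only the standard library. The helper names are hypothetical, and the usage comment assumes the archive is saved as `eo.tar.gz` and that the checksum above belongs to that exact release.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """True if the file's SHA-256 digest equals expected_hex (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example call, with the checksum quoted above:
# matches_checksum(
#     "eo.tar.gz",
#     "c19900010aee0f9eb39416406598509b1cdba136a16318e746b1a64f97d7809c",
# )
```

A mismatch would point to a corrupted or incomplete download, or to a different release than the one the checksum was published for.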

54696d21 · March 30, 2021, 8:59am

I used the direct links https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/[...].tar.gz