Skipped % samples that failed on transcript validation

Stanislavs_Davidovics · March 18, 2020, 5:34pm

Hello, I think I am getting annoying with my questions.

I am using the Common Voice data set downloaded for the Latvian language. I have tried to build a model and have fixed an issue with UTF-8, but now samples are failing on transcript validation. What does "failed on transcript validation" mean?

python3 import_cv2.py --filter_alphabet /home/stass/latvian1/alphabet.txt /home/stass/latvian1
Loading TSV file:  /home/stass/latvian1/train.tsv
Saving new DeepSpeech-formatted CSV file to:  /home/stass/latvian1/clips/train.csv
Importing mp3 files...
Progress |###################################################  |  97% completedWriting CSV file for DeepSpeech.py as:  /home/stass/latvian1/clips/train.csv
Progress |#####################################################| 100% completed
Imported 120 samples.
Skipped 119 samples that failed on transcript validation.
Final amount of imported audio: 0:07:33.
Loading TSV file:  /home/stass/latvian1/test.tsv
Saving new DeepSpeech-formatted CSV file to:  /home/stass/latvian1/clips/test.csv
Importing mp3 files...
Progress |#######################################              |  75% completedWriting CSV file for DeepSpeech.py as:  /home/stass/latvian1/clips/test.csv
Progress |#                                                    | 100% completed
Imported 4 samples.
Skipped 4 samples that failed on transcript validation.
Final amount of imported audio: 0:00:14.
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed

lissyx · March 18, 2020, 5:43pm

Just take a look at the importer’s code: DeepSpeech/bin/import_cv2.py at 29a2ac37f001e2b37a720d5b8da4a64b4aa384d6 · mozilla/DeepSpeech · GitHub

Those are the samples that label_filter rejected.

Stanislavs_Davidovics · March 18, 2020, 6:14pm

Can I find out what exactly triggers the warning?

label = label_filter(sample[1])
with lock:
    if file_size == -1:
        # Excluding samples that failed upon conversion
        counter['failed'] += 1
    elif label is None:
        # Excluding samples that failed on label validation
        counter['invalid_label'] += 1
    elif int(frames/SAMPLE_RATE*1000/10/2) < len(str(label)):
        # Excluding samples that are too short to fit the transcript
        counter['too_short'] += 1
    elif frames/SAMPLE_RATE > MAX_SECS:
        # Excluding very long samples to keep a reasonable batch-size
        counter['too_long'] += 1

lissyx · March 18, 2020, 6:15pm

Please, this is free and open source software. I’ve linked you to the proper code and explained it to you; can you make an effort and read label_filter?

Stanislavs_Davidovics · March 18, 2020, 6:27pm

OK, thank you, I will try my best.

Stanislavs_Davidovics · March 18, 2020, 6:49pm

What does "invalid label" mean? That was my question, sorry for the unclear explanation.

lissyx · March 18, 2020, 6:50pm

I already replied to you: those rejected by label_filter.

othiele · March 18, 2020, 8:32pm

Usually just 1% of CV data is corrupt; for smaller languages this can be higher. @lissyx is hinting that about half of your samples seem to be bad and are therefore left out. Simply go through the directory and listen to them. You’ll see the problem.

For better analysis I would put something like

print("Bad file (reason -1): " + sample[0])

in each if branch to see which file causes which problem. This will help you in the future; I know that from experience.
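
A rough sketch of that idea, not the importer's actual code: the hypothetical helper reject_reason mirrors the checks quoted earlier in the thread and returns the name of the check that rejected a sample, so each skipped file can be printed together with its cause. The values of SAMPLE_RATE and MAX_SECS and the sample tuples are assumptions made up for illustration.

```python
SAMPLE_RATE = 16000  # assumed value, check the importer for the real one
MAX_SECS = 10        # assumed value, check the importer for the real one


def reject_reason(file_size, frames, label):
    """Return the name of the check that rejects a sample, or None if it is kept."""
    if file_size == -1:
        return "failed"          # conversion failed
    if label is None:
        return "invalid_label"   # label_filter rejected the transcript
    if int(frames / SAMPLE_RATE * 1000 / 10 / 2) < len(str(label)):
        return "too_short"       # audio too short to fit the transcript
    if frames / SAMPLE_RATE > MAX_SECS:
        return "too_long"        # audio longer than the allowed maximum
    return None


# Made-up samples: (filename, file_size, frames, label)
samples = [
    ("a.mp3", 1000, 48000, "labdien"),
    ("b.mp3", 1000, 48000, None),
    ("c.mp3", -1, 0, "labdien"),
]

for filename, file_size, frames, label in samples:
    reason = reject_reason(file_size, frames, label)
    if reason is not None:
        print("Bad file (reason " + reason + "): " + filename)
```

With the made-up samples, only b.mp3 and c.mp3 are reported. In the real importer the same print call would sit inside each branch of the existing if/elif chain.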

Stanislavs_Davidovics · March 19, 2020, 12:50pm

I found the solution to this: alphabet.txt must contain the space character as one of its entries.
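
For anyone hitting the same thing, here is a minimal sketch of how to check which characters in a transcript are missing from the alphabet. It is not DeepSpeech code: load_alphabet and missing_chars are hypothetical helpers, and the parsing (one character per line, lines starting with # treated as comments) is a simplified assumption about the alphabet file format.

```python
def load_alphabet(lines):
    """Collect allowed characters, one per line, skipping '#' comment lines."""
    allowed = set()
    for line in lines:
        if line.startswith("#"):
            continue
        char = line.rstrip("\n")  # drop only the newline so a lone space survives
        if char:
            allowed.add(char)
    return allowed


def missing_chars(transcript, allowed):
    """Return the sorted characters of the transcript that are not in the alphabet."""
    return sorted(set(transcript) - allowed)


# Made-up alphabets given as lists of lines, as if read from a file
alphabet_without_space = load_alphabet(["# comment\n", "a\n", "b\n", "c\n"])
alphabet_with_space = load_alphabet(["# comment\n", " \n", "a\n", "b\n", "c\n"])

print(missing_chars("ab ca", alphabet_without_space))  # [' ']
print(missing_chars("ab ca", alphabet_with_space))     # []
```

Running something like this over the sentence column of train.tsv would show quickly whether the space, or any other character, is missing. A line that contains only a space is easy to lose if an editor trims trailing whitespace.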
