Skipped % samples that failed on transcript validation

Hello, I think I am getting annoying with my questions.

I am using the Common Voice dataset downloaded for the Latvian language. I have tried to build a model and have fixed an issue with UTF-8, but now samples are failing transcript validation. What does "failed on transcript validation" mean?

python3 import_cv2.py --filter_alphabet /home/stass/latvian1/alphabet.txt /home/stass/latvian1
Loading TSV file:  /home/stass/latvian1/train.tsv
Saving new DeepSpeech-formatted CSV file to:  /home/stass/latvian1/clips/train.csv
Importing mp3 files...
Progress |###################################################  |  97% completedWriting CSV file for DeepSpeech.py as:  /home/stass/latvian1/clips/train.csv
Progress |#####################################################| 100% completed
Imported 120 samples.
Skipped 119 samples that failed on transcript validation.
Final amount of imported audio: 0:07:33.
Loading TSV file:  /home/stass/latvian1/test.tsv
Saving new DeepSpeech-formatted CSV file to:  /home/stass/latvian1/clips/test.csv
Importing mp3 files...
Progress |#######################################              |  75% completedWriting CSV file for DeepSpeech.py as:  /home/stass/latvian1/clips/test.csv
Progress |#                                                    | 100% completed
Imported 4 samples.
Skipped 4 samples that failed on transcript validation.
Final amount of imported audio: 0:00:14.
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed

Just take a look at the importer’s code: https://github.com/mozilla/DeepSpeech/blob/29a2ac37f001e2b37a720d5b8da4a64b4aa384d6/bin/import_cv2.py#L119-L120
https://github.com/mozilla/DeepSpeech/blob/29a2ac37f001e2b37a720d5b8da4a64b4aa384d6/bin/import_cv2.py#L81-L83

It’s those which label_filter rejected.
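To illustrate what "rejected by label_filter" can mean in practice, here is a minimal sketch of an alphabet-based filter. It is not the importer's actual code: the function name sketch_label_filter and the toy alphabet are hypothetical, and the only assumption carried over from the thread is that a rejected transcript is represented as None.

```python
def sketch_label_filter(label, alphabet):
    """Hypothetical filter: keep a transcript only if every character is allowed."""
    label = label.strip()
    if not label:
        # An empty transcript is treated as invalid in this sketch.
        return None
    for ch in label:
        if ch not in alphabet:
            # A single unknown character rejects the whole transcript.
            return None
    return label


# Toy alphabet for illustration only: three letters plus the space character.
toy_alphabet = set("abc ")

print(sketch_label_filter("cab ba", toy_alphabet))   # every character is allowed
print(sketch_label_filter("cab, ba", toy_alphabet))  # the comma is not in the alphabet
```

The point of the sketch is that one character missing from the alphabet is enough to discard an entire sample, so a small gap in alphabet.txt can reject a large share of the data.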

Can I find out what exactly triggers the warning?

label = label_filter(sample[1])
with lock:
    if file_size == -1:
        # Excluding samples that failed upon conversion
        counter['failed'] += 1
    elif label is None:
        # Excluding samples that failed on label validation
        counter['invalid_label'] += 1
    elif int(frames/SAMPLE_RATE*1000/10/2) < len(str(label)):
        # Excluding samples that are too short to fit the transcript
        counter['too_short'] += 1
    elif frames/SAMPLE_RATE > MAX_SECS:
        # Excluding very long samples to keep a reasonable batch-size
        counter['too_long'] += 1

Please, this is free and open source software. I’ve linked you to the proper code and explained it to you; can you make an effort and read label_filter?

OK, thank you, I will try my best.

What does "invalid label" mean? That was my question, sorry for the unclear explanation.

I already replied to you. Those are the ones rejected by label_filter.

Usually just 1% of CV data is corrupt; for smaller languages this can be higher. @lissyx is hinting that about half of your samples seem to be bad and are therefore left out. Simply go through the directory and listen to them. You’ll see the problem.

For better analysis I would put something like

print("Bad file (reason -1): " + sample[0])

in each if branch to see which files cause which problem. This will help you in the future; I know that from experience.
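The debugging idea above can be sketched as a standalone function that mirrors the four checks from the quoted snippet and returns the reason as a string. This is only an illustration: the function name rejection_reason and the sample tuples are made up, and the values of SAMPLE_RATE and MAX_SECS are assumptions, so check them against the constants in your copy of import_cv2.py.

```python
SAMPLE_RATE = 16000  # assumed value; use the constant from your importer
MAX_SECS = 10        # assumed value; use the constant from your importer


def rejection_reason(file_size, label, frames):
    """Return the name of the first failing check, or None if the sample passes."""
    if file_size == -1:
        return "failed"          # conversion failed
    if label is None:
        return "invalid_label"   # transcript rejected by the label filter
    if int(frames / SAMPLE_RATE * 1000 / 10 / 2) < len(str(label)):
        return "too_short"       # audio too short to fit the transcript
    if frames / SAMPLE_RATE > MAX_SECS:
        return "too_long"        # audio longer than the allowed maximum
    return None


# Made-up samples: (filename, file_size, label, frames)
samples = [
    ("clip_a.mp3", 48000, "labdien", 3 * SAMPLE_RATE),
    ("clip_b.mp3", 48000, None, 3 * SAMPLE_RATE),
    ("clip_c.mp3", -1, "labdien", 0),
]

for filename, file_size, label, frames in samples:
    reason = rejection_reason(file_size, label, frames)
    if reason is not None:
        print("Bad file (reason " + reason + "): " + filename)
```

Printing the reason next to the filename makes it easy to tell a broken audio file apart from a transcript that the label filter rejected.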

I found the solution to this: alphabet.txt must contain the space character.
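For anyone hitting the same problem, a quick way to spot such a gap is to count which characters in the transcripts are missing from the alphabet. The sketch below is an assumption-laden illustration, not part of DeepSpeech: load_alphabet and missing_characters are hypothetical helpers, and the parsing assumes one character per line with lines starting with # treated as comments. Note that it strips only the newline, so a line holding a single space survives.

```python
from collections import Counter


def load_alphabet(lines):
    """Hypothetical parser: one character per line, '#' lines are comments."""
    chars = set()
    for line in lines:
        line = line.rstrip("\n")  # strip only the newline so a lone space is kept
        if line.startswith("\\#"):
            chars.add("#")        # assumed escape for a literal '#'
        elif line.startswith("#") or line == "":
            continue
        else:
            chars.add(line)
    return chars


def missing_characters(alphabet, transcripts):
    """Count, per character, how many transcripts contain it although it is not in the alphabet."""
    missing = Counter()
    for text in transcripts:
        for ch in set(text):
            if ch not in alphabet:
                missing[ch] += 1
    return missing


# In-memory stand-ins for alphabet.txt and the sentence column of train.tsv.
alphabet_without_space = load_alphabet(["# comment\n", "a\n", "b\n", "ā\n"])
alphabet_with_space = load_alphabet(["# comment\n", " \n", "a\n", "b\n", "ā\n"])
sentences = ["ab ā", "ba", "a b"]

print(missing_characters(alphabet_without_space, sentences))
print(missing_characters(alphabet_with_space, sentences))
```

With real data you would pass the lines of alphabet.txt and the transcripts from the TSV files; any character with a high count is a likely reason for the skipped samples.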
