Alphabet cannot encode transcript

Hey,
I was trying to train my own model with an English data set
(50 GB from Common Voice),
but after the first epoch I got the following error:

Alphabet cannot encode transcript “” while processing sample “…/DataBase/en/clips/common_voice_en_16556180.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [].

Does anybody know why this happened?

Care to follow the basic guidelines before reaching out for support, and explain more precisely what "I was trying to train my own model with English data set" stands for?

I downloaded the 50 GB data set from Common Voice and was trying to train on it with this guide, but I got the error that I mentioned in my first post.

The import script is not perfect, follow the guidelines and search before you post:

I'm getting the same error. I followed the steps given in Training 0.9.1, and when I run the command python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv I get:

ValueError: Alphabet cannot encode transcript “it’s true then” while processing sample “/mnt/d/Sundar/DeepSpeechDataSet/cv-corpus-5.1-2020-06-22/en/clips/common_voice_en_22311413.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: ['’'].

Please let me know how to resolve this issue.
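For readers hitting the same message: the missing character reported above is the typographic apostrophe (’, U+2019), which is a different character from the ASCII apostrophe ('). One possible fix is to normalize transcripts before training. The snippet below is only a minimal sketch; the helper name normalize_apostrophes is hypothetical, and it assumes your alphabet file already contains the ASCII apostrophe.

```python
# Hypothetical helper, not part of DeepSpeech: map typographic apostrophes
# to the ASCII apostrophe. Assumes "'" is listed in your alphabet file.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark (the one in the error above)
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(transcript: str) -> str:
    """Return the transcript with apostrophe variants replaced by "'"."""
    return transcript.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_apostrophes("it\u2019s true then"))
```

The alternative is to add the ’ character to the alphabet itself, but then the model has to learn two symbols for the same sound, so normalizing is usually the simpler choice.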

I am also getting the same error. Can anyone help me resolve it?

ValueError: Alphabet cannot encode transcript “course 1 and course 2 mass been different” while processing sample “audio/Train/3770_2.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [‘1’, ‘2’].
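Errors like the one above (digits in the transcript, none in the alphabet) can be found before training starts by comparing every transcript against the alphabet. The following is a rough sketch, not the project's own tooling: the function names are hypothetical, and it assumes a CSV with a transcript column and an alphabet file with one character per line where lines starting with # are comments.

```python
import csv
import io


def load_alphabet(lines):
    """Collect alphabet characters (assumed: one per line, '#' starts a comment)."""
    alphabet = set()
    for line in lines:
        line = line.rstrip("\r\n")  # keep a line that is a single space
        if not line or line.startswith("#"):
            continue
        alphabet.add(line)
    return alphabet


def missing_characters(csv_file, alphabet):
    """Return the sorted characters used in transcripts but absent from the alphabet."""
    missing = set()
    for row in csv.DictReader(csv_file):
        missing |= set(row.get("transcript") or "") - alphabet
    return sorted(missing)


if __name__ == "__main__":
    # In-memory sample data; replace with open(...) on your own files.
    alphabet_lines = ["# sample alphabet\n", " \n"]
    alphabet_lines += [c + "\n" for c in "abcdefghijklmnopqrstuvwxyz'"]
    sample_csv = io.StringIO(
        "wav_filename,wav_filesize,transcript\n"
        "audio/a.wav,1000,course 1 and course 2\n"
        "audio/b.wav,1000,it\u2019s true then\n"
    )
    print(missing_characters(sample_csv, load_alphabet(alphabet_lines)))
```

Once the missing characters are known, you can either add them to the alphabet or rewrite the transcripts (for example, spelling out "1" as "one") so that they only use characters the alphabet already has.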

I have the same problem as described by Mohammad: “Missing characters are: []”.
I am following the Playbook and I have put into the alphabet all characters that appear in the .csv files (double-checked with check_characters.py).

How do I add the “” character to the alphabet?

Not to advertise, but if you’re using Common Voice data, consider using commonvoice-utils which contains preprocessing code and language data for most of the languages of Common Voice.

I am not using Common Voice; I have my own voice recordings and CSV data that I parsed.
I found the source of the problem that I described.
The “” was not a character at all but a bad entry in the test.csv file that was producing the error. The row had the wav source and the length, but the transcript string was empty (""), so the empty transcript could not be handled.
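In case it helps others, rows like this can be filtered out before training. This is just a sketch under the assumption that the CSV has wav_filename and transcript columns; the function name split_empty_transcripts is made up for illustration.

```python
import csv
import io


def split_empty_transcripts(csv_file):
    """Return (rows with a transcript, wav names of rows whose transcript is empty)."""
    kept, dropped = [], []
    for row in csv.DictReader(csv_file):
        if (row.get("transcript") or "").strip():
            kept.append(row)
        else:
            dropped.append(row["wav_filename"])
    return kept, dropped


if __name__ == "__main__":
    # In-memory sample data; replace with open(...) on your own CSV.
    sample = io.StringIO(
        "wav_filename,wav_filesize,transcript\n"
        "a.wav,1000,hello world\n"
        "b.wav,2000,\n"
        'c.wav,3000,""\n'
    )
    kept, dropped = split_empty_transcripts(sample)
    print(len(kept), "rows kept; empty transcripts in:", dropped)
```

The kept rows can then be written back out with csv.DictWriter, and the dropped file names tell you which recordings still need a transcript.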

Great! So you solved it?

Yes, it is solved.
However, when I use a different scorer, words are still not recognized correctly,
while the words that were included in the training set are recognized with an almost 100% success rate.
So my question now is: do we have to train again with the new vocabulary to get a better success rate?

The answer is probably yes: training with in-domain data will improve your results.