Alphabet cannot encode transcript

Mohammad · October 14, 2020, 12:46pm

Hey,
I was trying to train my own model on an English dataset
(50 GB from Common Voice),
but after the first epoch I got the following error:

Alphabet cannot encode transcript “” while processing sample “…/DataBase/en/clips/common_voice_en_16556180.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [].

Does anybody know why this happened?

lissyx · October 14, 2020, 2:19pm

Care to follow the basic guidelines before reaching out for support, and explain more precisely what "I was trying to train my own model with English data set" stands for?

Mohammad · October 15, 2020, 3:53pm

I downloaded the 50 GB dataset from Common Voice and was trying to train on it with this guide, but I got the error that I mentioned in the first post.

othiele · October 15, 2020, 6:46pm

The import script is not perfect, follow the guidelines and search before you post:

EsakkiSundar_Varatharajan · November 20, 2020, 8:02am

I'm getting the same error. I followed the steps given in Training 0.9.1, and I get it when I run this command:

python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv

ValueError: Alphabet cannot encode transcript “it’s true then” while processing sample “/mnt/d/Sundar/DeepSpeechDataSet/cv-corpus-5.1-2020-06-22/en/clips/common_voice_en_22311413.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: ['’'].

Please let me know how to resolve this issue.
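
One possible approach for this particular error, as a rough sketch: the missing character is the typographic apostrophe ’ (U+2019), not the plain ASCII apostrophe. Assuming your alphabet already contains the ASCII apostrophe, the transcripts could be normalized before training. The helper name `normalize_apostrophes` is hypothetical and not part of DeepSpeech.

```python
# Map typographic single quotes to the ASCII apostrophe.
# Assumption: the alphabet contains "'" but not U+2018 / U+2019.
_APOSTROPHES = str.maketrans({"\u2018": "'", "\u2019": "'"})


def normalize_apostrophes(transcript: str) -> str:
    """Return the transcript with curly single quotes replaced by "'"."""
    return transcript.translate(_APOSTROPHES)


print(normalize_apostrophes("it\u2019s true then"))
```

The alternative is to add ’ to the alphabet itself, but then the model has to learn two symbols for the same sound, so normalizing the text is usually the simpler choice.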

rakesh_reddy · April 7, 2021, 1:11pm

I am also getting the same error. Can anyone help me resolve this?

ValueError: Alphabet cannot encode transcript “course 1 and course 2 mass been different” while processing sample “audio/Train/3770_2.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [‘1’, ‘2’].
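
For errors like this one, where the message lists the missing characters, it can help to scan the whole corpus up front instead of hitting the failures one sample at a time. Below is a minimal sketch, not DeepSpeech's own code. It assumes an alphabet file with one character per line, `#` starting a comment line and `\#` standing for a literal `#`. The function names `load_alphabet` and `missing_characters` are made up for illustration.

```python
from typing import Iterable, List, Set


def load_alphabet(lines: Iterable[str]) -> Set[str]:
    """Parse alphabet lines: one character per line, '#' starts a comment.

    Assumption: a literal '#' is written as a backslash followed by '#'.
    """
    chars = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("\\#"):
            chars.add("#")
        elif line.startswith("#") or line == "":
            continue  # comment or blank line
        else:
            chars.add(line)
    return chars


def missing_characters(transcripts: Iterable[str], alphabet: Set[str]) -> List[str]:
    """Return the sorted characters used in transcripts but absent from the alphabet."""
    used = set()
    for transcript in transcripts:
        used.update(transcript)
    return sorted(used - alphabet)


# Example alphabet: space, a-z and the ASCII apostrophe.
alphabet = load_alphabet(
    ["# comment\n", " \n"] + [c + "\n" for c in "abcdefghijklmnopqrstuvwxyz'"]
)
print(missing_characters(["course 1 and course 2 mass been different"], alphabet))
```

Once the missing characters are known, the choice is between adding them to the alphabet or rewriting the transcripts so they only use characters the alphabet already has.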

michalis_p · May 27, 2021, 9:24am

I have the same problem as described by Mohammad: “Missing characters are: []”.
I am following the Playbook and have put into the alphabet all characters that appear in the .csv files (I double-checked with check_characters.py).

How do I put the “” character into the alphabet?

ftyers · May 28, 2021, 2:38am

Not to advertise, but if you’re using Common Voice data, consider using commonvoice-utils which contains preprocessing code and language data for most of the languages of Common Voice.

michalis_p · May 28, 2021, 6:22am

I am not using Common Voice; I have my own voice recordings and CSV data that I parsed.
I found the source of the problem I described.
The “” "character" came from a bad entry in the test.csv file, which was producing the error. The row had the wav source and the length, but the transcript string was empty (""), so training could not handle the empty transcript.
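
For anyone hitting the same thing, here is a minimal sketch of how such rows could be found before training. It assumes the CSV has a header row with `wav_filename` and `transcript` columns; the function name `find_empty_transcripts` is hypothetical.

```python
import csv
import io
from typing import Iterable, List, Tuple


def find_empty_transcripts(csv_file: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (row number, wav_filename) for rows with a blank transcript.

    Assumption: the header names 'wav_filename' and 'transcript' columns.
    """
    reader = csv.DictReader(csv_file)
    empty = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        if not (row.get("transcript") or "").strip():
            empty.append((row_number, row.get("wav_filename") or ""))
    return empty


# In-memory sample standing in for a real file such as test.csv.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,hello world\n"
    "b.wav,2000,\n"
)
print(find_empty_transcripts(sample))
```

For a real file, pass `open("test.csv", newline="", encoding="utf-8")` instead of the in-memory sample, then remove or fix the reported rows.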

ftyers · May 29, 2021, 4:45pm

Great! So you solved it?

michalis_p · May 31, 2021, 9:59am

Yes, it was solved.
But when I use a different scorer, new words are still not recognized correctly,
while the words that were included in the training set are recognized with an almost 100% success rate.
So my question now is whether we will have to train again with the new vocabulary to get a better success rate.

ftyers · June 1, 2021, 6:34pm

The answer is probably yes, i.e. training with in-domain data will improve your results.