Hey,
I was trying to train my own model with an English data set (50 GB from Common Voice), but after the first epoch I got the following error:
Alphabet cannot encode transcript “” while processing sample “…/DataBase/en/clips/common_voice_en_16556180.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [].
Does anybody know why this happened?
lissyx
Care to follow the basic guidelines before reaching out for support, and explain more precisely what "train my own model" means here?
I'm getting the same error. I followed the steps given in Training 0.9.1, and when I run the following command, I get this error:
python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
ValueError: Alphabet cannot encode transcript “it’s true then” while processing sample “/mnt/d/Sundar/DeepSpeechDataSet/cv-corpus-5.1-2020-06-22/en/clips/common_voice_en_22311413.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [’’’].
I am also getting the same error. Can anyone help me resolve this?
ValueError: Alphabet cannot encode transcript “course 1 and course 2 mass been different” while processing sample “audio/Train/3770_2.wav”, check that your alphabet contains all characters in the training corpus. Missing characters are: [‘1’, ‘2’].
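For anyone hitting the errors above, here is a minimal sketch of how one might list the characters that appear in the transcripts but not in the alphabet before starting a training run. It assumes the usual CSV columns (`wav_filename`, `wav_filesize`, `transcript`) and an alphabet file with one character per line, where lines starting with `#` are comments and `\#` stands for a literal `#`. The helper names `load_alphabet` and `missing_characters` and the sample file names are hypothetical, not part of DeepSpeech.

```python
import csv
import io


def load_alphabet(lines):
    """Collect alphabet entries, one per line (assumed format, see lead-in)."""
    chars = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("\\#"):
            chars.add("#")      # escaped literal '#'
        elif line.startswith("#"):
            continue            # comment line
        elif line:
            chars.add(line)     # includes the single-space entry
    return chars


def missing_characters(rows, alphabet):
    """Map each character not in the alphabet to the first file that uses it."""
    missing = {}
    for row in rows:
        for ch in row["transcript"]:
            if ch not in alphabet:
                missing.setdefault(ch, row["wav_filename"])
    return missing


if __name__ == "__main__":
    # Illustrative alphabet: space, a-z and the ASCII apostrophe.
    alphabet_lines = ["# sample alphabet\n", " \n"]
    alphabet_lines += [c + "\n" for c in "abcdefghijklmnopqrstuvwxyz"]
    alphabet_lines += ["'\n"]
    alphabet = load_alphabet(alphabet_lines)

    # Transcripts taken from the errors in this thread; file names are made up.
    sample_csv = (
        "wav_filename,wav_filesize,transcript\n"
        "a.wav,100,it’s true then\n"
        "b.wav,200,course 1 and course 2 mass been different\n"
    )
    rows = csv.DictReader(io.StringIO(sample_csv))
    for ch, wav in missing_characters(rows, alphabet).items():
        print(f"{ch!r} is not in the alphabet (first seen in {wav})")
```

In these two cases the missing characters are the typographic apostrophe (’) and the digits. One option is to normalize the transcripts (replace ’ with ', spell out numbers); the other is to add those characters to the alphabet.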
I have the same problem that Mohammad described: “Missing characters are: []”.
I am following the Playbook and I have put in the alphabet all the characters that appear in the .csv files (I double-checked with check_characters.py).
Not to advertise, but if you’re using Common Voice data, consider using commonvoice-utils, which contains preprocessing code and language data for most of the languages of Common Voice.
I am not using Common Voice; I have my own voice recordings and CSV data that I parsed.
I found the source of the problem I described.
The “” in the error message came from a bad entry in the test.csv file. The row had the wav source and the length, but the transcript string was empty (""), so the empty transcript could not be encoded.
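In case it helps others with the “Missing characters are: []” variant, here is a rough sketch of how to scan a CSV for rows with an empty transcript. It assumes the columns are named `wav_filename`, `wav_filesize` and `transcript`; the function name `find_empty_transcripts` and the sample rows are made up for illustration.

```python
import csv
import io


def find_empty_transcripts(csv_file):
    """Return (line_number, wav_filename) for rows with an empty transcript.

    Whitespace-only transcripts are reported too, since they are almost
    certainly data errors as well.
    """
    reader = csv.DictReader(csv_file)
    empty = []
    for row in reader:
        transcript = row.get("transcript") or ""
        if not transcript.strip():
            empty.append((reader.line_num, row.get("wav_filename")))
    return empty


if __name__ == "__main__":
    # In-memory sample; rows 3 and 4 have no transcript.
    sample = io.StringIO(
        "wav_filename,wav_filesize,transcript\n"
        "clips/a.wav,1000,hello world\n"
        "clips/b.wav,2000,\n"
        'clips/c.wav,3000,""\n'
    )
    for line_no, wav in find_empty_transcripts(sample):
        print(f"line {line_no}: empty transcript for {wav}")
```

To check a real file, pass an open handle instead, for example `open("test.csv", newline="", encoding="utf-8")`, and run it on the train, dev and test CSVs, since the bad row can be in any of them.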
Yes, it was solved.
However, when I use a different scorer, the words are still not recognized correctly, while the words that were included in the training data are recognized with an almost 100% success rate.
So my question now is: do we have to train again with the new vocabulary to get a better success rate?