Error while training alphabet, says it is missing characters

Hi,

I am currently trying to create my own language model with Mozilla DeepSpeech, but when I start the process, it quickly stops with the following error:

KeyError: "ERROR: Your transcripts contain characters (e.g. 'c') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."

First, I checked my alphabet.txt many times. It is located in a different path from the one shown above. Both files contain the 'c' character, along with the other characters present in my dataset transcriptions.

And why is it looking in data/alphabet.txt even though I explicitly tell it to read this file from another location?

Below is the content of my alphabet.txt file:


t
r
a
n
s
c
i
p
v
o
l
à
 
e
q
u
d
é
f
m
x
j
'
h
g
è
y
b
ù
ç
ê
ô
z
â
œ
î
k
û
w

And here is the command I execute to start the training.

sudo python -u DeepSpeech.py   --train_files /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/train/train.csv   --dev_files /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/dev/dev.csv   --test_files /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/test/test.csv   --train_batch_size 80   --dev_batch_size 80   --test_batch_size 40   --n_hidden 375   --epochs 33   --early_stop True   --es_steps 6   --es_mean_th 0.1   --es_std_th 0.1   --dropout_rate 0.22   --learning_rate 0.00095   --report_count 100   --export_dir /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/results/model_export/   --checkpoint_dir /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/results/checkout/   --alphabet_config_path /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/alphabet.txt   --lm_binary_path /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/lm.binary   --lm_trie_path /media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/trie

Do not hesitate to ask me for other information, code snippets…

Thanks in advance for your help!

Are you sure you are passing this alphabet everywhere? If some component somehow relies on the default data/alphabet.txt, you would get tricked.

Which language are you working on?

There's mostly nothing we can do with that report except ask you to re-check. This code is well exercised now, and if it's outputting that error, then you really have a mismatch between your dataset and your alphabet…

Do you get the same alphabet generated with python util/check_characters.py --csv-files [...] --alphabet-format ?

Hi, thanks for the fast answer!

That is why I replaced data/alphabet.txt with a copy of my own, but it does not seem to change anything…

Sorry, I am creating a generic French language model.

I thought the same, but even after hours of searching I cannot find where I did something wrong…

I tried the command you gave me and copied/pasted the alphabet output into both the default alphabet and my own, but the same error appears again. Maybe I should reinstall DeepSpeech and the other modules.

Could you please have a look at https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/ and share the effort?

At that point, I guess you really need to explain what you do from scratch / how you run things.

There's no "install deepspeech" involved in the training process, for example.

You should not need sudo.

I'm also unsure about your paths: they contain a lot of backslashes as well as spaces. Maybe that breaks some of the TensorFlow flag parsing logic and it does not pick up your path?
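To see what the training script would receive for one of those escaped paths, the shell's tokenization can be approximated with Python's shlex module. This is only a sketch: shlex mimics POSIX quoting rules, and it says nothing about how the flag parser treats the value afterwards.

```python
import shlex

# Fragment of the command from this thread, exactly as typed in the shell.
typed = (
    r"--alphabet_config_path "
    r"/media/sf_\[VRT\]_Debian_STT_v1/Language\ Model/script_preparation_data/alphabet.txt"
)

# shlex.split follows POSIX shell quoting rules: a backslash outside quotes
# escapes the next character, so "\ " becomes a literal space and "\[" a "[".
argv = shlex.split(typed)

for arg in argv:
    print(repr(arg))
```

If the second printed value is a single string containing a plain space and plain brackets, the escaping itself is fine and the problem would have to be further down.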

I never thought that it would matter in any way, but yeah, I will try with more conventional paths, you're right.

So I am working on an Ubuntu 18.04.3 virtual machine, with the dataset on a shared folder so I can work with it easily from both my host and client machines.

I need the sudo command because I am targeting the dataset located in my shared folder, seen by Ubuntu as an external drive.

I mainly followed the tutorial made by elpimous_robot (TUTORIAL : How I trained a specific french model to control my robot), adapting the links and commands as needed, and using the 2018 French audio dataset provided by CommonVoice.

Well, that tutorial might not be completely up to date.

Yeah, but then it's hard to know what Python env etc. is actually getting loaded.

Without more context on your error, that's the best I can guess.

But honestly, it would be cool if you could build on the work being done on the French model, so that efforts are shared.


It looks like the last line is not an empty line. I usually start with this alphabet file and add characters later. It has 33 lines:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
# The last (non-comment) line needs to end with a newline.

Last line is empty
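The format rules stated in that file's comments can be checked with a few lines of Python. The sketch below is not DeepSpeech's own loader. `read_alphabet` is a hypothetical helper that assumes one label per line, `#` for comments, `\#` for a literal `#`, and it skips fully empty lines.

```python
import os
import tempfile


def read_alphabet(path):
    """Return (labels, ends_with_newline) for an alphabet file."""
    # newline="" keeps "\r" characters visible instead of translating them.
    with open(path, "r", encoding="utf-8", newline="") as f:
        raw = f.read()
    ends_with_newline = raw.endswith("\n")
    labels = []
    for line in raw.split("\n"):
        if line.startswith("\\#"):
            labels.append("#")  # escaped '#' used as a label
        elif line.startswith("#") or line == "":
            continue  # comment or empty line (assumption: both are skipped)
        else:
            labels.append(line)
    return labels, ends_with_newline


# Demo on a throwaway file whose last line has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "alphabet.txt")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("# comment line\n \na\nb\nc")
    labels, ends_with_newline = read_alphabet(path)

print(labels)
print("ends with newline:", ends_with_newline)
```

Printing the labels with `repr` (as the list output does) makes a stray space or `\r` next to a letter easy to spot.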


Hello, thanks for your answer!

I checked again and again, but the alphabet seems OK: no formatting issue, the newline is present at the end of the file, and the letter c is in first position! But the issue remains the same…

Even if you generate the alphabet from util/check_characters.py? Please don't rely on any copy/paste procedure, just pipe that to a file. There's no reason it would not work, unless there is something really odd in your dataset…


Like: https://github.com/Common-Voice/commonvoice-fr/blob/e61a59e2de0e43cb2da82d32ff75a8d5457b41c5/DeepSpeech/generate_alphabet.sh#L14-L16
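For illustration, here is roughly what "generate the alphabet from the data instead of typing it" means, written as a standalone Python sketch. It is not the code of util/check_characters.py, which should be preferred. `collect_characters` and `write_alphabet` are hypothetical helpers, and the sketch assumes the CSV has a `transcript` column.

```python
import csv
import io


def collect_characters(csv_file):
    """Return the sorted set of characters in the 'transcript' column."""
    chars = set()
    for row in csv.DictReader(csv_file):
        chars.update(row["transcript"])
    return sorted(chars)


def write_alphabet(chars, out):
    """Write one character per line; every line ends with a newline."""
    for ch in chars:
        out.write(ch + "\n")


# In-memory stand-in for train.csv (hypothetical rows).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,ça va\n"
    "b.wav,1200,très bien\n"
)
chars = collect_characters(sample)
out = io.StringIO()
write_alphabet(chars, out)
print(repr(out.getvalue()))
```

Because the characters come straight from the transcripts, the alphabet contains exactly the code points the training data uses, which is what typing or copy/pasting cannot guarantee.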


Indeed, piping the result of util/check_characters.py into alphabet.txt resolved the issue… I'd have never thought it'd come from there :frowning:

Sorry for the newbie error!

Also, I will definitely have a look at commonvoice-fr so we can work better together.

Thanks again.


Very likely you have some UTF-8 character that looks like an ASCII one. Copy/pasting can mangle it.

I already thought of that and retyped the letters myself without copy/pasting (via nano/gedit, or even outside the VM with VSCode), but it didn't help at all.
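A quick way to test the look-alike theory is to print the code points behind each character. The sketch below uses only Python's standard unicodedata module. The sample characters are illustrative and are not taken from the dataset in this thread.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


latin_c = "c"
cyrillic_es = "\u0441"   # renders like a Latin "c" in most fonts
composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

for sample in (latin_c, cyrillic_es, composed, decomposed):
    print(describe(sample))

# Visually identical strings that do not compare equal:
print(latin_c == cyrillic_es)
print(composed == decomposed)
# NFC normalization makes the two spellings of "é" identical again.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Running `describe` over every line of alphabet.txt and over the characters reported for the CSV files shows at once whether the two sides really use the same code points.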

I am getting a similar error, except that instead of a single letter I'm getting a whole sentence… I don't know why it is looking up a whole sentence… can anyone help?
My alphabet.txt contains all the letters a-z and space.

Here is the error I'm getting:
"ERROR: Your transcripts contain characters (e.g. 'show me todays vitals') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."

Ask for any details you think I should provide to help resolve the issue.

What version of the training code are you using? Did you modify the code?

Did you use the util/check_characters.py script located in the DeepSpeech project?

I am using Deepspeech v0.7.0-alpha.2

Yes, in util/feeding.py on line 122 I changed from

df['transcript'] = df.apply(text_to_char_array, alphabet=Config.alphabet, result_type='reduce', axis=1)

to

df['transcript'] = text_to_char_array(df, alphabet=Config.alphabet)

Yes I did… I created alphabet.txt using the characters given by check_characters.py and I am still getting the error.
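One possible reading of why the error shows a whole sentence, not confirmed in this thread: `df.apply(..., axis=1)` hands the function one row at a time, whereas the modified line hands it the whole DataFrame, so `df['transcript']` would be a column of sentences instead of a single string. The toy encoder below is hypothetical and is not DeepSpeech's code. A plain list stands in for the pandas column.

```python
def encode(text, label_of):
    """Toy encoder: map each item obtained by iterating over text to a label."""
    return [label_of[item] for item in text]


alphabet = "abcdefghijklmnopqrstuvwxyz "
label_of = {ch: i for i, ch in enumerate(alphabet)}

# One transcript (a string): iteration yields single characters.
print(encode("show me", label_of))

# A whole column (a list standing in for a pandas Series): iteration yields
# whole transcripts, and a sentence is not a key of label_of.
column = ["show me todays vitals", "hello"]
try:
    encode(column, label_of)
except KeyError as err:
    missing = err.args[0]
    print("missing key:", repr(missing))
```

If that is what happens, the alphabet file is fine and restoring the row-wise `df.apply` call would be the thing to try first.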

Can you check whether there is whitespace after the characters?
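Trailing whitespace is hard to see in an editor, so a small check can help. This is a minimal sketch. `suspicious_lines` is a hypothetical helper that assumes a line holding exactly one space is the intended space label and that lines starting with `#` are comments.

```python
def suspicious_lines(text):
    """Return (line_number, repr(line)) for lines with stray whitespace."""
    found = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Skip the space label, empty lines and comment lines.
        if line == " " or line == "" or line.startswith("#"):
            continue
        # Anything that changes when stripped has leading or trailing
        # whitespace, including a "\r" left over from Windows line endings.
        if line != line.strip():
            found.append((number, repr(line)))
    return found


sample = "a\nb \n \nc\r\nd\n"
for number, shown in suspicious_lines(sample):
    print(number, shown)
```

To run it on a real file, read the file with `open(path, encoding="utf-8", newline="")` so that `\r` characters are not silently translated away.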