I am trying to create my own language model with Mozilla DeepSpeech, but when I start the process it stops almost immediately with the following error:
KeyError: "ERROR: Your transcripts contain characters (e.g. 'c') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."
First, I have checked my alphabet.txt many times; it is located in a different path from the one displayed above. Both files contain the 'c' character, along with the other characters present in my dataset transcriptions.
Second, why is it looking in data/alphabet.txt even though I explicitly tell it to read this file from another location?
Do not hesitate to ask me for more information, code snippets, etc.
Thanks in advance for your help!
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
2
Are you sure you are passing this alphabet everywhere? If some component still relies on the default data/alphabet.txt, you would get tricked.
Which language are you working on?
There's mostly nothing we can do with that report except ask you to re-check. This code is well exercised now, and if it's outputting that error, then you really have a mismatch between your dataset and your alphabet.
Do you get the same alphabet generated with python util/check_characters.py --csv-files [...] --alphabet-format ?
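For reference, the comparison behind this kind of check can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of util/check_characters.py: the function names are hypothetical, and it assumes the CSV files have a `transcript` column and that the alphabet file holds one label per line, with `#` starting a comment and `\#` standing for a literal `#`.

```python
import csv


def read_alphabet(path):
    """Return the set of labels listed in an alphabet file (one per line)."""
    labels = set()
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            if line.startswith("\\#"):
                labels.add("#")          # escaped '#' used as a label
            elif line.startswith("#"):
                continue                 # comment line
            else:
                label = line.rstrip("\r\n")
                if label:                # skip completely empty lines
                    labels.add(label)
    return labels


def transcript_characters(csv_paths):
    """Collect every character used in the 'transcript' column (assumed name)."""
    chars = set()
    for path in csv_paths:
        with open(path, encoding="utf-8", newline="") as fin:
            for row in csv.DictReader(fin):
                chars.update(row["transcript"])
    return chars


def missing_characters(csv_paths, alphabet_path):
    """Characters present in the transcripts but absent from the alphabet."""
    return sorted(transcript_characters(csv_paths) - read_alphabet(alphabet_path))
```

Calling `missing_characters(["train.csv", "dev.csv", "test.csv"], "alphabet.txt")` with your own paths returns an empty list when the alphabet file that was actually read covers the dataset.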
That is why I replaced data/alphabet.txt with a copy of my own file, but it does not seem to change anything.
Sorry, I am creating a generic French language model.
I thought the same, but even after hours of searching I cannot find where I did something wrong.
I ran the command you gave me and copied/pasted the alphabet output into both the default alphabet and my own, but the same error appears again. Maybe I should re-install DeepSpeech and the other modules.
lissyx
At that point, I guess you really need to explain what you do from scratch / how you run things.
There's no "install deepspeech" involved in the training process, for example.
lissyx
You should not need sudo.
I'm also unsure about your paths: they contain a lot of backslashes as well as spaces. Maybe that is breaking some of the TensorFlow flag-parsing logic, so it does not pick up your path?
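To illustrate the concern about spaces: a POSIX shell splits an unquoted path on whitespace, so a flag would only receive the first piece of it. Python's `shlex` module applies similar splitting rules, which makes the effect easy to see. This is a small sketch under assumptions: the path is hypothetical, and `--alphabet_config_path` is assumed to be the flag used to point at the alphabet file.

```python
import shlex

# Hypothetical shared-folder path containing a space.
path = "/media/shared/my data/alphabet.txt"

# Unquoted: the path is broken into two separate arguments.
unquoted = shlex.split("--alphabet_config_path " + path)

# Quoted with shlex.quote: the path survives as a single argument.
quoted = shlex.split("--alphabet_config_path " + shlex.quote(path))

print(unquoted)
print(quoted)
```

In the unquoted case the flag would be given only `/media/shared/my`, and the rest of the path would arrive as a stray argument.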
I never thought that would be a problem, but yes, I will try with more conventional paths, you're right.
So I am working on an Ubuntu 18.04.3 virtual machine, with the dataset in a shared folder so that I can work with it easily from both my host and guest machines.
I need the sudo command because I am targeting the dataset located in my shared folder, seen by Ubuntu as an external drive.
Looks like the last line is not an empty line. I usually start with this alphabet file and add characters later. It has 33 lines:
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
# The last (non-comment) line needs to end with a newline.
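The note about the trailing newline matters if the loader strips the last character of every line unconditionally. Here is a minimal sketch of such a loader, assuming it treats the final character of each non-comment line as the line ending; the function name is hypothetical and this is not the DeepSpeech source.

```python
def load_labels(text):
    """Parse alphabet text into a list of labels, one per non-comment line."""
    labels = []
    for line in text.splitlines(keepends=True):
        if line.startswith("\\#"):
            line = "#\n"        # escaped '#' becomes the label '#'
        elif line.startswith("#"):
            continue            # skip comment lines
        labels.append(line[:-1])  # assume the last character is the newline
    return labels


print(load_labels("# comment\na\nb\nc\n"))  # ['a', 'b', 'c']
print(load_labels("# comment\na\nb\nc"))    # ['a', 'b', '']
```

Without the final newline, the last label loses its only character, so that character would be reported as missing even though it is visibly present in the file.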
I checked again and again, but the alphabet seems OK: there is no formatting issue, the newline is present at the end of the file, and the letter c is in first position! But the issue remains the same.
lissyx
Even if you generate the alphabet from util/check_characters.py? Please don't rely on any copy/paste procedure, just pipe that to a file. There's no reason it would not work, unless there is something really odd in your dataset.
lissyx
I already thought of that and retyped the letters myself without copy/pasting (via nano/gedit, or even outside the VM with VS Code), but it didn't help at all.
I am getting a similar error, except that instead of a single letter the message shows a whole sentence. I don't know why it is looking at a whole sentence. Can anyone help?
My alphabet.txt contains all the letters a-z and space.
Here is the error I'm getting:
"ERROR: Your transcripts contain characters (e.g. 'show me todays vitals') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."
Ask for any details you think I need to provide to help resolve the issue.