Can we use DeepSpeech for Vietnamese Speech To Text?


(Phanthanhlong7695) #1

I have alphabet.txt and vocabulary.txt, and I used KenLM to create lm.binary.
But I cannot create the trie file.


(Lissyx) #2

Thanks, but you are spamming everywhere with your question, and not answering when I ask for details: https://github.com/mozilla/DeepSpeech/issues/1202#issuecomment-362543236

So please answer and stop spamming everywhere.


(Phanthanhlong7695) #3

I am so, so sorry.
I really need help.
This is my fault.
I am building a Vietnamese language model.
I'm using KenLM to create lm.binary, and that works, but I cannot create the trie file.
Can you help me?
I am very, very sorry about that.


(Lissyx) #4

Stop being sorry and just document what issue exactly you have with ./generate_trie. In the issue you opened, it was obviously not being built correctly. Please document exactly your steps so we can help you :slight_smile:


(Phanthanhlong7695) #5

First, I downloaded generate_trie from https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md.
Second, I used KenLM to create lm.binary.
Then I ran this command:
./generate_trie .data/alphabet.txt .data/lm.binary .data/vocabulary.txt .data/trie
Error: bash: ./generate_trie: cannot execute binary file: Exec format error
I cannot fix that. Can you help me with it?


(Lissyx) #6

You need to download the binaries: https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz

It contains generate_trie


(Phanthanhlong7695) #7

./generate_trie data/alphabet.txt data/lm.binary data/vocabulary.txt data/trie

error: Invalid label 
Aborted (core dumped)

I think maybe DeepSpeech does not support the Vietnamese language :cold_sweat::cold_sweat::cold_sweat:


(Lissyx) #8

Invalid label means you lack something. Did you make any changes to alphabet.txt?


(Vincent Foucault) #9

Hello,
Perhaps you tried to simplify alphabet.txt! Bad idea!
Here is an example of alphabet.txt:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
# FOR FRENCH LIMITED CORPUS - JUST WORKING IN SOUND PERCEPTION - A BOT WILL ANALYSE RESULTS
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v

Don’t forget the blank line…
Good luck


(Phanthanhlong7695) #10

My alphabet is here:

a
ă
â
b
c
d
đ
e
ê
g
h
i
k
l
m
n
o
ô
ơ
p
q
r
s
t
u
ư
v
x
y
à
á

ã











è
é




ế



ì
í

ĩ

ò
ó

õ











ù
ú

ũ







ý



(Lissyx) #11

Okay, so it seems like you have properly added Vietnamese characters. Can you make sure the alphabet includes every character used in the vocabulary?

I remember similar issues because of characters encoded in UTF-8 on one side, and in simple ASCII on the other, yet representing the same symbol :slight_smile:


(Lissyx) #12

@phanthanhlong7695 See this issue: https://github.com/mozilla/DeepSpeech/issues/1107. That kind of tooling would help find the problem in your case, for example.


(Vincent Foucault) #13

@phanthanhlong7695,
Hi.
Be sure to save your file as UTF-8.
For example, in gedit: Save As -> bottom left: encoding UTF-8 (and not ISO-xxx).
I had problems with auto save (bad encoding).
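
A quick way to double-check the encoding outside the editor is a small Python sketch like the one below. It only assumes the file should be plain UTF-8 without a byte order mark; the name check_utf8 is made up for illustration and is not part of DeepSpeech.

```python
import codecs


def check_utf8(data):
    """Return a list of encoding problems found in raw file bytes (empty if none)."""
    problems = []
    # Some editors prepend a byte order mark, which then sits in front of the first line.
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 byte order mark")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 at byte {err.start}")
    return problems


# The same text saved with two different encodings.
print(check_utf8("\u00e9\n".encode("utf-8")))    # []
print(check_utf8("\u00e9\n".encode("latin-1")))  # ['not valid UTF-8 at byte 0']
```

To check a real file, pass the raw bytes, for example open("data/alphabet.txt", "rb").read().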


(Jaredoptimus1) #14

I am having the same issue. I am trying to train DeepSpeech for the Bangla language.
Invalid label ও
Aborted (core dumped)
It seems that this error occurs while reading the vocabulary.txt file, because ও is the first character of vocabulary.txt.


(Lissyx) #15

Please check the issue linked above, and what I already explained earlier: you need to find what is missing in your vocabulary.txt.


(Phanthanhlong7695) #16

@jaredoptimus1 did you fix that?
I looked at the issue https://github.com/mozilla/DeepSpeech/issues/1107 but found nothing there.


(Lissyx) #17

At the risk of being repetitive, this issue is just a mismatch between the characters used in the training data and those in the alphabet. You should just add the missing ones.
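
One way to find the missing characters is to compare the two files directly. The sketch below is not DeepSpeech tooling: load_alphabet and missing_labels are hypothetical helpers, and the parsing assumes the alphabet.txt rules quoted in post #9 (a line starting with # is a comment, \# is a literal #).

```python
def load_alphabet(lines):
    """Collect labels from alphabet.txt lines, one label per line."""
    labels = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("\\#"):
            labels.add("#")      # escaped '#' used as a label
        elif line.startswith("#"):
            continue             # comment line
        elif line:
            labels.add(line)     # includes the line holding a single space
    return labels


def missing_labels(alphabet, vocabulary_lines):
    """Return the characters used in the vocabulary that are not alphabet labels."""
    used = set()
    for line in vocabulary_lines:
        used.update(line.rstrip("\n"))
    return sorted(used - alphabet)


# Tiny in-memory example; with real files, pass open(path, encoding="utf-8").
alphabet = load_alphabet(["# comment\n", " \n", "a\n", "m\n", "\u00e0\n"])
for ch in missing_labels(alphabet, ["ma m\u00e0 m\u00e1\n"]):
    print(f"U+{ord(ch):04X} {ch!r}")
```

Printing the code point next to the character helps when two characters look identical on screen but are encoded differently.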


(Phanthanhlong7695) #18

@lissyx
For example: ma, mà, má, mả, mã, mạ.
I think the tone marks (`, ~, ?, .) cannot stand alone; they must go with a word?


(Lissyx) #19

As long as you add those combined characters using proper Unicode, it should work. Make sure you use the same Unicode form in both the alphabet and the transcriptions. If you still get an error about an invalid label, it means we might have something bogus somewhere in our UTF-8 handling, and we will need more information on how to reproduce it.
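
To illustrate what "the same Unicode" means: a letter with a tone mark can be stored either as one precomposed code point or as a base letter followed by combining marks, and the two do not compare equal. A minimal Python sketch, assuming you choose NFC (precomposed) for every file; to_nfc is just an illustrative helper name:

```python
import unicodedata

composed = "\u00e0"      # "à" as a single code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

print(composed == decomposed)  # False, although both render as "à"


def to_nfc(text):
    """Normalize text to NFC so each letter plus its marks is one code point where possible."""
    return unicodedata.normalize("NFC", text)


print(to_nfc(decomposed) == composed)  # True
# "e" + circumflex + acute becomes the single code point U+1EBF ("ế").
print(to_nfc("e\u0302\u0301") == "\u1ebf")  # True
```

Applying the same normalization to alphabet.txt, vocabulary.txt and the transcriptions would rule out this kind of mismatch.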


(Phanthanhlong7695) #20

@lissyx
I saved all the files as UTF-8. It still fails with the invalid label error.