Can we use DeepSpeech for Vietnamese Speech To Text?

phanthanhlong7695 · February 8, 2018, 9:42am

I have alphabet.txt and vocabulary.txt, and I used KenLM to create lm.binary.
But I cannot create the trie file.

lissyx · February 8, 2018, 10:51am

Thanks, but you are spamming everywhere with your question, and not answering when I ask for details: https://github.com/mozilla/DeepSpeech/issues/1202#issuecomment-362543236

So please answer and stop spamming everywhere.

phanthanhlong7695 · February 8, 2018, 3:03pm

I am so sorry, this is my fault.
I really need help.
I am building a Vietnamese language model.
I used KenLM to create lm.binary, and that worked, but I cannot create the trie file.
Can you help me?
Again, I am very sorry about that.

lissyx · February 8, 2018, 3:17pm

Stop being sorry and just document exactly what issue you have with ./generate_trie. In the issue you opened, it was obviously not being built correctly. Please document your steps exactly so we can help you.

phanthanhlong7695 · February 9, 2018, 2:43am

First, I downloaded generate_trie from https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md.
Second, I used KenLM to create lm.binary.
Then I ran this command:
./generate_trie .data/alphabet.txt .data/lm.binary .data/vocabulary.txt .data/trie
Error: bash: ./generate_trie: cannot execute binary file: Exec format error
I could not fix that. Can you help me with it?
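
For context, `Exec format error` generally means the kernel did not recognise the file as a program it can run on this machine: it was built for another architecture or operating system, or it is not an executable at all (for example, a saved web page). Below is a minimal Python sketch for checking this, assuming a Linux system where executables are ELF files. `describe_binary` and the machine table are illustrative names, not part of DeepSpeech.

```python
import struct

# A few ELF e_machine ids; the table is deliberately incomplete.
ELF_MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def describe_binary(header):
    """Describe a file from its first 20 bytes (hypothetical helper)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not an ELF executable"
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return "ELF for " + ELF_MACHINES.get(machine, "machine id %d" % machine)


# Synthetic headers, so the example runs without any file on disk.
fake_x86_64 = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)
print(describe_binary(fake_x86_64))
print(describe_binary(b"<!DOCTYPE html><html><head>"))
```

On a real download, one would pass the first 20 bytes of the file, e.g. `open("generate_trie", "rb").read(20)`, and compare the result with `platform.machine()`.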

lissyx · February 9, 2018, 7:24am

You need to download the binaries: https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz

It contains generate_trie

phanthanhlong7695 · February 9, 2018, 9:12am

./generate_trie data/alphabet.txt data/lm.binary data/vocabulary.txt data/trie

error: Invalid label
Aborted (core dumped)

I think maybe DeepSpeech does not support the Vietnamese language.

lissyx · February 9, 2018, 11:12am

Invalid label means you are missing something. Did you make any changes to alphabet.txt?

elpimous_robot · February 9, 2018, 9:46pm

Hello,
Perhaps you tried to simplify alphabet.txt? Bad idea!
Here is an example of alphabet.txt:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
# FOR FRENCH LIMITED CORPUS - JUST WORKING IN SOUND PERCEPTION - A BOT WILL ANALYSE RESULTS
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v

Don't forget the blank line before "a" (it holds the space character)…
Good luck
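
The rules stated in that file's own comments (one label per line, `#` starts a comment, `\#` escapes it) can be sketched in a few lines of Python. This is only an illustration of those rules, not DeepSpeech's actual loader. `load_alphabet` is a hypothetical name, and skipping completely empty lines is an assumption made here.

```python
import io


def load_alphabet(lines):
    """Map each label to its numeric index, following the file's comments."""
    labels = []
    for line in lines:
        line = line.rstrip("\r\n")  # strip only the line ending, keep a lone space
        if line.startswith("\\#"):
            line = "#"              # escaped '#': use '#' itself as a label
        elif line.startswith("#"):
            continue                # comment line
        if line == "":
            continue                # assumption: an empty line carries no label
        labels.append(line)
    return {label: index for index, label in enumerate(labels)}


sample = "# a comment\n \na\nb\n\\#\n"
alphabet = load_alphabet(io.StringIO(sample))
print(alphabet)
```

Under these rules, if an editor strips the trailing space from the "blank" line, the space label silently disappears from the alphabet.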

phanthanhlong7695 · February 10, 2018, 3:06am

Here is my alphabet:

a
ă
â
b
c
d
đ
e
ê
g
h
i
k
l
m
n
o
ô
ơ
p
q
r
s
t
u
ư
v
x
y
à
á
ả
ã
ạ
ằ
ắ
ẳ
ẵ
ặ
ầ
ấ
ẩ
ẫ
ậ
è
é
ẻ
ẽ
ẹ
ề
ế
ể
ễ
ệ
ì
í
ỉ
ĩ
ị
ò
ó
ỏ
õ
ọ
ồ
ố
ổ
ỗ
ộ
ờ
ớ
ở
ỡ
ợ
ù
ú
ủ
ũ
ụ
ừ
ứ
ử
ữ
ự
ỳ
ý
ỷ
ỹ
ỵ

lissyx · February 10, 2018, 10:12am

Okay, so it seems like you have properly added the Vietnamese characters. Can you make sure the alphabet includes every character used in the vocabulary?

I remember similar issues caused by characters encoded in UTF-8 on one side and in simple ASCII on the other, yet representing the same symbol.
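
One way to run the check suggested above is to list every character that appears in the vocabulary but not in the alphabet. The sketch below is a hedged illustration. `missing_characters` and `describe` are hypothetical helpers, and the inline strings stand in for the contents of alphabet.txt and vocabulary.txt.

```python
import unicodedata


def missing_characters(alphabet_labels, vocabulary_text):
    """Return vocabulary characters absent from the alphabet, by code point."""
    used = set(vocabulary_text) - {"\n", "\r"}  # line endings are not labels
    return sorted(used - set(alphabet_labels))


def describe(ch):
    """Show the code point and Unicode name, since look-alikes print the same."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


# Stand-in data: the alphabet has space, 'a' and 'm' only.
alphabet_labels = [" ", "a", "m"]
vocabulary_text = "ma m\u00e0 5\n"  # "ma mà 5"

missing = missing_characters(alphabet_labels, vocabulary_text)
for ch in missing:
    print(describe(ch))
```

For real files, the text would be read with an explicit encoding, for example `io.open("data/vocabulary.txt", encoding="utf-8").read()`. Printing code points rather than glyphs helps when two characters look identical on screen.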

lissyx · February 10, 2018, 11:16am

@phanthanhlong7695 See this issue: https://github.com/mozilla/DeepSpeech/issues/1107. That kind of tooling would help find the problem in your case, for example.

elpimous_robot · February 12, 2018, 8:03pm

@phanthanhlong7695,
Hi.
Be sure to save your file as UTF-8.
Example: gedit -> Save As -> bottom left: encoding UTF-8 (and not ISO-xxx).
I had problems with auto save (bad encoding).
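
Whether a file really was saved as UTF-8 can also be checked outside the editor. The following Python sketch works on raw bytes. `check_utf8` is a hypothetical helper, and flagging a byte order mark as a problem is an assumption here (a BOM would end up in front of the first character of the file), not documented DeepSpeech behaviour.

```python
import codecs


def check_utf8(data):
    """Return (ok, message) for raw bytes read from a file."""
    if data.startswith(codecs.BOM_UTF8):
        # Assumption: a BOM is unwanted because it precedes the first label/word.
        return False, "starts with a UTF-8 byte order mark"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, "not valid UTF-8 at byte %d" % exc.start
    return True, "valid UTF-8"


good = "\u0111".encode("utf-8")       # "đ" saved as UTF-8
legacy = "m\u00e9".encode("latin-1")  # "mé" saved as ISO-8859-1

print(check_utf8(good))
print(check_utf8(legacy))
print(check_utf8(codecs.BOM_UTF8 + good))
```

On a real file, the bytes would come from `open("data/alphabet.txt", "rb").read()`.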

jaredoptimus1 · February 12, 2018, 8:38pm

I am having the same issue. I am trying to train DeepSpeech for the Bangla language.
Invalid label ও
Aborted (core dumped)
It seems that this error occurs while reading the vocabulary.txt file, because the character ও is the first character of vocabulary.txt.

lissyx · February 12, 2018, 8:45pm

Please check the issue linked above, and what I already explained earlier: you need to find which characters in your vocabulary.txt are missing from the alphabet.

phanthanhlong7695 · February 21, 2018, 3:35am

@jaredoptimus1 did you fix that?
I looked at that issue https://github.com/mozilla/DeepSpeech/issues/1107 but found nothing there.

lissyx · February 21, 2018, 10:05am

At the risk of being repetitive, this issue is just a mismatch between the characters used in the training data and those in the alphabet. You should just add the missing ones.

phanthanhlong7695 · February 21, 2018, 10:31am

@lissyx
For example: ma, mà, má, mả, mã, mạ
`, ~, ?, . : I think these marks cannot stand alone; they must go with a word, right?
I think so.

lissyx · February 21, 2018, 10:42am

As long as you add those combined characters using proper Unicode, it should work. Make sure you use the same Unicode form in the alphabet and in the transcriptions. If you still get an error about an invalid label, it means we might have something bogus somewhere in our UTF-8 handling, and we will need more info on how to reproduce it.
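
To illustrate what "the same Unicode" can mean for Vietnamese: a letter with a tone mark can be stored as one precomposed code point or as a base letter plus a combining mark. The two render identically but do not compare equal. The sketch below uses Python's standard `unicodedata` module. `to_nfc` is a hypothetical helper name, and whether this is the actual cause in this thread is not established.

```python
import unicodedata

composed = "m\u1ea3"     # "mả" with ả as a single precomposed code point
decomposed = "ma\u0309"  # "mả" as 'a' followed by COMBINING HOOK ABOVE


def to_nfc(text):
    """Normalize text to the composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


print(composed == decomposed)             # False: different code point sequences
print([hex(ord(c)) for c in decomposed])  # the combining mark is its own code point
print(to_nfc(decomposed) == composed)     # True: same text after normalization
```

Applying the same normalization to the alphabet and to the transcriptions before building the language model is one way to keep the two consistent.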

phanthanhlong7695 · February 22, 2018, 2:48am

@lissyx
I saved all the files as UTF-8. It still fails with the invalid label error.