Can we use DeepSpeech for Vietnamese Speech To Text?

phanthanhlong7695 · February 22, 2018, 2:48am

@lissyx
I saved all the files as UTF-8, but I still get the invalid label error.

lissyx · February 22, 2018, 7:54am

You need to find where it comes from: you have some transcription that has characters not in the alphabet :). There’s nothing more we can do for now.

phanthanhlong7695 · February 22, 2018, 8:01am

Yes, thank you for your help.
For example, on Windows: mà
and on Linux: ma`
Maybe that is the cause of the error.
I’m trying to fix this. Thanks for your support.

lissyx · February 22, 2018, 8:04am

The best I can suggest for this is simply binary search: open your train CSV (if it happens during training), remove the first half of the lines, and try to re-run. If it works, then the first half contains the offending character. If not, it’s in the second half, and you restart the process by removing half of the second half, until you are down to ONE line.
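
The bisection described above can be sketched in Python. This is only an illustration: `find_offending_line` and `fails` are hypothetical names, and in practice `fails` would write the subset to a temporary CSV (keeping the header row) and re-run training or generate_trie. Here a simple alphabet check stands in for that step, and the sketch assumes the failure is caused by individual lines.

```python
from typing import Callable, Sequence


def find_offending_line(lines: Sequence[str],
                        fails: Callable[[Sequence[str]], bool]) -> int:
    """Return the index of one line that makes `fails` return True."""
    if not fails(lines):
        raise ValueError("the full input does not fail")
    lo, hi = 0, len(lines)  # invariant: lines[lo:hi] contains a bad line
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lines[lo:mid]):
            hi = mid  # the problem is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return lo


# Stand-in failure test (assumption): a subset "fails" when it contains
# a character that is not in this toy alphabet.
toy_alphabet = set("abcdefghijklmnopqrstuvwxyz ")


def has_unknown_character(subset: Sequence[str]) -> bool:
    return any(ch not in toy_alphabet for line in subset for ch in line)


transcripts = ["xin chao", "cam on", "m\u00e0", "tam biet"]
bad_index = find_offending_line(transcripts, has_unknown_character)
```

If several lines are broken, this finds one of them; fix it and run the search again.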

phanthanhlong7695 · February 22, 2018, 8:14am

I know, but I cannot create the trie file, so how can I start training? I am still following the instructions.

lissyx · February 22, 2018, 8:21am

Well, you don’t need the trie file for training. Worst case, you can just apply the same process with generate_trie.

phanthanhlong7695 · February 22, 2018, 8:24am

Does that mean I can delete this:
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie
Is that right?

lissyx · February 22, 2018, 10:05am

What is this? Where is this coming from?

phanthanhlong7695 · February 22, 2018, 10:12am

It comes from here:
DEEPSPEECH/bin/run-alfred.sh

lissyx · February 22, 2018, 10:23am

I’m a bit lost now about the state of your system. When do you get the invalid label error? At training or during trie creation? Why are you trying to use a trie made for French on Vietnamese data?

phanthanhlong7695 · February 22, 2018, 10:30am

The invalid label error happens during trie creation.

phanthanhlong7695 · February 22, 2018, 10:35am

I did not use a trie for French. How do I create a trie for Vietnamese?

lissyx · February 22, 2018, 10:46am

We are going in circles here. You need to create it. Vincent documented it in his thread. If you are hitting the invalid label error during its creation, you need to find what is missing from your alphabet.

phanthanhlong7695 · February 22, 2018, 10:56am

I am trying.

elpimous_robot · February 22, 2018, 11:01am

Hi, @phanthanhlong7695,

To create the trie file, you need three files:

alphabet.txt
lm.binary
vocabulary.txt

Invalid label during trie creation: it seems that you have unknown characters in your vocabulary.
A “label” is a character (a letter or a punctuation mark).
Check that all characters in your vocabulary are present in the alphabet.

If not, correct it and restart the whole process.
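
The check above can be sketched in Python. The function names are hypothetical, and the alphabet format (one label per line, lines starting with `#` treated as comments, `\#` for a literal `#`) is an assumption to verify against your own alphabet.txt. Printing the Unicode name of each missing character also helps with cases like "mà" versus "ma`": the same letter can be stored as one precomposed code point or as a base letter plus a combining accent, and only the form listed in the alphabet will match.

```python
import unicodedata
from typing import Dict, Iterable, Set


def load_alphabet(lines: Iterable[str]) -> Set[str]:
    """Collect labels, one per line (assumed format, see the note above)."""
    labels = set()
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("\\#"):
            labels.add("#")  # escaped hash is a real label
        elif line == "" or line.startswith("#"):
            continue  # blank line or comment
        else:
            labels.add(line)
    return labels


def missing_characters(text_lines: Iterable[str],
                       alphabet: Set[str]) -> Dict[str, int]:
    """Map each character absent from the alphabet to its first line number."""
    missing: Dict[str, int] = {}
    for number, raw in enumerate(text_lines, start=1):
        for ch in raw.rstrip("\r\n"):
            if ch not in alphabet and ch not in missing:
                missing[ch] = number
    return missing


def describe(ch: str) -> str:
    """Show the code point and Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


# Toy data: the alphabet holds the precomposed letter U+00E0, while the
# second vocabulary line uses "a" followed by the combining accent U+0300.
alphabet = load_alphabet(["# toy alphabet", "a", "m", "\u00e0", " "])
vocabulary = ["ma m\u00e0", "ma\u0300"]
missing = missing_characters(vocabulary, alphabet)
for ch, line_number in missing.items():
    print(f"line {line_number}: {describe(ch)}")
```

With real files, pass `open("alphabet.txt", encoding="utf-8")` and `open("vocabulary.txt", encoding="utf-8")` instead of the toy lists. If the report shows combining marks, one possible fix is to normalize both files to the same form, for example with `unicodedata.normalize("NFC", text)`.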

lissyx · February 22, 2018, 11:39am

Well, I cannot do it for you, and I have a lot of other work to do. I gave you a process to find what is broken. Apply it.

jageshmaharjan · May 8, 2018, 8:05am

Hi @lissyx and @elpimous_robot. I did everything with an English corpus and it all went well, from creating the language model with the KenLM tool to generating the trie file with generate_trie. My alphabet was English, and training with DeepSpeech.py then succeeded.

But the story for Chinese characters is different. I have a Chinese corpus in my vocabulary.txt file, and I am using the KenLM tool to generate chinese_lm.binary, but I am stuck at this point. Is there anything I need to do about Unicode?
This is my command to generate the ARPA file with KenLM:

/bin/lmplz -o 3 < ~/Desktop/zh_deepspeech_small/vocabulary.txt > words.arpa --discount_fallback 1

And I got this error:

/home/shenzhen/PycharmProjects/DeepSpeech/kenlm/util/scoped.cc:20 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
Cannot allocate memory for 5104689640 bytes in malloc
Aborted (core dumped)

My vocabulary file consists of Chinese characters (not pinyin),
e.g.:

七十年代末我外出求学母亲叮咛我吃饭要细嚼慢咽学习要深钻细研
陈云同志同时要求干部认真学习业务精通业务向一切业务内行人学习

elpimous_robot · May 8, 2018, 11:17am

Hello,
Look at the KenLM website; there is an option to set the space/RAM used for building…
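
As a rough illustration of that advice: KenLM's lmplz has a `-S` (`--memory`) option that limits the memory used for sorting. Whether this is the exact option meant here is an assumption, and the `2G` value below is arbitrary. A minimal Python sketch that rebuilds the command from the earlier post with a memory limit added (the helper names are hypothetical):

```python
import subprocess
from typing import List


def build_lmplz_command(lmplz: str, order: int, memory: str) -> List[str]:
    """Build the lmplz argument list with an explicit memory budget."""
    return [lmplz, "-o", str(order), "-S", memory, "--discount_fallback", "1"]


def run_lmplz(command: List[str], text_path: str, arpa_path: str) -> None:
    """Feed the corpus on stdin and write the ARPA model from stdout."""
    with open(text_path, "rb") as text, open(arpa_path, "wb") as arpa:
        subprocess.run(command, stdin=text, stdout=arpa, check=True)


# "2G" is an illustrative value; pick one that fits the machine's free RAM.
command = build_lmplz_command("/bin/lmplz", order=3, memory="2G")
```

Calling `run_lmplz(command, "vocabulary.txt", "words.arpa")` would then run the build; the paths are placeholders.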

Duc_Huy_Nguyen · April 11, 2020, 7:18am

@phanthanhlong7695 Hello, could you give me your email so we can talk a bit? I am currently also working on speech processing for Vietnamese. Please leave your email or contact me at huynd@viegrid.com.

MaarufB · January 25, 2022, 6:57am

Hi sir, I have a question about your Chinese transcript: did you add a space separator between each character? Sorry, I’m not Chinese, but this is for research. Thanks in advance!