Can we use DeepSpeech for Vietnamese Speech To Text?

@lissyx
I saved all the files as UTF-8, but I still get the invalid label error.

You need to find where it comes from: you have some transcription that contains characters that are not in the alphabet :). There’s nothing more we can do for now.

Yeah, thank you for your help.
For example, on Windows: mà
and on Linux: ma`
Maybe that is the error :smiley:
I’m trying to fix this. Thanks for your support.
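
A possible explanation for the difference above, which the thread does not confirm: the same visible syllable can be stored either as a precomposed character (NFC) or as a base letter followed by a combining accent (NFD). If the alphabet uses one form and the transcripts use the other, the combining mark would count as an unknown label. Below is a minimal Python sketch of the check, using only the standard `unicodedata` module; `normalize_text` is a hypothetical helper name, not part of DeepSpeech.

```python
import unicodedata

# Two ways to store the same visible syllable.
composed = "m\u00e0"      # 'm' + precomposed 'à' (NFC form)
decomposed = "ma\u0300"   # 'm' + 'a' + combining grave accent (NFD form)

# They look alike on screen but are different strings.
print(composed == decomposed)
print(len(composed), len(decomposed))


def normalize_text(text, form="NFC"):
    """Return text in one Unicode normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, text)


# After normalization both spellings compare equal.
print(normalize_text(decomposed) == normalize_text(composed))
```

Applying the same normalization to the alphabet, the vocabulary and the transcripts would rule this cause out, whichever form you pick.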

The best I can suggest for this is simply binary search: open your train CSV (if it happens during training), remove the first half of the lines, try to re-run. If it works, then the first half contains the offending character. If not, it’s in the second half, and you restart the process by removing half of the second half, until you get ONE line :slight_smile:
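
The halving procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `find_offending_line` and `has_unknown_label` are hypothetical names, and the predicate here is a stand-in that checks characters against a toy alphabet. In practice the predicate would be "re-run the failing step on this subset and see whether it still fails".

```python
def find_offending_line(lines, fails):
    """Return the index of one line that makes fails() true, or None.

    Assumes fails(subset) is True exactly when the subset contains
    at least one bad line.
    """
    if not fails(lines):
        return None
    lo, hi = 0, len(lines)          # invariant: lines[lo:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lines[lo:mid]):    # bad line is in the first half
            hi = mid
        else:                       # otherwise it must be in the second half
            lo = mid
    return lo


# Stand-in for "the run fails": a toy alphabet, for illustration only.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz ")


def has_unknown_label(lines):
    return any(ch not in ALPHABET for line in lines for ch in line)


transcripts = ["xin chao", "cam on", "ma\u0300", "tam biet"]
bad = find_offending_line(transcripts, has_unknown_label)
print(bad, ascii(transcripts[bad]))
```

With a real CSV, keep the header row in every subset you test, and repeat the search after each fix, since there may be more than one bad line.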

I know, but I cannot create the trie file, so how can I start training? I am still following the instructions. :slight_smile:

Well, you don’t need the trie file for training. Worst case, you can just apply the same process with generate_trie.

Does that mean I can delete this:
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie
Is that right?

What is this ? Where is this coming from ?

It comes from here:
DEEPSPEECH/bin/run-alfred.sh

I’m a bit lost now about the state of your system. When do you get the invalid label error ? At training or during trie creation ? Why are you trying to use a trie made for French on Vietnamese data ?

The invalid label error happens during trie creation.

I did not use a trie for French. How do I create a trie for Vietnamese?

We are circling here. You need to create it. Vincent documented it in his thread. If you are hitting the invalid label during its creation, you need to find what is missing in your alphabet.

I am trying :frowning:

Hi, @phanthanhlong7695,

To create the trie file, you need these parts:

  • alphabet.txt
  • lm.binary
  • vocabulary.txt

Invalid label during trie creation: it seems that you have unknown characters in your vocabulary.
A “label” is a character (a letter or a punctuation mark).
Check that all characters in your vocabulary are present in the alphabet.

If not, correct it and restart the whole process.
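
The check described above can be sketched in Python. This is a rough illustration, not DeepSpeech code: `load_alphabet` and `find_unknown_chars` are hypothetical helpers, and the sketch assumes alphabet.txt holds one label per line, with lines starting with `#` treated as comments. It reports each unknown character with the first vocabulary line where it appears.

```python
import unicodedata


def load_alphabet(lines):
    """Collect labels, assuming one label per line and '#' comment lines."""
    labels = set()
    for line in lines:
        label = line.rstrip("\n")
        if not label or label.startswith("#"):
            continue
        labels.add(label)
    return labels


def find_unknown_chars(vocab_lines, alphabet):
    """Map each character missing from the alphabet to its first line number."""
    unknown = {}
    for lineno, line in enumerate(vocab_lines, start=1):
        for ch in line.rstrip("\n"):
            if ch not in alphabet:
                unknown.setdefault(ch, lineno)
    return unknown


# Tiny in-memory stand-ins for alphabet.txt and vocabulary.txt.
alphabet_lines = ["# toy alphabet\n", " \n", "a\n", "m\n", "\u00e0\n"]
vocab_lines = ["m\u00e0 ma\n", "ma\u0300\n"]

alphabet = load_alphabet(alphabet_lines)
unknown = find_unknown_chars(vocab_lines, alphabet)
for ch, lineno in unknown.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')} first seen on line {lineno}")
```

For real files, pass the open file objects instead of the lists, for example `load_alphabet(open("alphabet.txt", encoding="utf-8"))`.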

Well, I cannot do it for you, and I have much other work to perform. I gave you a process to find what is broken. Apply it.

Hi @lissyx and @elpimous_robot. I did everything for an English corpus and it all went well, from creating the language model with the KenLM tool to generating the trie file with generate_trie. My alphabet was in English, and training with DeepSpeech.py later succeeded.

But the story for Chinese characters is different. I have a Chinese corpus in my vocabulary.txt file, and I am using the KenLM tool to generate chinese_lm.binary, but I am stuck at this point. Is there anything I need to do with Unicode?
This is my command to generate the ARPA file with KenLM:

/bin/lmplz -o 3 < ~/Desktop/zh_deepspeech_small/vocabulary.txt > words.arpa --discount_fallback 1

And I got this error:

/home/shenzhen/PycharmProjects/DeepSpeech/kenlm/util/scoped.cc:20 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
Cannot allocate memory for 5104689640 bytes in malloc
Aborted (core dumped)

My vocabulary file consists of Chinese characters (not pinyin),
e.g.:

七十 年代 末 我 外出 求学 母亲 叮咛 我 吃饭 要 细嚼慢咽 学习 要 深 钻 细 研
陈云 同志 同时 要求 干部 认真学习 业务 精通 业务 向 一切 业务 内 行人 学习

Hello,
Look at the KenLM website; there is a command-line option to allocate space/RAM for building…
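
The reply does not say which option is meant. One candidate is `lmplz`'s `-S` (`--memory`) option, which caps the memory used for sorting. The failed allocation above, 5104689640 bytes, is roughly 4.75 GiB. Below is a hedged Python sketch that builds such a command; `build_lmplz_command` and `run_lmplz` are hypothetical helpers, the `2G` value is an arbitrary illustration, and the other arguments are copied from the command in the post.

```python
import subprocess


def build_lmplz_command(lmplz="/bin/lmplz", order=3, memory="2G",
                        extra_args=("--discount_fallback", "1")):
    """Build an lmplz argument list with a memory cap via -S (assumed fix)."""
    return [lmplz, "-o", str(order), "-S", memory, *extra_args]


def run_lmplz(corpus_path, arpa_path, **kwargs):
    """Feed the corpus on stdin and write the ARPA file from stdout."""
    cmd = build_lmplz_command(**kwargs)
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(cmd, stdin=src, stdout=dst, check=True)


# Show the command without running it.
print(" ".join(build_lmplz_command()))

# Size of the allocation that failed in the post, in GiB.
print(round(5104689640 / 2**30, 2))
```

Whether a lower cap is enough depends on the machine's free RAM and the corpus size, so treat the value as something to tune.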

@phanthanhlong7695 Hello, could you give me your email address so we can talk a bit? I am currently also working on speech processing for Vietnamese. Please leave your email here or contact me at huynd@viegrid.com.

Hi sir, I have a question about your Chinese transcript: did you add a space separator between each character? Sorry, I’m not Chinese, but this is for research. Thanks in advance!