Cleaning Transcript Files (Invalid label when building trie)


#1

Hello,

Similar to issue #1083 on GitHub, I encountered

Invalid label <symbol>
Aborted (core dumped)

when running

path/to/folder/generate_trie alphabet.txt transcript.binary transcript.txt trie

I fixed the above error message for many symbols and characters I didn’t think were in the transcript by simply adding them to my alphabet.txt file. This is a temporary solution before I go back and figure out where the symbols are.

However, now I’ve encountered the following:

Invalid label
Aborted (core dumped)

when running

path/to/folder/generate_trie alphabet.txt transcript.binary transcript.txt trie

In other words, it seems like the invalid label is empty. I added an “empty” to alphabet.txt by just creating a line with two spaces and nothing else, but that didn’t seem to help.

It seems like the problem is that my transcript contains some unknown character that triggers the error raised in the alphabet.h file (seen here at https://github.com/mozilla/DeepSpeech/blob/master/native_client/alphabet.h) at line 50.

Any suggestions as to how I can debug this on my end?


#2

I debugged it. It turns out that an easier approach than trying to get print statements to work is to delete half of the transcript, rerun generate_trie, and repeat on whichever half still fails until you find the offending line of the transcript.

The odd symbol turned out to be some weird encoded period that didn’t work (so it showed up as a square box with some letters/numbers in it).
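For anyone who would rather not bisect by hand, here is a minimal sketch that scans a transcript for characters missing from the alphabet and reports where they are. It rests on assumptions that are not confirmed in this thread: that alphabet.txt holds one label per line and that lines starting with '#' are comments. The function names load_alphabet and find_invalid are made up for this example, and the ideographic full stop in the demo is only a stand-in for the odd period described above.

```python
import unicodedata


def load_alphabet(lines):
    """Collect labels from alphabet.txt-style lines (one label per line).

    Assumption: lines starting with '#' are comments. Only the line ending
    is stripped, so a line holding a single space yields the label ' '.
    """
    labels = set()
    for line in lines:
        label = line.rstrip("\r\n")
        if label and not label.startswith("#"):
            labels.add(label)
    return labels


def find_invalid(transcript_lines, labels):
    """Yield (line, column, char, code point, Unicode name) for every
    character that is not in labels. Line and column numbers start at 1."""
    for lineno, line in enumerate(transcript_lines, start=1):
        for col, ch in enumerate(line.rstrip("\r\n"), start=1):
            if ch not in labels:
                name = unicodedata.name(ch, "<unnamed>")
                yield (lineno, col, ch, "U+%04X" % ord(ch), name)


# In-memory demo; with real files, pass open(path, encoding="utf-8") instead.
alphabet = load_alphabet(["# comment\n", " \n", "a\n", "b\n", "c\n", "'\n"])
transcript = ["a cab\n", "abc\u3002\n"]
for hit in find_invalid(transcript, alphabet):
    print(hit)
```

Printing the code point and Unicode name matters here because the character itself may render as a box or as nothing at all.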


(Shanehmi) #3

Dear all, I have some issues when using Mozilla. Sometimes different sites do not work properly, like Facebook and other social sites. Even when I try to open other sites like Techriation, Mozilla cannot open them on the first attempt. Could you kindly tell me the solution?
Should I reinstall Mozilla or not?


#4

Hello Dear,

This forum is particularly for DeepSpeech, a Mozilla-made deep neural network for speech recognition.

However, reinstalling Firefox should be simple enough. Perhaps you could also explain what was wrong with the sites, your computer’s OS, and how you are installing Firefox?


(Phanthanhlong7695) #5

What should I do to create the trie? -.-
Help me.


(P Holetzky) #6

This tutorial has a section that describes the TRIE creation. What part are you stuck at?

Short version: you use the generate_trie for trie generation in the native_client folder of the DeepSpeech project, using alphabet.txt, lm.binary, vocabulary.txt as input.

You can download Mozilla's model to get an alphabet.txt file for reference.
lm.binary is built with the KenLM tools.
vocabulary.txt is a file with your wave file transcripts.

(According to my understanding)


(Phanthanhlong7695) #7

I know, but when I run:
./generate_trie data/alphabet.txt data/lm.binary data/vocabulary.txt data/trie

=> error:

Invalid label K

Aborted (core dumped)


help me

#8

Hello,

I believe your problem is the capital letter “K”. The default alphabet.txt only contains the lowercase letters (a to z), an apostrophe ('), and the space. Thus, you need to remove the capital K from your vocabulary.txt file. There are likely other problems as well, so make sure every character in your vocabulary.txt file is a-z, ', or a space.

I actually used the
Invalid label: <thing>
error report to help me figure out which unaccepted symbols were in my vocabulary.txt. From there, I went into a scripting language and changed all of the unaccepted symbols to acceptable symbols (changed K to k, changed + to plus, etc.).
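A rough sketch of that kind of replacement pass, in Python. The REPLACEMENTS table is hypothetical apart from the + to plus example above, and ALLOWED assumes the a-z, apostrophe, and space alphabet described in this post. Anything still left over after normalization is reported rather than silently dropped, so you can decide how to handle it.

```python
import re

# Hypothetical replacement table: "+" comes from the example above, "&" is
# only an illustration. Extend it with whatever generate_trie reports.
REPLACEMENTS = {"+": " plus ", "&": " and "}

# Assumed alphabet: lowercase a-z, apostrophe, and space.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")


def normalize_line(line):
    """Lowercase the line, spell out known symbols, collapse whitespace."""
    line = line.lower()
    for symbol, spoken in REPLACEMENTS.items():
        line = line.replace(symbol, spoken)
    return re.sub(r"\s+", " ", line).strip()


def leftover_symbols(line):
    """Return the sorted characters in line that are still not allowed."""
    return sorted(set(line) - ALLOWED)


cleaned = normalize_line("K + 2 is OK")
print(cleaned)
print(leftover_symbols(cleaned))
```

In the demo the digit 2 survives normalization and shows up in the leftover list, which is the cue to add another replacement (or drop the line).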


(Phanthanhlong7695) #9

Thanks for your help. :smiley:

On Jan 27, 2018, 12:07 PM, “DJ-Hay” discourse@mozilla-community.org wrote:


(Phanthanhlong7695) #10

Can I use that for Vietnamese?
It still gives an error:
Invalid label

Aborted (core dumped)


#11

Hello,
So whenever it says

Invalid label <blank>

it means there is still some non a-z, ', or space symbol in your transcript. You indicate Vietnamese, so I’m assuming that the code is trying to tell you the Vietnamese symbol is invalid, but the computer isn’t easily able to display that Vietnamese symbol using the standard print function.
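If the terminal cannot display the symbol, one option is to list every distinct character in the transcript with its code point and Unicode name. The sketch below does that after normalizing to NFC, so that a letter typed as base character plus combining accent counts the same as the precomposed letter. Whether generate_trie itself cares about normalization is not established in this thread; the assumption is only that alphabet.txt and the transcript should use the same form. The function names and sample strings are invented for the example.

```python
import unicodedata
from collections import Counter


def char_inventory(lines):
    """Count each distinct character after lowercasing and NFC normalization."""
    counts = Counter()
    for line in lines:
        text = unicodedata.normalize("NFC", line.rstrip("\r\n").lower())
        counts.update(text)
    return counts


def describe(ch):
    """Return the code point and Unicode name, readable even if ch is invisible."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


# Demo: the second line spells the accented letter as 'a' + combining grave
# accent and hides a zero width space at the end.
sample = ["Xin ch\u00e0o\n", "xin cha\u0300o\u200b\n"]
for ch, n in sorted(char_inventory(sample).items()):
    print(repr(ch), n, describe(ch))
```

Characters that appear in this inventory but not in alphabet.txt are the candidates for the blank "Invalid label" message.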


(P Holetzky) #12

I found a cleanup script somewhere in the DeepSpeech issues section and modified it a little for the German alphabet. It basically goes through every line/word/character, converts it to lower case and removes lines that contain invalid characters. Maybe this helps you as well, if you modify it for your needs?

Just copy the characters from your alphabet.txt into the labels string and change the infile path to your vocabulary.txt.

# -*- coding: utf-8 -*-
# Python 3 version: lowercases every word and drops any line that
# contains a character missing from `labels`.

infile = "./vocabulary.txt"
outfile = "./vocabulary_cleaned.txt"

# Copy the characters from your alphabet.txt here. Spaces are handled by
# split() below, so they do not need to be listed.
labels = set("abcdefghijklmnopqrstuvwxyz-äöüß")

with open(infile, "r", encoding="utf-8") as fi, \
        open(outfile, "w", encoding="utf-8") as fo:
    for line in fi:
        line_ok = True
        words = []
        for wd in line.split():
            lowered = wd.lower()
            if all(c in labels for c in lowered):
                words.append(lowered)
            else:
                # Report the offending word and mark the line for removal.
                print(wd, lowered)
                line_ok = False
        if line_ok:
            fo.write(" ".join(words) + "\n")

(Jageshmaharjan) #13

You don't have the label “K” in your alphabet.txt file.