Format of Vocabulary file for creating trie

jahir · October 27, 2018, 3:25pm

The first 20 lines of the supplied vocab.txt file in the DeepSpeech/data/lm folder are:

a
a''s
a''t
a'a
a'ad
a'ade
a'ain't
a'al
a'am
a'an
a'ana
a'andy
a'ane
a'ant
a'arf
a'b'c'd
a'b'cd
a'b'd'd'c'a
a'b'ilin

What is the purpose of the apostrophes here? More precisely, what should each line of the vocabulary file contain? Should each line simply contain one word that appears in the language corpus, or is any special formatting required?

reuben · October 27, 2018, 5:16pm

data/lm/README.md explains how it was created. It’s just the lowercase words from the same corpus that was used to create the language model.

jahir · October 27, 2018, 7:24pm

I looked at the README.md and generated my vocabulary file using the given command. But I was somewhat confused by the apostrophes in the middle of words in the provided data/lm/vocab.txt file.
My language corpus does not contain any apostrophes. So if my vocabulary file contains just the unique words that appear in the language corpus, that would be okay, right?

reuben · October 27, 2018, 8:27pm

Yes.

(Apparently replies need to be 20 characters or longer)
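
For anyone landing here later, a minimal sketch of what the answers above describe: one lowercase word per line, taken from the same corpus as the language model, with no other formatting. This is not the command from data/lm/README.md. The function name `build_vocab` is hypothetical, and splitting on whitespace and sorting the output are assumptions made for illustration.

```python
def build_vocab(corpus_text):
    """Return the unique lowercase words of a corpus, sorted.

    Assumption: words are separated by whitespace. Apostrophes are
    not treated specially; they stay part of whatever word they
    appear in.
    """
    words = set()
    for line in corpus_text.splitlines():
        # Lowercase first, then split on any run of whitespace.
        words.update(line.lower().split())
    return sorted(words)


if __name__ == "__main__":
    sample = "The cat sat\nthe cat's hat\n"
    # Write one word per line, as in the supplied vocab.txt.
    for word in build_vocab(sample):
        print(word)
```

Under these assumptions an apostrophe only shows up in the vocabulary if it occurs inside a token of the corpus, so a corpus without apostrophes yields a vocabulary without them.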