Vocabulary.txt content

dys · October 1, 2020, 8:55am

I’m a bit confused about the vocabulary.txt.
Is it just one transcript per line? Does it need to match the wave files in some order, does it only need to contain every transcript of the wave files in no particular order, or is it just a bunch of text used to get the statistics of letters?
Greetings

othiele · October 1, 2020, 9:03am

The vocabulary is used for the language model and has nothing to do with the wavs. Usually you take as much written material in your language as you can get. And it is always about the statistics of words, not letters.
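
To make "statistics of words" concrete, here is a minimal sketch in plain Python. It is not how DeepSpeech builds its language model; it only illustrates the kind of input involved: ordinary sentences, one per line, counted word by word. The function name `word_counts` and the sample sentences are made up for this illustration.

```python
from collections import Counter


def word_counts(lines):
    """Count how often each whitespace-separated word occurs across all lines."""
    counts = Counter()
    for line in lines:
        # Lowercase so that "The" and "the" are counted as the same word.
        counts.update(line.lower().split())
    return counts


# Hypothetical stand-in for the contents of a vocabulary.txt:
# plain written sentences, one per line, unrelated to any wav file.
sample_lines = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

counts = word_counts(sample_lines)
print(counts.most_common(3))
```

The order of the lines does not affect the counts, and nothing here refers to the audio, which is why the text does not have to line up with the wavs.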

See this repo for an example of how to do that. Its model generation is still based on 0.6, but you can see what type of data to use. For newer versions (>0.7) you build a scorer as described in the docs.
