Where is the vocab.txt file?

sanjay.pandey · April 2, 2019, 9:20am

Hello @lissyx
I cannot find the vocab file under the data folder. I want to add some other English words to the language model for a domain-specific purpose. In parallel, I am training the model on the Common Voice dataset (500 hours, English), but if my purpose can be served just by adding to the language model, that would be very helpful. I am doing this for the restaurant domain, so I need to collect phone numbers and take orders for items on the menu. Please help!

lissyx · April 2, 2019, 9:22am

Please read data/lm/README.md.

sanjay.pandey · April 2, 2019, 9:35am

Thank you so much @lissyx for the prompt reply; I didn't expect such speedy help.

I read README.md, and what I understand is that I need to add more domain-specific sentences to librispeech-lm-norm.txt, right? Since lm.binary and the trie are created from it?
Also, I am training on 500 hours of Common Voice data. I started from the DeepSpeech 0.4 checkpoint with all the default parameters and epoch set to -30. Training started at the 4th epoch with a loss of only 28, which was a good thing, but after the 5th epoch the loss increased to 36 and after the 6th it went to 42, so I stopped training and reduced the learning rate from 0.0001 to 0.00001. Am I doing this correctly, or am I missing something?

lissyx · April 2, 2019, 9:38am

Yes

I can’t tell; please read the other topics about that, and share more details on your training. Please remember that Common Voice is still in its early stages, and some of it was already used to train 0.4.1, so you might be doing it wrong by re-training.

sanjay.pandey · April 2, 2019, 9:57am

Thank you @lissyx. Actually, I am training on Common Voice because the DeepSpeech 0.4 model does not give good inference on Indian English accents, so I thought I would use it, since Common Voice contains English in different accents. I have already spent $860 on GPU training; please tell me if it is futile to train on Common Voice, and I will stop the training now.
Can you suggest any other dataset specifically for Indian accents?
Also, does the model need to be trained on specific words, such as the names of Indian food or other dishes, or can its inference be improved just by adding the sentences that are useful to me?

lissyx · April 2, 2019, 10:02am

Please read the release notes, because we explicitly document that. Also, you should filter on Indian accent from Common Voice, and honestly, I doubt we have 500h with Indian accent.
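A minimal Python sketch of the accent filtering suggested above. It assumes the Common Voice metadata is a tab-separated file with `path`, `sentence` and `accent` columns, and that Indian English clips are labelled `indian`; both are assumptions about the release in use, so check the header and the accent values of your own download. The function name `filter_by_accent` and the sample rows are made up for illustration.

```python
import csv
import io


def filter_by_accent(tsv_file, accent="indian"):
    """Return the metadata rows whose 'accent' column equals `accent`."""
    # QUOTE_NONE: treat quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row for row in reader
        if (row.get("accent") or "").strip().lower() == accent
    ]


# Tiny in-memory sample standing in for a real metadata file (made-up rows).
SAMPLE = (
    "path\tsentence\taccent\n"
    "a.mp3\twelcome to the menu screen\tindian\n"
    "b.mp3\tplease take my order\tus\n"
    "c.mp3\tmy phone number is\t\n"
)

indian_rows = filter_by_accent(io.StringIO(SAMPLE))
print(len(indian_rows), [row["path"] for row in indian_rows])
```

With a real download, pass an open file handle instead of the `io.StringIO` sample and count the matching rows before deciding how much accent-specific audio is actually available.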

I don’t have any, sorry

We could successfully test with the released English model and a specific LM built with a set of commands; it even worked with my poor French English accent. So maybe just create a small LM with what you might expect?
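One way to prepare the text for such a small LM, sketched in Python. It normalises a handful of domain sentences to lowercase letters, apostrophes and spaces (assuming an alphabet like the one used by the released English model) and emits one sentence per line. The example sentences, the digit-spelling table and the function name `normalize` are illustrative assumptions; the commands that turn this text into lm.binary and the trie are the ones documented in data/lm/README.md and are not reproduced here.

```python
import re

# Spell out digits so phone numbers stay inside a letters-only alphabet.
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(sentence):
    """Lowercase a sentence and keep only a-z, apostrophes and single spaces."""
    spelled = "".join(
        f" {DIGITS[ch]} " if ch in DIGITS else ch for ch in sentence
    )
    cleaned = re.sub(r"[^a-z' ]+", " ", spelled.lower())
    return " ".join(cleaned.split())


# Hypothetical restaurant-domain sentences; replace with your own.
sentences = [
    "Welcome to the menu screen.",
    "I'd like one paneer tikka, please!",
    "My phone number is 98765",
]

# One normalised sentence per line; save this as the LM's input text file.
corpus_text = "\n".join(normalize(s) for s in sentences) + "\n"
print(corpus_text, end="")
```

Keeping the text restricted to what callers are expected to say is what makes the resulting LM "small": anything outside these sentences becomes much less likely to be decoded.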

sanjay.pandey · April 2, 2019, 10:15am

Ok, thank you so much.
So if I create a language model entirely from my own set of sentences/words, instead of adding sentences to your released language model, would it work? And please guide me on how to create my own language model. Would it be okay if I don't train on the sentences I include in vocab.txt? Can I put sentences directly into vocab.txt, create the LM and trie, and then use them for inference without training on those sentences?

lissyx · April 2, 2019, 10:18am

Please read the doc, it’s all explained.

I don’t think that training on what you have in vocab.txt is a good idea, it would overfit …

See previous reply

sanjay.pandey · April 2, 2019, 10:31am

Thank you so much @lissyx for guiding me and saving me expenses. Can you please provide, or point me to, the set of commands to build a specific LM? I badly need it.

lissyx · April 2, 2019, 10:35am

Seriously? It’s the third time I’m telling you to read the documentation, the first link I gave.

sanjay.pandey · April 5, 2019, 6:03am

Okay @lissyx, thank you for the help; I will try that. Also, while training the model on Mozilla's English Common Voice dataset, changing the learning rate from the default (0.0001) to 0.00001 worked like a charm. I am currently on the 11th epoch, the loss is decreasing, and it has reached 3.28. I will be training until the 35th epoch. I tried inference when the loss was 8, and it did better than the current model on a recorded clip in which “Welcome to the menu screen” is said. The DeepSpeech 0.4 model predicted “we then to the many screen”, and the model I trained further, at loss 8, predicted “we come to the many screen”. A small change, but heading in the right direction. I hope that combining the newly trained model with a customised language model will do the job for me. Thank you so much for your prompt replies, and thanks to your whole team for releasing the model.
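To put a number on "small change, right direction", the two transcripts quoted above can be scored against the reference with word and character error rates. This is a plain Levenshtein-based sketch in Python, not the scoring code used by DeepSpeech itself; the function names are hypothetical and the reference is lowercased to match the model output.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


REFERENCE = "welcome to the menu screen"
RELEASED = "we then to the many screen"    # DeepSpeech 0.4 output from the post
FINE_TUNED = "we come to the many screen"  # further-trained model at loss 8

for name, hyp in (("released", RELEASED), ("fine-tuned", FINE_TUNED)):
    print(f"{name}: WER={wer(REFERENCE, hyp):.2f} CER={cer(REFERENCE, hyp):.2f}")
```

On this single clip both outputs need three word-level edits against five reference words, so the word error rate is identical; the improvement only shows up at the character level. One clip is not a benchmark, so a held-out set of domain recordings would be needed to say anything firmer.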
