Hi, I have a paid Azure VM deployed specifically for training Mozilla DeepSpeech models on the Common Voice dataset, so that the speech recognition will be very accurate.
My VM specs:
6 vCPU cores
56 GB RAM
356 GB SSD
1 x NVIDIA Tesla K80 GPU
Cost: $1.23/hour
But I am a newbie to DeepSpeech, and I want to clarify my approach and understanding so that I do not waste the paid hours of my VM and end up paying for nothing.
So what I want is to train on the Common Voice dataset with the best accuracy.
I am aware of the guide for training on the Common Voice dataset.
Now, just a rookie question: will this process terminate by itself, or do I have to terminate it myself after many epochs and a low loss?
Now I am confused about the next steps. Can the models produced in --export_dir be used directly in the code, or are there further steps to perform?
(Optional question) Given the VM specs, how much time will it take (approximately) to train for the highest accuracy? (For VM cost prediction.)
Please don't mind my rookie questions, as I am a newbie who has only recently started learning ML, TF, etc.
I suggest using Google Colab to set up and understand how it works.
Yes.
If you want the memory-mapped model, you will need to use the tool to convert the .pb.
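The post doesn't name the tool, but in the DeepSpeech documentation this is the `convert_graphdef_memmapped_format` utility built from TensorFlow. A minimal sketch, with placeholder paths:

```
convert_graphdef_memmapped_format \
  --in_graph=/path/to/export/output_graph.pb \
  --out_graph=/path/to/export/output_graph.pbmm
```

The resulting .pbmm is read via mmap() at inference time instead of being loaded fully into RAM.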
My Azure VM with a single K80, training on 400-600 hours of audio, took about a day and a half per epoch (before the CuDNN patch). Regarding Azure VMs, I would suggest first taking a look at low-priority or spot scale sets.
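For the low-priority/spot suggestion, a hypothetical Azure CLI sketch (resource names and the price cap are placeholders; Standard_NC6 matches the K80 specs above, and scale sets would use `az vmss create` with the same priority flags):

```
az vm create \
  --resource-group my-resource-group \
  --name deepspeech-train \
  --image UbuntuLTS \
  --size Standard_NC6 \
  --priority Spot \
  --max-price 0.40 \
  --eviction-policy Deallocate
```

Spot capacity can be evicted at any time, so keep checkpoints on a persistent disk.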
Unfortunately for me, Colab is not useful for training such a big dataset, as it gets reset after a given time.
Secondly, I just have three more doubts; please clear them up if you can:
Will this training process terminate by itself, or do I have to terminate it myself after many epochs with a low loss?
What is a memory-mapped model, and what are the benefits of creating it?
You mentioned something regarding the CuDNN patch. Will running the script below use the CuDNN approach directly, or are there any flags that need to be added?
lissyx:
Don’t expect too much of K80 GPUs.
Check the documentation and --help; you may want to use the early stopping feature.
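A minimal sketch of a run that can stop on its own; the exact flag names vary between DeepSpeech releases (some use `--epoch`, newer ones `--epochs`), so verify them against `python3 DeepSpeech.py --helpfull`:

```
python3 DeepSpeech.py \
  --train_files clips/train.csv \
  --dev_files clips/dev.csv \
  --test_files clips/test.csv \
  --epochs 30 \
  --early_stop=true \
  --export_dir /path/to/export
```

With early stopping enabled, training terminates by itself once the validation loss stops improving, even before the epoch limit is reached.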
Please be clear about “directly in the code”. As documented, --export_dir produces ready-to-use models. But you don’t need to run that on the machine that has the GPU. You can train without it, copy the checkpoints back, and perform the export on your local system.
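A sketch of that workflow, assuming the checkpoints have been copied back to the local machine: when no training files are passed, DeepSpeech.py only loads the checkpoint and writes the exported model to --export_dir.

```
python3 DeepSpeech.py \
  --checkpoint_dir /path/to/copied/checkpoints \
  --export_dir /path/to/export
```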
Hard to tell. It depends on the amount of data you have, on the performance of the GPU as well as of the whole machine, and on how many epochs you perform.
lissyx:
Please document yourself about mmap().
Again, this is explained in the documentation and the release notes. By default, training will use CuDNN specificities to speed up training. When you want to run on a non-CuDNN setup, you need to pass the matching flag.
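For reference, a sketch of enabling it explicitly on the GPU machine; the flag spelling differs across releases (this thread's version is `--use_cudnn_rnn`, later releases renamed it to `--train_cudnn`), and as confirmed further down in the thread it is not enabled by default here:

```
python3 DeepSpeech.py \
  --train_files clips/train.csv \
  --dev_files clips/dev.csv \
  --test_files clips/test.csv \
  --use_cudnn_rnn=true
```

Loading a CuDNN-trained checkpoint on a machine without CuDNN then requires the matching compatibility flag for your release.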
Hi,
Thanks for your help. With my research and your answers, I managed to understand the following:
For great accuracy, it will take time to train on the K80 GPU (approx. 15-20 epochs).
1 epoch takes around 24 hours to complete (on the K80 GPU), so roughly it will take around 20-25 days to train the model for optimum accuracy (at $1.23/hour, that works out to roughly $590-740 in VM cost).
If I want the script to terminate by itself, I need to pass the --epoch flag; the models will be stored in the --export_dir path and can then be used directly in the code.
So about this, what I mean by using it directly in the code is:
I want to use the trained models the same way we use the pre-trained models.
So I will simply replace the path of the pre-trained model with the path of my trained models.
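As a sketch with placeholder paths: wherever inference currently points at the released model, the exported one drops in, e.g. with the `deepspeech` client (older clients also take `--alphabet`, and an external scorer/LM can be passed the same way):

```
# Before: the released pre-trained model
deepspeech --model deepspeech-models/output_graph.pbmm --audio my_audio.wav

# After: your own exported model
deepspeech --model /path/to/export/output_graph.pbmm --audio my_audio.wav
```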
Also, I got to know that there is a --use_cudnn_rnn true flag for using the CuDNN approach.
Also, I note that I have a GPU system, so I won't be needing any CPU models.
So the confusion here is:
What is the need of this --use_cudnn_rnn true flag if, by default, training will use CuDNN specificities to speed up training? As far as I have seen, --use_cudnn_rnn is set to false by default.
And lastly, if you can please validate this research (is it correct, or am I wrong at any point?) and clear up these doubts, I will be ready to start training.
lissyx:
No, you’re right; we don’t default to it.
That’s … vague …
24h? Are you sure the GPU is properly leveraged? How much data do you have?