Help with validating my current approach for training on the Common Voice dataset

Hi, I have a paid Azure VM deployed specifically for training Mozilla DeepSpeech models on the Common Voice dataset, so that speech recognition will be as accurate as possible.

My VM specs:
6 vCPU cores
56 GB RAM
356 GB SSD
1 x NVIDIA Tesla K80 GPU

Cost: $1.23 / hour

But I am a newbie to DeepSpeech, and I wanted to clarify my approach and understanding so that I do not waste the paid hours of my VM and end up paying for nothing.

So what I want is to train on the Common Voice dataset with the best possible accuracy.

I am aware of the guide for training on the Common Voice dataset.

Hence I am following this approach:

    cd DeepSpeech
    pip3 install -r requirements.txt
    pip3 install $(python3 util/taskcluster.py --decoder)
    pip3 uninstall tensorflow
    pip3 install 'tensorflow-gpu==1.14.0'
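
Before burning paid VM hours, I also plan to run a quick sanity check (my own addition, not from the guide) that tensorflow-gpu actually sees the K80:

    # Confirm tensorflow-gpu 1.14 can see the Tesla K80 before training
    python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
    # Driver-level view of the GPU
    nvidia-smi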
  1. Downloaded the English Common Voice dataset and extracted it to the folder en

so the current directory has a folder en in it

  2. Ran the CommonVoice v2.0 importer

sudo apt-get install sox

sudo apt-get install -y libsox-dev

DeepSpeech/bin/import_cv2.py --filter DeepSpeech/data/alphabet.txt en

so the current directory now has the folder en/clips
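
Just to be safe, before moving on I will confirm the importer actually produced the CSV splits that the training command below expects (file names taken from that command):

    # The importer should have written the train/dev/test splits here
    ls en/clips/*.csv
    # Peek at the first few rows (wav_filename, wav_filesize, transcript)
    head -n 3 en/clips/train.csv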

  3. Started training, giving the model output directory models
    and the checkpoint directory checkpoints

so the current directory has the folders en, models, and checkpoints

DeepSpeech/DeepSpeech.py --train_files en/clips/train.csv  \
 --dev_files en/clips/dev.csv --test_files en/clips/test.csv  \
--automatic_mixed_precision=True --checkpoint_dir checkpoints \
 --export_dir models
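
Since each epoch can run for many hours, I am also thinking of launching the same command under nohup (or tmux/screen) so that an SSH disconnect does not kill the run; this is just my own plan, not part of the guide:

    # Same command as above, kept alive in the background with a log file
    nohup DeepSpeech/DeepSpeech.py --train_files en/clips/train.csv \
        --dev_files en/clips/dev.csv --test_files en/clips/test.csv \
        --automatic_mixed_precision=True --checkpoint_dir checkpoints \
        --export_dir models > training.log 2>&1 &
    # Follow progress
    tail -f training.log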

Now, just a rookie question: will this process terminate by itself, or do I have to terminate it myself after many epochs and a low loss?

  1. Now I am confused about the next steps. Can the models written to the --export_dir be used directly in the code, or are there more steps to perform later?

  2. (Optional question) Given the VM specs, roughly how much time will it take to train for the highest accuracy? (for VM cost prediction)

Please don't mind my rookie questions, as I am a newbie who only recently started learning ML, TF, etc.

I suggest using Google Colab to set up and understand how it works.

Yes.

If you want the memory-mapped model, you will need to use the tool to convert the .pb.
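
Roughly like this, with TensorFlow's convert_graphdef_memmapped_format tool (paths are only examples matching the --export_dir used above):

    # Convert the exported .pb into a memory-mapped .pbmm
    convert_graphdef_memmapped_format \
        --in_graph=models/output_graph.pb \
        --out_graph=models/output_graph.pbmm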

My Azure VM with a single K80, training on 400-600 hours of audio, took about a day and a half per epoch (before the CuDNN patch). Regarding the Azure VM, I would suggest first taking a look at low-priority or spot scale sets.

See pricing here: https://azure.microsoft.com/en-ca/pricing/details/batch/

Hi @carlfm01, thanks for answering my questions.

Unfortunately for me, Colab is not useful for training this big dataset, as it gets reset after a given time.

Secondly, I just have three more doubts, if you can please clear them up :grinning:

  1. Will this training process terminate by itself, or do I have to terminate it myself after many epochs with a low loss?

  2. What is a memory-mapped model, and what are the benefits of creating it?

  3. You mentioned something regarding the CuDNN patch. Will running the script below use the CuDNN approach directly, or are there any flags I need to add to the script?

DeepSpeech/DeepSpeech.py --train_files en/clips/train.csv  \
--dev_files en/clips/dev.csv --test_files en/clips/test.csv  \
--automatic_mixed_precision=True --checkpoint_dir checkpoints \
--export_dir models

Don’t expect too much of K80 GPUs.

Check the documentation and --help; you may want to use the early stopping feature.
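
For example, you can list the exact flag names on your checkout instead of guessing them:

    # Print all flags and search for the early stopping related ones
    python3 DeepSpeech/DeepSpeech.py --helpfull 2>&1 | grep -i -A 2 early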

Please be clear about “directly in the code”. As documented, --export_dir produces ready-to-use models. But you don’t need to run that on the machine that has the GPU. You can train without it, copy back the checkpoints, and perform the export on your local system.
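
For example, something along these lines (host and paths are placeholders; double-check the export-only behaviour against --help for your version):

    # On your local machine: pull the checkpoints back from the VM
    scp -r azureuser@<vm-ip>:~/checkpoints ./checkpoints
    # Export locally from those checkpoints; with no --train_files/--test_files
    # given, training and testing are skipped and only the export should run
    python3 DeepSpeech/DeepSpeech.py --checkpoint_dir checkpoints --export_dir models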

Hard to tell. It depends on the amount of data you have, on the performance of the GPU as well as of the whole machine, and on how many epochs you perform.

Please document yourself about mmap().

Again, this is explained in the documentation and the release notes. By default, training will use CuDNN specificities to speed up training. When you want to run on a non-CuDNN setup, you need to pass the matching flag.

Hi,
Thanks for your help. With some research based on your answer, I managed to understand that:

  1. For good accuracy it will take time to train on the K80 GPU, approximately 15-20 epochs.

  2. One epoch takes around 24 hours to complete (on the K80 GPU), so it will take roughly 20-25 days to train the model for optimum accuracy.

  3. If I want the script to terminate by itself, I need to pass the --epoch flag; it will store the models in the --export_dir path, and then they can be used directly in the code.
    So about this, what I mean by using them directly in the code is (see the sketch right after this list):

  • I want to use the trained model the same way we use the released pre-trained models
  • So I will simply replace the model path of the pre-trained model with the model path of my trained model
  4. Also, I got to know that there is a --use_cudnn_rnn true flag for using the CuDNN approach.
    Also, I note that I have a GPU system and I won't be needing any CPU models.
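
Concretely, what I have in mind is something like this, assuming the export is named output_graph.pb / output_graph.pbmm as above and a 0.6-style deepspeech client (exact flags may differ per version):

    # The same call I make today with the released pre-trained model,
    # only with --model pointing at my own export instead
    deepspeech --model models/output_graph.pbmm --audio some_test_clip.wav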

So the confusion here is:

What is the need for the --use_cudnn_rnn true flag if, by default, training uses CuDNN specificities to speed up training? As I had seen, --use_cudnn_rnn is set to false by default.

And lastly, if you can please validate this research (is it correct, or am I wrong at any point?) and clear these doubts, I will be ready to start training.

No, you're right, we don't enable it by default.

That’s … vague …

24h? Are you sure the GPU is properly leveraged? How much data do you have?
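
A quick way to verify is to watch GPU utilization while a training epoch is running:

    # Refresh GPU utilization / memory usage every 5 seconds during training
    watch -n 5 nvidia-smi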