We started training our model using the command below:
python3 DeepSpeech.py --train_files …/clips/train.csv --train_batch_size 100 --train_cudnn --dev_files …/clips/dev.csv --dev_batch_size 100 --test_files …/clips/test.csv --test_batch_size 100 --log_level 0
We are using an 8 GB GPU, and it is being utilized at 100%.
We are using approximately 12,000 files to train the model.
The audio files are sampled at 16 kHz, with an average duration of 60 seconds.
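For reference, here is a minimal sketch we use to sanity-check the clip count and total audio duration from train.csv. It assumes the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and 16 kHz, 16-bit mono WAV files; the path is a placeholder, not our actual path:

import csv

# Placeholder path; point this at the actual train.csv (shortened in the command above).
csv_path = "clips/train.csv"

with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))

# DeepSpeech CSVs list wav_filesize in bytes. For 16 kHz, 16-bit mono PCM WAV,
# duration is roughly (filesize - 44-byte header) / 32000 bytes per second.
total_seconds = sum((int(r["wav_filesize"]) - 44) / 32000 for r in rows)

print("clips:", len(rows))
print("total audio: %.1f hours" % (total_seconds / 3600))
print("average clip: %.1f seconds" % (total_seconds / len(rows)))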
A few of the initial logged lines:
there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.567988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1e.0
2021-02-01 09:45:06.568049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-02-01 09:45:06.568082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-02-01 09:45:06.568103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-02-01 09:45:06.568132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-02-01 09:45:06.568152: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-02-01 09:45:06.568184: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-02-01 09:45:06.568210: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-02-01 09:45:06.568353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.569026: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.569596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-02-01 09:45:06.569642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-01 09:45:06.569664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2021-02-01 09:45:06.569674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2021-02-01 09:45:06.569798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.570423: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.571006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7171 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
WARNING:tensorflow:From /home/ubuntu/DeepSpeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py:71: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
W0201 09:45:06.574804 139693934528320 deprecation.py:323] From /home/ubuntu/DeepSpeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py:71: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
D Session opened.
I Loading best validating checkpoint from /home/ubuntu/.local/share/deepspeech/checkpoints/best_dev-85345
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2021-02-01 09:45:09.087454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-02-01 09:45:09.684289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 16.107815
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 2 | Loss: 39.940804
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 3 | Loss: 41.197852
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 4 | Loss: 41.779977
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 5 | Loss: 47.854725
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 6 | Loss: 50.081092
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 7 | Loss: 54.767900
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 50.239637
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 9 | Loss: 52.046041
As I am new to DeepSpeech training, I have a few questions:
- Training was interrupted a few times (system reboots, etc.). Each time we restarted it with the above command, the logs showed the earlier checkpoint being picked up, but training always started again from Epoch 0.
So is the earlier progress saved and resumed from where it stopped, or does training start over each time? (There is a small sketch for checking this at the end of this post.)
- Why is training taking so long on the GPU?
We are using 12,200 files to train the model.
The average duration of the 16 kHz audio files is 60 seconds.
Logged lines:
Epoch 0 | Training | Elapsed Time: 7:07:08 | Steps: 12200 | Loss: 873.814420
Epoch 0 | Training | Elapsed Time: 7:07:08 | Steps: 12200 | Loss: 873.814420
Is the elapsed time for one epoch reasonable, or is it taking longer than expected? (A rough calculation is in the second sketch at the end of this post.)
- What is the ideal number of training epochs required to train the model?
- What is the ideal duration for each audio file, and what is the ideal dataset size for training the model?
We are currently using 12,200 files to train.
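Regarding the first question (checkpoints): here is a minimal sketch we tried for reading global_step out of the checkpoint directory shown in the training log above, to see whether the step counter carries over across restarts even though the progress bar restarts at Epoch 0. It only assumes the TensorFlow version the DeepSpeech training code already uses; both calls exist in TF 1.x and 2.x:

import tensorflow as tf

# Checkpoint directory taken from the training log above.
ckpt_dir = "/home/ubuntu/.local/share/deepspeech/checkpoints"

# latest_checkpoint() returns the newest checkpoint prefix recorded in that directory;
# load_checkpoint() gives a reader for the tensors stored in it.
ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
reader = tf.train.load_checkpoint(ckpt_path)

print("latest checkpoint:", ckpt_path)
print("global_step:", reader.get_tensor("global_step"))

If global_step keeps growing across restarts, that would suggest earlier progress is being reused and only the per-run epoch counter resets.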
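Regarding the second question (training time): a rough back-of-the-envelope calculation using only the numbers above (12,200 clips, about 60 seconds each, 7:07:08 for 12,200 steps). These are estimates, not measurements:

clips = 12200
avg_clip_seconds = 60.0

# Total audio processed in one epoch.
total_hours = clips * avg_clip_seconds / 3600
print("total training audio per epoch: ~%.0f hours" % total_hours)   # ~203 hours

# The epoch summary above shows 12200 steps for 12200 clips (one clip per step),
# although the command requested --train_batch_size 100.
elapsed_seconds = 7 * 3600 + 7 * 60 + 8
print("average time per step: ~%.1f s" % (elapsed_seconds / clips))  # ~2.1 s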