@lissyx I'm working on fine-tuning and running into an error. I suspect an OOM, but I'm not sure.
CUDA and cuDNN info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
root@d7284da3dc5c:/DeepSpeech# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
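For reference, a quick way to confirm which GPU the container sees and how much memory is free before the run (assuming nvidia-smi is available inside the container):
# Show device 0 and its total/used/free memory
nvidia-smi --id=0 --query-gpu=name,memory.total,memory.used,memory.free --format=csv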
Fine-tuning shell script:
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi
export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
#export TF_FORCE_GPU_ALLOW_GROWTH=true
python3 -u DeepSpeech.py \
    --train_files //tmp/external/google_cmds_csvs/train.csv \
    --test_files //tmp/external/google_cmds_csvs/test.csv \
    --dev_files //tmp/external/google_cmds_csvs/dev.csv \
    --epochs 5 \
    --train_batch_size 1 \
    --dev_batch_size 1 \
    --test_batch_size 1 \
    --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ \
    --use_allow_growth \
    --n_hidden 2048 \
    --train_cudnn \
    --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/ \
    "$@"
I imagine this error means my GPU is running out of memory, and that it probably can't handle n_hidden 2048 even with all batch sizes set to 1?
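To tell a real OOM apart from a driver-level fault, I also plan to check the kernel log for NVIDIA Xid messages right after the crash (a sketch; I haven't confirmed the container can read dmesg). As far as I know, a plain TensorFlow OOM would normally abort with a "Resource exhausted" message rather than "Unexpected Event status".
# Look for NVIDIA driver fault (Xid) messages logged around the crash time
dmesg | grep -iE 'xid|nvrm' | tail -n 20
Here is the full output from the failing run: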
I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 25.073557
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 22.607900
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 19.332927
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 18.299621
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 6 | Loss: 22.737631
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 7 | Loss: 21.567622
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 23.387867
2020-07-30 15:45:48.370974: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-07-30 15:45:48.371022: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:830] Non-OK-status: GpuLaunchKernel(ColumnReduceKernel<IN_T, OUT_T, Op>, grid_dim, block_dim, 0, cu_stream, in, out, extent_x, extent_y, op, init) status: Internal: unknown error
Fatal Python error: Aborted
Thread 0x00007f5199679700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/pyAborted