More than 2 days training

I'm training on the Spanish dataset (5 GB, from https://voice.mozilla.org/es/datasets) using the Dockerfile on a Google Compute Engine instance with:

  • n1-standard-2 (2 vCPUs, 7.5 GB of memory)
  • 1 x NVIDIA Tesla K80
  • Disk 250 GB

and running

python3 ./../DeepSpeech.py --train_files clips/train.csv --dev_files clips/dev.csv --test_files clips/test.csv --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.20 --epochs 75 --lm_alpha 0.75 --lm_beta 1.85 --export_dir export/ --checkpoint_dir export/ --export_language es --alphabet_config_path ./../alphabet.txt --scorer ./../data/lm/kenlm.scorer

It has been running for more than 2 days and is still at this step:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:02:26 | Steps: 1 | Loss: 271.899628
Epoch 0 |   Training | Elapsed Time: 0:05:02 | Steps: 2 | Loss: 222.278564
Epoch 0 |   Training | Elapsed Time: 0:07:44 | Steps: 3 | Loss: 199.541595
Epoch 0 |   Training | Elapsed Time: 0:10:32 | Steps: 4 | Loss: 182.779682
Epoch 0 |   Training | Elapsed Time: 0:13:29 | Steps: 5 | Loss: 164.832791
Epoch 0 |   Training | Elapsed Time: 0:16:29 | Steps: 6 | Loss: 153.213351
Epoch 0 |   Training | Elapsed Time: 0:19:31 | Steps: 7 | Loss: 144.940323
Epoch 0 |   Training | Elapsed Time: 0:22:33 | Steps: 8 | Loss: 137.074270
Epoch 0 |   Training | Elapsed Time: 0:25:42 | Steps: 9 | Loss: 130.677231
Epoch 0 |   Training | Elapsed Time: 0:28:50 | Steps: 10 | Loss: 126.708546
Epoch 0 |   Training | Elapsed Time: 0:31:58 | Steps: 11 | Loss: 123.005626
Epoch 0 |   Training | Elapsed Time: 0:35:09 | Steps: 12 | Loss: 119.870375
Epoch 0 |   Training | Elapsed Time: 0:38:22 | Steps: 13 | Loss: 117.243926
Epoch 0 |   Training | Elapsed Time: 0:41:35 | Steps: 14 | Loss: 115.363463
Epoch 0 |   Training | Elapsed Time: 0:44:51 | Steps: 15 | Loss: 113.836318
Epoch 0 |   Training | Elapsed Time: 0:48:07 | Steps: 16 | Loss: 112.211874
Epoch 0 |   Training | Elapsed Time: 0:51:27 | Steps: 17 | Loss: 110.579662
Epoch 0 |   Training | Elapsed Time: 0:54:44 | Steps: 18 | Loss: 109.162176
Epoch 0 |   Training | Elapsed Time: 0:58:04 | Steps: 19 | Loss: 108.104136
Epoch 0 |   Training | Elapsed Time: 1:01:26 | Steps: 20 | Loss: 107.648077
Epoch 0 |   Training | Elapsed Time: 1:04:51 | Steps: 21 | Loss: 106.782394
Epoch 0 |   Training | Elapsed Time: 1:08:18 | Steps: 22 | Loss: 106.333886
Epoch 0 |   Training | Elapsed Time: 1:11:44 | Steps: 23 | Loss: 105.654570
Epoch 0 |   Training | Elapsed Time: 1:15:15 | Steps: 24 | Loss: 105.428457
Epoch 0 |   Training | Elapsed Time: 1:18:46 | Steps: 25 | Loss: 104.798543
Epoch 0 |   Training | Elapsed Time: 1:22:20 | Steps: 26 | Loss: 104.487207
Epoch 0 |   Training | Elapsed Time: 1:25:57 | Steps: 27 | Loss: 104.230118
Epoch 0 |   Training | Elapsed Time: 1:29:33 | Steps: 28 | Loss: 104.001480
Epoch 0 |   Training | Elapsed Time: 1:33:09 | Steps: 29 | Loss: 103.869217
Epoch 0 |   Training | Elapsed Time: 1:36:50 | Steps: 30 | Loss: 103.782717
Epoch 0 |   Training | Elapsed Time: 1:40:31 | Steps: 31 | Loss: 103.799426
Epoch 0 |   Training | Elapsed Time: 1:44:12 | Steps: 32 | Loss: 103.719172
Epoch 0 |   Training | Elapsed Time: 1:47:56 | Steps: 33 | Loss: 103.740066
Epoch 0 |   Training | Elapsed Time: 1:51:45 | Steps: 34 | Loss: 103.571454
Epoch 0 |   Training | Elapsed Time: 1:55:31 | Steps: 35 | Loss: 103.614604
Epoch 0 |   Training | Elapsed Time: 1:59:20 | Steps: 36 | Loss: 103.584468
Epoch 0 |   Training | Elapsed Time: 2:03:08 | Steps: 37 | Loss: 103.708817
Epoch 0 |   Training | Elapsed Time: 2:06:57 | Steps: 38 | Loss: 103.906307
Epoch 0 |   Training | Elapsed Time: 2:10:52 | Steps: 39 | Loss: 103.933918
Epoch 0 |   Training | Elapsed Time: 2:14:46 | Steps: 40 | Loss: 104.321078
Epoch 0 |   Training | Elapsed Time: 2:18:41 | Steps: 41 | Loss: 104.641836
Epoch 0 |   Training | Elapsed Time: 2:22:40 | Steps: 42 | Loss: 104.735922
Epoch 0 |   Training | Elapsed Time: 2:26:39 | Steps: 43 | Loss: 104.971744
Epoch 0 |   Training | Elapsed Time: 2:30:39 | Steps: 44 | Loss: 105.218100
Epoch 0 |   Training | Elapsed Time: 2:34:43 | Steps: 45 | Loss: 105.342824
Epoch 0 |   Training | Elapsed Time: 2:38:47 | Steps: 46 | Loss: 105.573358
Epoch 0 |   Training | Elapsed Time: 2:42:50 | Steps: 47 | Loss: 105.699668
Epoch 0 |   Training | Elapsed Time: 2:46:56 | Steps: 48 | Loss: 106.018716
Epoch 0 |   Training | Elapsed Time: 2:50:59 | Steps: 49 | Loss: 106.180050
Epoch 0 |   Training | Elapsed Time: 2:55:02 | Steps: 50 | Loss: 106.472491
Epoch 0 |   Training | Elapsed Time: 2:59:11 | Steps: 51 | Loss: 106.709407
Epoch 0 |   Training | Elapsed Time: 3:03:20 | Steps: 52 | Loss: 106.975191
Epoch 0 |   Training | Elapsed Time: 3:07:29 | Steps: 53 | Loss: 107.313255
Epoch 0 |   Training | Elapsed Time: 3:11:41 | Steps: 54 | Loss: 107.645746
Epoch 0 |   Training | Elapsed Time: 3:15:54 | Steps: 55 | Loss: 107.833867
Epoch 0 |   Training | Elapsed Time: 3:20:10 | Steps: 56 | Loss: 108.080491
Epoch 0 |   Training | Elapsed Time: 3:24:31 | Steps: 57 | Loss: 108.405488
Epoch 0 |   Training | Elapsed Time: 3:28:49 | Steps: 58 | Loss: 108.767881
Epoch 0 |   Training | Elapsed Time: 3:33:11 | Steps: 59 | Loss: 109.118651
Epoch 0 |   Training | Elapsed Time: 3:37:37 | Steps: 60 | Loss: 109.468564
Epoch 0 |   Training | Elapsed Time: 3:42:05 | Steps: 61 | Loss: 109.781096
Epoch 0 |   Training | Elapsed Time: 3:46:35 | Steps: 62 | Loss: 109.950108
Epoch 0 |   Training | Elapsed Time: 3:51:15 | Steps: 63 | Loss: 110.234678 

How long might it take to finish? Can anyone recommend a good Google Compute Engine configuration, or another provider?

Can I tune the parameters to make training faster without losing much accuracy?

Thanks for your help!!

That doesn’t look like it’s using the GPU.
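
One way to put a number on this is to measure the time per step from the progress lines in the log. The sketch below is only an illustration: it parses lines in the format shown above and averages the seconds per step, which for the posted log comes to roughly 3.7 minutes per step at batch size 128. The helper names and the `n_clips` argument are hypothetical; the thread does not give the number of training clips, so the epoch estimate needs that value supplied by whoever runs it.

```python
import math
import re

# Matches the "Elapsed Time: H:MM:SS | Steps: N" part of a progress line.
LINE_RE = re.compile(r"Elapsed Time: (\d+):(\d{2}):(\d{2}) \| Steps: (\d+)")

# Three lines copied from the log in this thread.
SAMPLE_LOG = """\
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:02:26 | Steps: 1 | Loss: 271.899628
Epoch 0 |   Training | Elapsed Time: 3:51:15 | Steps: 63 | Loss: 110.234678
"""


def parse_progress(log_text):
    """Return (elapsed_seconds, step) pairs for every progress line found."""
    points = []
    for h, m, s, step in LINE_RE.findall(log_text):
        points.append((int(h) * 3600 + int(m) * 60 + int(s), int(step)))
    return points


def seconds_per_step(points):
    """Average seconds per step between the first and last completed step."""
    done = [p for p in points if p[1] >= 1]  # skip the initial "Steps: 0" line
    if len(done) < 2 or done[-1][1] == done[0][1]:
        raise ValueError("need at least two different completed steps")
    (t0, s0), (t1, s1) = done[0], done[-1]
    return (t1 - t0) / (s1 - s0)


def epoch_hours(sec_per_step, n_clips, batch_size):
    """Estimated hours for one epoch; n_clips is an assumption supplied by the caller."""
    steps = math.ceil(n_clips / batch_size)
    return steps * sec_per_step / 3600


if __name__ == "__main__":
    rate = seconds_per_step(parse_progress(SAMPLE_LOG))
    print("seconds per step: %.1f" % rate)
```

Passing the row count of clips/train.csv as `n_clips`, together with the batch size from the command, gives a rough per-epoch figure that can be compared before and after the GPU is enabled.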


Thanks!! :sweat_smile: I need to install nvidia-docker (https://github.com/NVIDIA/nvidia-docker).
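
A quick check that the container can actually see the GPU might look like the sketch below. It assumes the NVIDIA driver tools are exposed inside the container, which is what nvidia-docker is meant to provide, and it only calls `nvidia-smi -L`, which lists the visible GPUs. The function names are hypothetical, and an empty result only means the tool was missing or reported nothing, not why.

```python
import shutil
import subprocess


def parse_gpu_lines(text):
    """Keep the lines of `nvidia-smi -L` output that describe a GPU."""
    return [line.strip() for line in text.splitlines()
            if line.strip().startswith("GPU ")]


def visible_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if none are visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # tool not on PATH: the driver is probably not exposed here
    try:
        result = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("Visible GPUs:")
        for gpu in gpus:
            print("  " + gpu)
    else:
        print("No GPU visible from this environment.")
```

Run it inside the training container. If it reports no GPU, the container was likely started without GPU access; even when it does list the K80, training will only use it if the GPU build of TensorFlow is installed in the image.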
