Unable to train model with default checkpoints and frozen model


(Gr8nishan) #1

Hi,

I am trying to train on my custom data set. I am trying two approaches: one is to adapt the default frozen model, and the other is to start from the default checkpoints. I have attached my system configuration.

When I train from the checkpoint, this is the command I run:

python -u DeepSpeech.py \
  --train_files /home/hdpuser/models/csv/TRAIN/TRAIN.csv \
  --dev_files /home/hdpuser/models/csv/DEV/DEV.csv \
  --test_files /home/hdpuser/models/csv/TEST/TEST.csv \
  --n_hidden 2048 \
  --epoch 3 \
  --export_dir /home/hdpuser/models/results/model_export/ \
  --lm_binary_path /home/hdpuser/models/lm.binary \
  --checkpoint_dir /home/hdpuser/models/results/checkout/ \
  --decoder_library_path /home/hdpuser/DeepSpeech/NativeClient/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/hdpuser/models/alphabet.txt \
  --lm_trie_path /home/hdpuser/models/trie \
  --summary_dir /home/hdpuser/models/summary \
  --validation_step 1 \
  --limit_train 2 \
  --limit_test 2 \
  --limit_dev 2

When I train from the frozen model, this is my command:

python -u DeepSpeech.py \
  --train_files /home/hdpuser/models/csv/TRAIN/TRAIN.csv \
  --dev_files /home/hdpuser/models/csv/DEV/DEV.csv \
  --test_files /home/hdpuser/models/csv/TEST/TEST.csv \
  --initialize_from_frozen_model /home/hdpuser/models/output_graph.pb \
  --n_hidden 2048 \
  --epoch 3 \
  --export_dir /home/hdpuser/models/results/model_export/ \
  --lm_binary_path /home/hdpuser/models/lm.binary \
  --checkpoint_dir /home/hdpuser/models/results/checkout/ \
  --decoder_library_path /home/hdpuser/DeepSpeech/NativeClient/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/hdpuser/models/alphabet.txt \
  --lm_trie_path /home/hdpuser/models/trie \
  --summary_dir /home/hdpuser/models/summary \
  --validation_step 1 \
  --limit_train 2 \
  --limit_test 2 \
  --limit_dev 2

Both of these processes get killed automatically, and when I run dmesg I get the following:

**Out of memory: Kill process 3382 (python) score 402 or sacrifice child**
**Killed process 3382 (python) total-vm:10642244kB, anon-rss:6961516kB, file-rss:0kB**

Can anyone please help with this and point out exactly where the issue is?

Thanks in advance :slight_smile:


(Lissyx) #2

Your kernel already gave you the answer: not enough memory, so the OOM killer killed your process.
You seem to have only 10GB. Can you share more details on your configuration? Also, please avoid images: they are unusable, not searchable, and hard to read.
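
For anyone reading similar dmesg output, here is a minimal sketch that pulls the numbers out of a "Killed process" line and converts them to GiB. The function names `parse_oom_kill` and `kb_to_gib` are made up for illustration, and the pattern assumes the line layout shown in this thread; other kernel versions may format the line differently.

```python
import re

# Pattern for the "Killed process" line as it appears in this thread's dmesg output.
OOM_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)\s+"
    r"total-vm:(?P<total_vm_kb>\d+)kB, "
    r"anon-rss:(?P<anon_rss_kb>\d+)kB, "
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return the fields of an OOM 'Killed process' line, or None if it does not match."""
    match = OOM_KILL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the process name is a number.
    return {key: (value if key == "name" else int(value)) for key, value in fields.items()}


def kb_to_gib(kilobytes):
    """Convert the kernel's kB figures (units of 1024 bytes) to GiB."""
    return kilobytes / (1024 ** 2)


line = "Killed process 3382 (python) total-vm:10642244kB, anon-rss:6961516kB, file-rss:0kB"
info = parse_oom_kill(line)
print("pid %d (%s): virtual size %.2f GiB, resident (anon) %.2f GiB" % (
    info["pid"],
    info["name"],
    kb_to_gib(info["total_vm_kb"]),
    kb_to_gib(info["anon_rss_kb"]),
))
```

`total-vm` is the virtual address space the process had reserved, while `anon-rss` is the memory it actually held in RAM when it was killed, so the second figure is usually the more useful one to compare against installed memory.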


(Gr8nishan) #3

This is what df -H gives

Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/rhel_redhat7-root   60G   24G   37G  40% /
devtmpfs                       4.1G     0  4.1G   0% /dev
tmpfs                          4.1G   87k  4.1G   1% /dev/shm
tmpfs                          4.1G   68M  4.1G   2% /run
tmpfs                          4.1G     0  4.1G   0% /sys/fs/cgroup
/dev/sda1                      521M  202M  320M  39% /boot
tmpfs                          820M   17k  820M   1% /run/user/42
tmpfs                          820M     0  820M   0% /run/user/1001

This is what vmstat -s gives

     16391852 K total memory
      8774156 K used memory
        23140 K active memory
        94240 K inactive memory
      7357224 K free memory
            0 K buffer memory
       260472 K swap cache
      4079612 K total swap
       235604 K used swap
      3844008 K free swap
      1112900 non-nice user cpu ticks
          145 nice user cpu ticks
       233230 system cpu ticks
     35209299 idle cpu ticks
        37630 IO-wait cpu ticks
            0 IRQ cpu ticks
         4681 softirq cpu ticks
            0 stolen cpu ticks
     29324743 pages paged in
     71190650 pages paged out
       585651 pages swapped in
      9146113 pages swapped out
     44323896 interrupts
     51223100 CPU context switches
   1520846821 boot time
       203306 forks

Message in dmesg:

Out of memory: Kill process 7861 (python) score 522 or sacrifice child
Killed process 7861 (python) total-vm:12894796kB, anon-rss:6854196kB, file-rss:0kB

Hope it helps. Let me know if you need any specific details.


(Lissyx) #4

So you have 8GB RAM, 4GB swap. Looks like it’s not enough, that’s all.
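
To double-check memory figures from `vmstat -s`, the lines ending in a `K` unit can be converted to GiB. The following is only a sketch: `parse_vmstat_memory` is a hypothetical helper, and the sample text is a subset of the output posted earlier in this thread. On those figures, the `total memory` line converts to roughly 15.6 GiB and `free memory` to roughly 7 GiB.

```python
import re

# Matches lines such as "  16391852 K total memory"; counter lines without "K" are skipped.
VMSTAT_K_LINE = re.compile(r"^\s*(\d+)\s+K\s+(.+?)\s*$")


def parse_vmstat_memory(text):
    """Map labels such as 'total memory' to their values in units of 1024 bytes."""
    values = {}
    for line in text.splitlines():
        match = VMSTAT_K_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


# Subset of the `vmstat -s` output posted in this thread.
sample = """\
     16391852 K total memory
      8774156 K used memory
      7357224 K free memory
      4079612 K total swap
      3844008 K free swap
       203306 forks
"""

stats = parse_vmstat_memory(sample)
for label in ("total memory", "used memory", "free memory", "total swap", "free swap"):
    print("%-12s %6.2f GiB" % (label, stats[label] / 1024 ** 2))
```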


(Gr8nishan) #5

What configuration would be required to train the models?


(Lissyx) #6

That depends on the size of the model and how much memory you have free. Likely at least 16GB.
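
One way to act on this before launching a long run is to check available memory up front. The sketch below is illustrative only: the helper names are hypothetical, the 16 GiB default simply mirrors the rough estimate above and is not a documented DeepSpeech requirement, and the sample reuses figures from this thread laid out in the `/proc/meminfo` format.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value} (KiB for memory fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(meminfo):
    """Best-effort available memory in GiB."""
    # MemAvailable is missing on some older kernels; fall back to MemFree.
    kib = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return kib / 1024 ** 2


def has_enough_memory(meminfo, required_gib=16.0):
    """required_gib is an assumed threshold, not an official requirement."""
    return available_gib(meminfo) >= required_gib


# Illustrative sample; on a real Linux machine use open("/proc/meminfo").read() instead.
sample = """\
MemTotal:       16391852 kB
MemFree:         7357224 kB
SwapTotal:       4079612 kB
"""

info = parse_meminfo(sample)
print("available: %.2f GiB, enough for training: %s" % (
    available_gib(info), has_enough_memory(info)))
```

The right threshold depends on the model size and batch size, so treat `required_gib` as a value to tune rather than a fixed limit.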


(Tanner Filip) #7

10 posts were split to a new topic: Error when training model