Unable to train model with default checkpoints and frozen model

gr8nishan · March 14, 2018, 9:53am

Hi,

I am trying to train on my custom data set. I am trying two approaches: one is to adapt the default frozen model, and the second is to start from the default checkpoints. I have attached my system configuration.

When I train from the checkpoint, this is the command I run:

python -u DeepSpeech.py \
  --train_files /home/hdpuser/models/csv/TRAIN/TRAIN.csv \
  --dev_files /home/hdpuser/models/csv/DEV/DEV.csv \
  --test_files /home/hdpuser/models/csv/TEST/TEST.csv \
  --n_hidden 2048 \
  --epoch 3 \
  --export_dir /home/hdpuser/models/results/model_export/ \
  --lm_binary_path /home/hdpuser/models/lm.binary \
  --checkpoint_dir /home/hdpuser/models/results/checkout/ \
  --decoder_library_path /home/hdpuser/DeepSpeech/NativeClient/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/hdpuser/models/alphabet.txt \
  --lm_trie_path /home/hdpuser/models/trie \
  --summary_dir /home/hdpuser/models/summary \
  --validation_step 1 \
  --limit_train 2 \
  --limit_test 2 \
  --limit_dev 2

When I try to train from the frozen model, this is my command:

python -u DeepSpeech.py \
  --train_files /home/hdpuser/models/csv/TRAIN/TRAIN.csv \
  --dev_files /home/hdpuser/models/csv/DEV/DEV.csv \
  --test_files /home/hdpuser/models/csv/TEST/TEST.csv \
  --initialize_from_frozen_model /home/hdpuser/models/output_graph.pb \
  --n_hidden 2048 \
  --epoch 3 \
  --export_dir /home/hdpuser/models/results/model_export/ \
  --lm_binary_path /home/hdpuser/models/lm.binary \
  --checkpoint_dir /home/hdpuser/models/results/checkout/ \
  --decoder_library_path /home/hdpuser/DeepSpeech/NativeClient/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/hdpuser/models/alphabet.txt \
  --lm_trie_path /home/hdpuser/models/trie \
  --summary_dir /home/hdpuser/models/summary \
  --validation_step 1 \
  --limit_train 2 \
  --limit_test 2 \
  --limit_dev 2

Both of these processes get killed automatically, and when I run dmesg I get the following:

**Out of memory: Kill process 3382 (python) score 402 or sacrifice child**
**Killed process 3382 (python) total-vm:10642244kB, anon-rss:6961516kB, file-rss:0kB**

Can anyone please help with this and point out exactly where the issue is?

Thanks in advance

lissyx · March 14, 2018, 11:32am

Your kernel already gave you the answer: not enough memory, so the OOM killer killed your process.
You seem to have only 10GB. Can you share more details on your configuration? Also, please avoid images: they are unusable, not searchable, and hard to read.
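
The sizes in that kernel message are reported in kB, which makes them hard to compare with the RAM installed. A minimal Python sketch, assuming the "Killed process" line has exactly the layout quoted in this thread, converts them to GiB. The name `parse_oom_line` is hypothetical and not part of DeepSpeech or any kernel tool.

```python
import re

# Matches the kernel's "Killed process" line as quoted in this thread.
# Assumption: other kernel versions may format this line differently.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_line(line):
    """Return pid, process name and sizes in GiB, or None if the line does not match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    kib_per_gib = 1024 ** 2
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_gib": int(match.group("total_vm")) / kib_per_gib,
        "anon_rss_gib": int(match.group("anon_rss")) / kib_per_gib,
    }


if __name__ == "__main__":
    sample = (
        "Killed process 3382 (python) total-vm:10642244kB, "
        "anon-rss:6961516kB, file-rss:0kB"
    )
    info = parse_oom_line(sample)
    print(
        f"{info['name']} (pid {info['pid']}): "
        f"virtual {info['total_vm_gib']:.1f} GiB, "
        f"resident {info['anon_rss_gib']:.1f} GiB"
    )
```

For the first message in this thread, this reports roughly 10.1 GiB of virtual memory and 6.6 GiB resident for the killed python process.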

gr8nishan · March 14, 2018, 12:30pm

This is what df -H gives:

    Filesystem                     Size  Used Avail Use% Mounted on
    /dev/mapper/rhel_redhat7-root   60G   24G   37G  40% /
    devtmpfs                       4.1G     0  4.1G   0% /dev
    tmpfs                          4.1G   87k  4.1G   1% /dev/shm
    tmpfs                          4.1G   68M  4.1G   2% /run
    tmpfs                          4.1G     0  4.1G   0% /sys/fs/cgroup
    /dev/sda1                      521M  202M  320M  39% /boot
    tmpfs                          820M   17k  820M   1% /run/user/42
    tmpfs                          820M     0  820M   0% /run/user/1001

This is what vmstat -s gives:

     16391852 K total memory
      8774156 K used memory
        23140 K active memory
        94240 K inactive memory
      7357224 K free memory
            0 K buffer memory
       260472 K swap cache
      4079612 K total swap
       235604 K used swap
      3844008 K free swap
      1112900 non-nice user cpu ticks
          145 nice user cpu ticks
       233230 system cpu ticks
     35209299 idle cpu ticks
        37630 IO-wait cpu ticks
            0 IRQ cpu ticks
         4681 softirq cpu ticks
            0 stolen cpu ticks
     29324743 pages paged in
     71190650 pages paged out
       585651 pages swapped in
      9146113 pages swapped out
     44323896 interrupts
     51223100 CPU context switches
   1520846821 boot time
       203306 forks

Message in dmesg:

Out of memory: Kill process 7861 (python) score 522 or sacrifice child
Killed process 7861 (python) total-vm:12894796kB, anon-rss:6854196kB, file-rss:0kB

Hope this helps. Let me know if you need any other specific details.

lissyx · March 14, 2018, 12:57pm

So you have 8GB RAM, 4GB swap. Looks like it’s not enough, that’s all.
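
To read the `vmstat -s` figures in GiB rather than raw K values, a small Python sketch along these lines may help. It assumes the `<number> K <label>` layout shown in this thread and that K means 1024 bytes (the vmstat default). `parse_vmstat_s` and `to_gib` are illustrative names, not part of any tool.

```python
def parse_vmstat_s(text):
    """Return {label: value_in_KiB} for the 'K' lines of `vmstat -s` output."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Memory lines look like "16391852 K total memory"; tick and
        # counter lines have no "K" field and are skipped.
        if len(parts) >= 3 and parts[0].isdigit() and parts[1] == "K":
            values[" ".join(parts[2:])] = int(parts[0])
    return values


def to_gib(kib):
    """Convert a KiB count to GiB."""
    return kib / (1024 ** 2)


if __name__ == "__main__":
    # A few lines copied from the output posted in this thread.
    sample = """\
     16391852 K total memory
      7357224 K free memory
      4079612 K total swap
      3844008 K free swap
       203306 forks
"""
    for label, kib in parse_vmstat_s(sample).items():
        print(f"{label}: {to_gib(kib):.2f} GiB")
```

Free memory plus free swap gives a rough upper bound on what a new process can allocate before the OOM killer steps in.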

gr8nishan · March 14, 2018, 1:34pm

What configuration would be required to train the models?

lissyx · March 14, 2018, 1:37pm

That depends on the size of the model and how much memory you have free. Likely at least 16GB.

tanner · December 7, 2018, 4:47am

10 posts were split to a new topic: Error when training model
