I wanted to train my model on the full http://openslr.org/37/ dataset. I use Google Colab for training, so I ran the preprocess script from the util folder to build the feature caches.
The whole dataset contains ~200,000 files, and my training set contains ~160,000 of them (80% of the total data).
When I ran the preprocess script it did work, but preprocessing was taking a long time and using a lot of CPU, memory, and IO. It took almost 3-4 hours to process ~100,000 files (I was using `less -N +F` to monitor progress), but then my computer froze. I had to force-shut it down after another 2-3 hours, and no cache had been saved to disk.
How should I run the preprocess script so that it builds the feature caches efficiently and doesn't lose all progress if the machine goes down partway through?
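Ideally the cache would be written incrementally, one file at a time, so a crash only loses the file currently being processed and a re-run can resume where it left off. Something like the sketch below is what I have in mind; note that `extract_features`, the one-`.npy`-per-clip cache layout, and the directory names are my own assumptions, not the actual script's API:

```python
import os
import wave
from pathlib import Path

import numpy as np


def extract_features(wav_path: Path) -> np.ndarray:
    # Placeholder standing in for whatever the real preprocess script
    # computes (filterbanks/MFCCs etc.); here it just loads normalized PCM.
    with wave.open(str(wav_path), "rb") as w:
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0


def build_cache(wav_dir: str, cache_dir: str) -> None:
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).rglob("*.wav")):
        out_path = cache / (wav_path.stem + ".npy")
        if out_path.exists():
            continue  # already cached: a re-run resumes after a crash
        feats = extract_features(wav_path)
        tmp_path = out_path.with_suffix(".npy.tmp")
        with open(tmp_path, "wb") as f:
            np.save(f, feats)           # write to a temp file first...
        os.replace(tmp_path, out_path)  # ...then rename atomically, so the
                                        # cache never contains partial files


if __name__ == "__main__":
    # Hypothetical paths; the real dataset/cache locations would differ.
    build_cache("asr_bengali/data", "feature_cache")
```

With per-file caches like this I could also process the dataset in chunks and restart the machine between chunks without losing earlier work. Is there a supported way to get this behavior from the existing preprocess script, or a recommended alternative?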