I have a supercomputer at my disposal, how can I train DeepSpeech in parallel?

I’d like to train a model on the German-language Common Voice data (and publish it for free, too). For this, I have a supercomputer at my disposal with many GPUs, but I do not have root rights. I already created a Singularity container, https://github.com/NormanTUD/TrainDeepSpeechGerman , with which I can train DeepSpeech. But it takes a really long time.

The supercomputer consists of many nodes, each with good GPUs.

Is there any way to utilize this? Can I easily train DeepSpeech in parallel, so that results from one node affect the results on the other nodes?

We use Slurm as the batch management system, if that is important. If any more information would help, just ask and I’ll try to provide it. Also, I use DeepSpeech 0.6.1 (the latest release from GitHub).

It would be great to make use of this and create a good German language model for publication. If this works, I could also create and release models for other languages.

It looks like you’re not using high batch sizes. Try 64 or 128 for train and dev. This should speed things up a lot :slight_smile:

And enable cudnn_rnn and maybe automatic_mixed_precision.

And I would go with current master. A model trained on that could easily be used for transfer learning.
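
Roughly, such a training call could look like the following. This is only a sketch: the data paths are placeholders, and you should check the exact flag names against the flags documented for your DeepSpeech release.

python3 DeepSpeech.py \
  --train_files data/cv-de/train.csv \
  --dev_files data/cv-de/dev.csv \
  --test_files data/cv-de/test.csv \
  --train_batch_size 64 \
  --dev_batch_size 64 \
  --test_batch_size 64 \
  --use_cudnn_rnn \
  --automatic_mixed_precision \
  --checkpoint_dir checkpoints/de \
  --export_dir models/de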

If you need more material, look at this repo: https://github.com/AASHISHAG/deepspeech-german

Did you check that you’re actually using the GPUs? I’m also trying to train with Slurm on a DGX, and I found that TensorFlow did not find my GPUs after converting my Docker image (built from Mozilla’s Dockerfile) to a Singularity image.

Test it by calling: python3 -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"
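
If you just want a yes/no answer, tf.test.is_gpu_available() from the TensorFlow 1.x API should print True only when a CUDA device is actually visible:

python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"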

You can find my full Slurm setups here.
And thanks for the link to your repo, it helped me solve another error I had :)

I already trained some German models with DeepSpeech on my standard PC; you can find the training instructions and my results at: https://github.com/DanBmh/deepspeech-german


Thanks to both of you for your replies and sorry for the delay on my part. I’ve restarted the job with these parameters:
TRAINBATCHSIZE = 128
TESTBATCHSIZE = 256
DEVBATCHSIZE = 128
and it’s been running for 2 days now. The loss starts very high (around 273), gets down to 141 after 1 day and 1 hour, and then gets worse again. Now it’s at 150. What can I do about this? I’ve added the option --early_stop true, but that doesn’t seem to do anything.

I’ve tried to run the code you suggested for checking whether the GPUs are utilized; it returns this:

srun --gres=gpu:1 --time="00:05:00" singularity exec image/ds.img python3 -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"
srun: job 19406794 queued and waiting for resources
srun: job 19406794 has been allocated resources
2020-03-29 18:10:32.007729: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100205000 Hz
2020-03-29 18:10:32.008090: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5b6eb00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-29 18:10:32.008106: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-03-29 18:10:32.010862: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /.singularity.d/libs
2020-03-29 18:10:32.010882: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-03-29 18:10:32.010923: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: HOSTNAME
2020-03-29 18:10:32.010935: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: HOSTNAME
2020-03-29 18:10:32.010979: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2020-03-29 18:10:32.011029: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2020-03-29 18:10:32.011183: I tensorflow/core/common_runtime/direct_session.cc:359] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Does that mean it gets to use the GPU? Or is “libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program” a problem?

What can I do about this?

On a decent GPU server, the whole German Common Voice set takes about 3 hours to run into early stop, or about 30 minutes per epoch. Get cudnn_rnn to work and check your CUDA and cuDNN versions against the requirements. The train and dev batch sizes are the important ones, as the test set is rather small and runs only once, but yours look OK.

Get CUDA to work; running on CPUs is pointless for that dataset when you could use GPUs.
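
One likely cause of the libcuda.so.1 error in your log is that the container was started without GPU support. Assuming your site’s Singularity supports it, the --nv flag binds the host NVIDIA driver libraries (including libcuda.so.1) into the container, so the same check would become:

srun --gres=gpu:1 --time="00:05:00" singularity exec --nv image/ds.img python3 -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"

If that works, the device mapping at the end of the log should list a GPU device instead of only XLA_CPU.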

What WER did you get with your training?

Nice stats on github Dan :slight_smile:

Common Voice alone will get you to a WER of about 0.15 by itself, but it depends on what you want to use your model for.