I’d like to train a model on the German-language CommonVoice data (and publish it for free, too). I have access to a supercomputer with many GPUs, but I do not have root rights. I already created a Singularity container, https://github.com/NormanTUD/TrainDeepSpeechGerman , with which I can train DeepSpeech. But training takes a really long time.
The supercomputer consists of many nodes, each equipped with good GPUs.
Is there any way to take advantage of this? Can I easily train DeepSpeech in parallel, so that results from one node affect the results on the other nodes?
We use Slurm as the batch management system, if that is important. If any more information would help, just ask and I’ll try to provide it. I am using DeepSpeech 0.6.1 (the latest release on GitHub).
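For reference, here is the kind of Slurm job script I am currently using, in case it helps. This is only a sketch: the partition name, GPU count, container file name (deepspeech.sif), and data paths are placeholders specific to my setup, not anything standard.

```shell
#!/bin/bash
#SBATCH --job-name=deepspeech-de
#SBATCH --nodes=1            # single node so far; multi-node is what I'm asking about
#SBATCH --gres=gpu:4         # assumption: 4 GPUs per node
#SBATCH --time=48:00:00
#SBATCH --partition=gpu      # assumption: partition name on my cluster

# Run DeepSpeech 0.6.1 inside the Singularity container with GPU support (--nv).
# deepspeech.sif and the CSV paths are placeholders for my actual files.
singularity exec --nv deepspeech.sif \
  python3 DeepSpeech.py \
    --train_files clips/train.csv \
    --dev_files clips/dev.csv \
    --test_files clips/test.csv \
    --train_batch_size 24
```

As far as I understand, DeepSpeech uses all GPUs visible on a single node, but I don’t know how (or whether) this can be extended across nodes.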
It would be great to make this work and publish a good German language model. If it works out, I could also train and release models for other languages.