DeepSpeech training voice sample duration

halder.nayan35 · January 13, 2020, 5:03pm

I am working on Indian English. I took voice samples of 9 to 10 seconds. In my audio samples the speaker speaks very fast, which is why each 10-second file contains many words. Do you think that is too long for DeepSpeech training? If so, I will reduce the voice samples from 10 seconds to 5 seconds. Please review the voice transcript and let me know if each transcript line is too long for DeepSpeech training.

I am using 10-second voice files with an NVIDIA GTX 1080 4 core GPU.

Training parameters are:
--train_batch_size 24
--test_batch_size 48
--n_hidden 2048
--epochs 3
--learning_rate 0.0001
--dropout_rate 0.2
--lm_alpha 0.75
--lm_beta 1.85

So can I assume this GPU setup and these training parameters are fine for my 10-second voice samples? Or should I use 5-second samples instead of 10-second ones to reduce the high number of words per sample?

lissyx · January 13, 2020, 5:13pm

Is that 8GB RAM or 11GB RAM?

Basically, maximum audio sample length depends on:

how much RAM your GPU has
how big your batches are
how you fit short / long audio together

If your audio samples are too long and you push too much into each batch, it will not fit in your GPU memory and you will get GPU OOM errors: the forum is full of people asking for help with that.

Our importers limit each sample to ~10-15 secs, depending on each case, so that you can achieve a good batch size on 8GB/11GB RAM GPUs.

There’s no one-size-fits-all value, though, you need to experiment in your own case.
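To illustrate how batch size and the mix of short and long audio interact, here is a minimal sketch. It assumes every batch is padded to the length of its longest sample, which is a common approach but is not confirmed in this thread as DeepSpeech's exact behaviour. The function names `padded_seconds` and `largest_batch_seconds` are hypothetical, and the numbers count seconds of audio, not bytes of GPU memory.

```python
from typing import List


def padded_seconds(durations: List[float], batch_size: int, sort_first: bool) -> float:
    """Total seconds processed if each batch is padded to its longest sample.

    Assumption: padding to the longest sample in the batch (illustrative only).
    """
    ordered = sorted(durations) if sort_first else list(durations)
    total = 0.0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        # Every sample in the batch costs as much as the longest one.
        total += len(batch) * max(batch)
    return total


def largest_batch_seconds(durations: List[float], batch_size: int) -> float:
    """Padded size, in seconds, of the heaviest batch when samples are sorted by length."""
    worst = sorted(durations)[-batch_size:]
    return len(worst) * max(worst)


durations = [2.0, 10.0, 3.0, 9.0]  # made-up sample lengths in seconds
print(padded_seconds(durations, batch_size=2, sort_first=False))  # 38.0
print(padded_seconds(durations, batch_size=2, sort_first=True))   # 26.0
print(largest_batch_seconds(durations, batch_size=2))             # 20.0
```

Grouping samples of similar length wastes less padding overall, but the heaviest batch is still set by the longest samples times the batch size, so that is the batch to watch for OOM.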

halder.nayan35 · January 13, 2020, 5:24pm

In my Indian English voice transcript, the speaker speaks very fast and pronounces 30 words on average in 10 seconds, as shown in the snapshot above. So can I assume that is fine with my 11 GB GPU setup and a train/test batch size of 24/48?

Or should I continue with the 10-second duration until I get a GPU OOM error? I want to be sure about that first. If I get an OOM error after building 100 hours of data, I will have to do a lot of rework to convert each 10-second sample into 5-second samples. I want to avoid that rework, which is why I want to be sure I am using an appropriate duration and will not hit a GPU OOM error later.

lissyx · January 13, 2020, 5:27pm

I don’t understand your statement here, what’s the link between the pace of your speakers and the batch size?

halder.nayan35 · January 13, 2020, 5:35pm

So what I have understood is that there is no upper limit on words per voice sample for a 10/12-second duration, as long as I do not get a GPU OOM error.

A slow speaker may speak 5 words in 10 seconds, a fast speaker may pronounce 40 words in 10 seconds. DeepSpeech training will complete successfully in both cases, provided that a GPU OOM error does not happen. Am I right?

lissyx · January 13, 2020, 5:42pm

Yes, we only care about the audio length, not the amount of words spoken.
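Since only audio length matters, clip durations can be checked before building a large dataset. Below is a minimal sketch using Python's standard `wave` module, assuming uncompressed PCM WAV files. The helper names `wav_duration_seconds` and `find_long_samples`, and any threshold you pass in, are illustrative and not part of DeepSpeech.

```python
import wave
from typing import Iterable, List, Tuple


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def find_long_samples(paths: Iterable[str], max_seconds: float) -> List[Tuple[str, float]]:
    """List (path, duration) for every file longer than max_seconds.

    max_seconds is a threshold you choose by experiment; it is not a DeepSpeech constant.
    """
    long_samples = []
    for path in paths:
        duration = wav_duration_seconds(path)
        if duration > max_seconds:
            long_samples.append((path, duration))
    return long_samples
```

For example, `find_long_samples(my_wav_paths, 10.0)` would report the clips over 10 seconds, where `my_wav_paths` is your own list of file paths.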

halder.nayan35 · January 13, 2020, 5:44pm

That’s great, this is what I wanted to be clear on. Thanks a lot @lissyx
