DeepSpeech training voice sample duration

I am working on Indian English and took voice samples of 9 to 10 sec. In my audio samples the speaker speaks very fast, which is why many words fall within each 10 sec voice file. Do you think that is too big for DeepSpeech training? If so, I will reduce the voice samples from 10 sec to 5 sec. Please review the voice transcript and let me know if each transcript line is too big for DeepSpeech training.

I am using 10 sec voice files with an NVIDIA GTX 1080 4 core GPU.

Training parameters are:
--train_batch_size 24
--test_batch_size 48
--n_hidden 2048
--epochs 3
--learning_rate 0.0001
--dropout_rate 0.2
--lm_alpha 0.75
--lm_beta 1.85

So can I assume this GPU setup and these training parameters are fine for my 10 sec voice samples? Or should I use 5 sec samples instead, to reduce the high number of words per sample?

Is that 8GB RAM or 11GB RAM?

Basically, maximum audio sample length depends on:

  • how much RAM your GPU has
  • how big your batches are
  • how you fit short / long audio together

If your audio samples are too long and you push too many of them into each batch, the batch will not fit in your GPU memory and you will get GPU OOM errors: the forum is full of people asking for help with that.
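To make the interaction between batch size and clip length concrete, here is a rough sketch. It assumes each batch is padded to its longest clip and that features are computed every 20 ms; both the step value and the function name are illustrative assumptions, not taken from the DeepSpeech code. It only counts feature frames, it does not predict actual memory use.

```python
FRAME_STEP_MS = 20  # assumed feature step; check your own configuration


def batch_frame_counts(durations_s, batch_size):
    """Padded frame count of each batch, taking clips in the given order.

    Every clip in a batch is padded to the longest clip in that batch,
    so one batch costs len(batch) * frames(longest clip).
    """
    counts = []
    for start in range(0, len(durations_s), batch_size):
        batch = durations_s[start:start + batch_size]
        longest_ms = round(max(batch) * 1000)
        longest_frames = -(-longest_ms // FRAME_STEP_MS)  # integer ceiling
        counts.append(len(batch) * longest_frames)
    return counts


durations = [2.0, 10.0, 3.0, 9.0, 2.5, 10.0, 3.5, 9.5]

# Short and long clips mixed: every batch is padded up to 10 s.
print(batch_frame_counts(durations, batch_size=4))

# Clips grouped by length: less padding overall, but the largest batch,
# which is the one that decides whether you hit OOM, is unchanged.
print(batch_frame_counts(sorted(durations), batch_size=4))
```

The largest value in the list is the one to watch: it scales with batch size times the longest clip, so halving either one roughly halves it.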

Our importers limit each sample to ~10-15 secs, depending on the case, so that you can achieve a good batch size on 8GB/11GB RAM GPUs.
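If you build your own dataset rather than using an importer, a similar length limit can be applied by hand. This is a minimal sketch using only Python's standard `wave` module; the 10 second cutoff and the function names are hypothetical, and it is not the importers' actual code.

```python
import wave

MAX_SECONDS = 10.0  # hypothetical cutoff; pick what fits your GPU


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of frames divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def keep_short_clips(paths, max_seconds=MAX_SECONDS):
    """Return only the paths whose audio is at most max_seconds long."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```

Calling `keep_short_clips(list_of_wav_paths)` before writing the training CSV drops the clips that exceed the cutoff.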

There’s no one-size-fits-all value, though; you need to experiment in your own case.
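One possible way to experiment early, before recording many hours of data, is to try a batch made of your longest clips at the batch size you intend to use. The sketch below assumes a training CSV with `wav_filename`, `wav_filesize` and `transcript` columns and uses file size as a stand-in for duration, which only holds if all files share the same sample rate and format. The function name is hypothetical.

```python
import csv


def longest_samples(csv_path, batch_size):
    """Return the batch_size rows with the largest wav_filesize."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Largest files first: for a fixed format these are the longest clips.
    rows.sort(key=lambda row: int(row["wav_filesize"]), reverse=True)
    return rows[:batch_size]
```

If a short training run over just these rows fits in GPU memory at your chosen batch size, batches of shorter clips should need no more.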

In my Indian English transcripts the speaker speaks very fast and pronounces 30 words on average in 10 sec, as shown in the snapshot above. So can I assume that is fine with my 11 GB GPU setup and a train/test batch size of 24/48?

Or should I continue with 10 sec clips until I hit a GPU OOM error? I want to be sure about that first. If I get an OOM error after building 100 hours of data, I will have to do a lot of rework to convert every 10 sec clip into 5 sec clips. I want to avoid that rework, which is why I want to be sure I am using an appropriate duration from the start.

I don’t understand your statement here: what’s the link between the pace of your speakers and the batch size?

So what I have understood is that there is no upper limit on the number of words per voice sample of 10/12 sec duration, as long as I do not get a GPU OOM error.

A slow speaker may speak 5 words in 10 sec, and a fast speaker may pronounce 40 words in 10 sec. DeepSpeech training will complete successfully in both cases, provided a GPU OOM error does not occur. Am I right?

Yes, we only care about the audio length, not the amount of words spoken.
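A small sketch of why that is: the number of feature frames fed to the network is a function of the number of audio samples only. The sample rate, window and step values below are assumptions for illustration (16 kHz, 32 ms window, 20 ms step); check them against your own configuration.

```python
SAMPLE_RATE = 16000  # assumed sample rate in Hz
WINDOW_MS = 32       # assumed feature window length
STEP_MS = 20         # assumed feature window step


def feature_frames(duration_s):
    """Number of full feature windows that fit in a clip of this length."""
    n_samples = round(duration_s * SAMPLE_RATE)
    window = SAMPLE_RATE * WINDOW_MS // 1000
    step = SAMPLE_RATE * STEP_MS // 1000
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // step


# (duration in seconds, words spoken): the word count never enters the
# calculation, so a slow and a fast speaker give the same input size.
slow_speaker = (10.0, 5)
fast_speaker = (10.0, 40)
print(feature_frames(slow_speaker[0]), feature_frames(fast_speaker[0]))
```

Under these assumptions a 10 sec clip gives 499 frames whether it holds 5 words or 40; only cutting the audio to 5 sec changes the input size (to 249 frames).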

That’s great, this is what I wanted to be clear on. Thanks a lot @lissyx