Ideal length of training recordings

I have a question concerning the length of recordings suitable for training. What I (think I) understood from some previous discussions:

  • The recordings shouldn’t be too long (e.g. an hour) because that may be too demanding in terms of processing, and the learning algorithm is not designed for such cases anyway.
  • However, whether the length is 5 seconds, 16 seconds or 30 seconds does not really matter.
  • It should be possible to train DeepSpeech even on one-word utterances, which means it can be trained on recordings as short as e.g. 0.5 seconds.

To conclude, my impression is that there are no hard limits on either side - the upper boundary is given mostly by the fact that DeepSpeech works best on sentence-like recordings, and the lower limit just reflects the fact that the recordings should make some (linguistic) sense.
Am I right? Are there also other things I should consider when cutting my material?

Thank you very much.

Ideally you would use the same sort of material for training as for inference. As you don’t state what you are planning to do, it is hard to say what has worked in the past.

Generally you are right: length is not really a problem for the algorithm, but it is for GPU computing power. Ideally your material is evenly distributed within a smaller time span. The recently released dataset by FB has 44k hours of clips between 10 and 20 seconds, with a mean of 15. I usually recommend something between 5 and 10 seconds. I wouldn’t go under 1 second or over 30, as there have been a lot of posts from people struggling with that - but maybe the successful ones didn’t post :slight_smile:
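To make those bounds concrete, here is a minimal sketch (my own helper names, not part of DeepSpeech) that reads a WAV clip's duration with Python's standard library and checks it against the rough 1–30 second range discussed above:

```python
import wave

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def in_training_range(duration, min_s=1.0, max_s=30.0):
    """Heuristic bounds from this thread: roughly 1-30 s is workable,
    with 5-10 s being the sweet spot."""
    return min_s <= duration <= max_s
```

You could run this over your whole corpus to flag clips worth merging or splitting before training.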

Thank you, Olaf, I understand. My inference data will be quite variable (various people are making announcements - some last just two seconds or so, some can even exceed 30 s). So I guess I should try to train with similarly variable material, and maybe just split the recordings that go above 30 s (to be on the safe side).
I also wonder what hurts the algorithm more - having such variable recording lengths, or having a narrower time span but linguistically less coherent data (i.e. several sentences packed into one recording to make it longer, or incomplete sentences to make it shorter). But that is perhaps a topic for another thread…
Thank you once again!

It is just more economical to train with same-length data. So maybe merge the smaller recordings or split the longer sequences.

“Linguistically” is a matter for a custom language model. You can do whatever you want with it, so that would be no problem. The language model does not check grammar; it just captures which words often occur together.

Try some of your material with the current model and take it from there. Maybe fine-tune, or simply change the language model.

Understood. Thank you again for your help!