Ideal length of training recordings

I have a question concerning the length of recordings suitable for training. What I (think I) understood from some previous discussions:

  • The recordings shouldn’t be too long (e.g. an hour) because that may be too demanding in terms of processing, and the learning algorithm is not designed for such cases anyway.
  • However, whether the length is 5 seconds, 16 seconds or 30 seconds does not really matter.
  • It should be possible to train DeepSpeech even on one-word utterances, which means it can be trained on recordings as short as e.g. 0.5 seconds.

To conclude, my impression is that there are no hard limits on either side - the upper boundary is given mostly by the fact that DeepSpeech works best on sentence-like recordings, and the lower limit just reflects the fact that the recordings should make some (linguistic) sense.
Am I right? Are there also other things I should consider when cutting my material?

Thank you very much.

Ideally you would use the same sort of material for training as for inference. As you don’t state what you are planning to do, it is hard to say what has worked in the past.

Generally you are right: length is not really a problem for the algorithm, but it is for GPU computing power. Ideally your material is evenly distributed within a smaller time span. The recently released dataset by FB has 44k hours of clips between 10 and 20 seconds, with a mean of 15. I usually recommend something between 5 and 10 seconds. I wouldn’t go under 1 second or over 30, as there have been a lot of posts from people struggling with that - but maybe the successful ones didn’t post :slight_smile:
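To make those bounds concrete, here is a minimal sketch (my own helper names, not part of DeepSpeech) that reads a WAV clip's duration with Python's standard library and checks it against the rough 1–30 second range discussed above:

```python
import wave

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def in_training_range(duration, min_s=1.0, max_s=30.0):
    """Heuristic bounds from this thread: roughly 1-30 s is workable,
    with 5-10 s being the sweet spot."""
    return min_s <= duration <= max_s
```

You could run this over your whole corpus to flag clips worth merging or splitting before training.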

Thank you, Olaf, I understand. My inference data will be quite variable (various people are making announcements - some last just two seconds or so, some can even exceed 30 s). So I guess I should try to train with similarly variable material, and maybe just split the recordings that go above 30 s (to be on the safe side).
I also wonder what hurts the algorithm more - having such variable recording lengths, or having a narrower time span but linguistically less coherent data (i.e. several sentences packed into one recording to make it longer, or incomplete sentences to make it shorter). But that is perhaps a topic for another thread…
Thank you once again!

It is just more economical to train with same-length data. So maybe merge the smaller recordings or split the longer sequences.

“Linguistically” is a matter for a custom language model. You can do whatever you want with it, so that would be no problem. The language model does not check grammar; it just captures which words often occur together.

Try some of your material with the current model and take it from there. Maybe fine-tune, or simply change the language model.

Understood. Thank you again for your help!