Is it compulsory to have training and inferring audio file length equal to 5 seconds?
I have this question because I have a large amount of training data with audio (each file is more than 30 seconds) and respective transcripts. If I can’t use this data as it is for training, then I need to chunk the audio files (which I can do easily with some Python script), but I am finding it difficult to chunk the transcripts to match the chunked audio files. I am doing it manually for now, but is there any way to automate it?
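For reference, the audio side of the chunking can be sketched with nothing but Python's standard-library `wave` module. This is a minimal illustration assuming 16-bit PCM WAV input; the function name and output naming scheme are just placeholders. It does not solve the transcript side, since fixed-length cuts won't respect word boundaries; automating that part usually requires a forced-alignment tool.

```python
import math
import wave

def split_wav(path, chunk_seconds=5.0, out_prefix="chunk"):
    """Split a PCM WAV file into fixed-length chunks.

    The last chunk may be shorter than chunk_seconds.
    Returns the list of paths to the chunk files written.
    """
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        n_chunks = math.ceil(params.nframes / frames_per_chunk)
        out_paths = []
        for i in range(n_chunks):
            frames = src.readframes(frames_per_chunk)
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channel count, sample width, and rate; the wave
                # module fixes up the frame count in the header on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths
```

So a 30-second file split with `chunk_seconds=5.0` yields six 5-second chunks.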
@meghagowda5193 Having training and inference audio files of exactly 5 seconds is not compulsory. With the current architecture, shorter clips simply lessen the memory pressure on the GPU, if you are training on one.
However, 30 seconds may be too much for your GPU’s memory. You simply have to try. To give you a feel for scale, we had batches of size 12 or so using audio with lengths around 5 seconds on an 11GB GPU. So, on an 11GB GPU we can fit about 60 seconds of audio per batch with the current model when training.
However, there are other issues to consider, for example the finite horizon an RNN operates under, i.e. the forward or backward RNN of the BRNN may not be able to transfer information across the entire 30 seconds of audio, and thus performance may decrease as a result. But it’s worth a try.
As before, there is no hard limit, but increasing the length of the training data snippets will increase memory pressure on your GPU and run into the problem that the RNN may not be able to transfer information across the entire N seconds of audio.