I was able to successfully read such formats (and also extract MFCCs). My main concern is that the raw samples and MFCCs are not identical across all formats. This could be due to lossy compression and the like, but I would like to check what you think about it. Check the link below for an example.
If this difference is acceptable, then we could enable Mozilla STT to train on a wide variety of audio formats. Of course, we would need to handle the different dtypes (float and int), but I see this as a promising feature for loading audio data.
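Something like this minimal sketch is what I have in mind for the dtype handling (the helper name and the choice of normalizing by the dtype's max are mine, not anything from the STT code base):

```python
import numpy as np

def to_float32(samples: np.ndarray) -> np.ndarray:
    """Convert integer PCM samples to float32 in [-1.0, 1.0];
    pass float samples through with only a cast."""
    if np.issubdtype(samples.dtype, np.integer):
        # e.g. int16 -> divide by 32767 so the range maps into [-1, 1]
        return samples.astype(np.float32) / np.iinfo(samples.dtype).max
    return samples.astype(np.float32)
```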
Some experiments comparing how the same audio loads in different formats: experimentation
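In case you don't want to open the notebook, the gist of the experiment is something like this (a sketch using librosa; the file names and parameters here are placeholders, the exact values in the notebook may differ):

```python
import numpy as np
import librosa

paths = ["sample.ogg", "sample.wav", "sample.mp3", "sample.flac"]

mfccs = {}
for path in paths:
    # sr=None keeps each file's native sample rate instead of resampling
    y, sr = librosa.load(path, sr=None)
    mfccs[path] = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Compare each format against the .wav version, trimming to the shorter
# frame count since different decoders may pad the signal differently
ref = mfccs["sample.wav"]
for path, m in mfccs.items():
    n = min(ref.shape[1], m.shape[1])
    diff = np.abs(ref[:, :n] - m[:, :n]).max()
    print(f"{path}: max abs MFCC difference vs wav = {diff:.4f}")
```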
PS: the original audio was in .ogg; I converted it to the other formats using pydub's AudioSegment.
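Roughly like this (a sketch; file names are placeholders):

```python
from pydub import AudioSegment

# pydub shells out to ffmpeg/avconv to decode compressed formats
audio = AudioSegment.from_ogg("original.ogg")

for fmt in ("wav", "flac", "mp3"):
    audio.export(f"original.{fmt}", format=fmt)
```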
Transforming audio is a constant source of problems for users. Typically, I am very cautious about how and what I transform. I haven't checked your Colab in detail, but I would advise using just a few well-tested methods to convert your data into the typical PCM 16-bit 16 kHz before training; otherwise you could run into strange, hard-to-replicate runtime issues. You typically have the time in pre-processing, or do you have a special use case?
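For example, with pydub such a normalization pass could look like the sketch below (sox or ffmpeg invoked directly would be equally valid, well-tested options):

```python
from pydub import AudioSegment

seg = AudioSegment.from_file("input.ogg")

# Normalize to the typical training format: mono, 16 kHz, 16-bit PCM
seg = seg.set_channels(1).set_frame_rate(16000).set_sample_width(2)

seg.export("input_16k.wav", format="wav")
```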
We have a huge dataset stored in ogg format. Ogg offers a good compression ratio, so the smaller files are preferred for saving disk space, faster transmission over the internet, etc.
Whenever we want to train, we have to run a script that converts the audio to .wav (we convert to a sample rate of 44100 Hz because of some technical issues with streaming resampling, which are outside the scope of this topic). Besides the time this conversion takes, which is longer than training the network itself, the converted 44.1 kHz .wav files also take up far more space than the ogg files.
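Ideally we would decode the ogg files on the fly instead, which is what I was experimenting with above. For example, libsndfile (via the soundfile package) can read Ogg Vorbis directly, along the lines of this sketch (assuming the files are vorbis-encoded; the file name is a placeholder):

```python
import soundfile as sf

# Decode the ogg straight to float32 samples, no intermediate .wav on disk
data, samplerate = sf.read("clip.ogg", dtype="float32")
print(data.shape, samplerate)
```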