I was able to successfully read such formats (and also extract MFCCs). My main concern is that the raw samples and MFCCs are not identical across all formats. This could be due to lossy compression and the like, but I would like to check what you think about it. Check the link below for an example.
If this difference is acceptable, then we could enable Mozilla STT to train on a wide variety of audio formats. Of course, we would need to handle the different dtypes (float and int), but I see this as a promising feature for loading audio data.
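Something like this minimal sketch is what I have in mind for the dtype handling (the helper name and the choice of normalizing by the dtype's max are mine, not anything from the STT code base):

```python
import numpy as np

def to_float32(samples: np.ndarray) -> np.ndarray:
    """Convert integer PCM samples to float32 in [-1.0, 1.0];
    pass float samples through with only a cast."""
    if np.issubdtype(samples.dtype, np.integer):
        # e.g. int16 -> divide by 32767 so the range maps into [-1, 1]
        return samples.astype(np.float32) / np.iinfo(samples.dtype).max
    return samples.astype(np.float32)
```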
Some experiments comparing how the same audio loads in different formats: experimentation
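In case you don't want to open the notebook, the gist of the experiment is something like this (a sketch using librosa; the file names and parameters here are placeholders, the exact values in the notebook may differ):

```python
import numpy as np
import librosa

paths = ["sample.ogg", "sample.wav", "sample.mp3", "sample.flac"]

mfccs = {}
for path in paths:
    # sr=None keeps each file's native sample rate instead of resampling
    y, sr = librosa.load(path, sr=None)
    mfccs[path] = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Compare each format against the .wav version, trimming to the shorter
# frame count since different decoders may pad the signal differently
ref = mfccs["sample.wav"]
for path, m in mfccs.items():
    n = min(ref.shape[1], m.shape[1])
    diff = np.abs(ref[:, :n] - m[:, :n]).max()
    print(f"{path}: max abs MFCC difference vs wav = {diff:.4f}")
```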
PS: the original audio was in .ogg; I converted it to the other formats using pydub's AudioSegment.
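Roughly like this (a sketch; file names are placeholders):

```python
from pydub import AudioSegment

# pydub shells out to ffmpeg/avconv to decode compressed formats
audio = AudioSegment.from_ogg("original.ogg")

for fmt in ("wav", "flac", "mp3"):
    audio.export(f"original.{fmt}", format=fmt)
```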
Transforming audio is a constant source of problems for users. Typically, I am very cautious about how and what I transform. I haven't checked your Colab in detail, but I would advise using just a few well-tested methods to convert your data into the typical PCM 16-bit 16 kHz before training; otherwise you could run into strange, hard-to-replicate runtime issues. You typically have the time in pre-processing, or do you have a special use case?
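For example, with pydub such a normalization pass could look like the sketch below (sox or ffmpeg invoked directly would be equally valid, well-tested options):

```python
from pydub import AudioSegment

seg = AudioSegment.from_file("input.ogg")

# Normalize to the typical training format: mono, 16 kHz, 16-bit PCM
seg = seg.set_channels(1).set_frame_rate(16000).set_sample_width(2)

seg.export("input_16k.wav", format="wav")
```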
We have a huge dataset stored in ogg format. Ogg offers a good compression ratio, so the smaller files are preferred for saving disk space, faster transmission over the internet, etc.
Whenever we want to train, we have to run a script that converts the audio to .wav (we convert to a sample rate of 44100 Hz because of some technical issues with streaming resampling, which are outside the scope of this topic). Besides the time this conversion takes, which is longer than training the network itself, the converted 44.1 kHz .wav files also take up far more space than the ogg files.
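Ideally we would decode the ogg files on the fly instead, which is what I was experimenting with above. For example, libsndfile (via the soundfile package) can read Ogg Vorbis directly, along the lines of this sketch (assuming the files are vorbis-encoded; the file name is a placeholder):

```python
import soundfile as sf

# Decode the ogg straight to float32 samples, no intermediate .wav on disk
data, samplerate = sf.read("clip.ogg", dtype="float32")
print(data.shape, samplerate)
```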