I am trying to use tensorflow_io (https://www.tensorflow.org/io/api_docs/python/tfio/audio) to read other audio formats, such as .mp3 and .ogg, directly with TensorFlow ops.
I was able to read these formats successfully (and also extract MFCCs). My main concern is that the decoded samples and the MFCCs are not identical across formats. This may be due to lossy compression, but I would like to hear what you think. Check the link below for an example.
If this difference is acceptable, then we could make Mozilla STT train on a wide variety of audio formats. Of course, we would need to handle the different dtypes (float and int), but I see this as a promising feature for loading audio data.
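To put a number on how far apart two decodings are, here is a minimal sketch in pure Python (no TensorFlow needed). The helper names `to_float` and `snr_db` are my own and not part of tensorflow_io. It assumes int samples are int16 scaled by 32768, and that the two signals are already time-aligned, which may not hold for MP3 because encoders usually add some delay and padding.

```python
import math

def to_float(samples, dtype):
    """Bring samples to floats in [-1.0, 1.0) so int and float decodes are comparable.

    Assumption: integer audio is int16, so it is scaled by 32768.
    """
    if dtype == "int16":
        return [s / 32768.0 for s in samples]
    if dtype == "float32":
        return [float(s) for s in samples]
    raise ValueError(f"unsupported dtype: {dtype}")

def snr_db(reference, decoded):
    """Signal-to-noise ratio (dB) of `decoded` measured against `reference`."""
    n = min(len(reference), len(decoded))  # compare only the overlapping part
    signal = sum(r * r for r in reference[:n])
    noise = sum((r - d) ** 2 for r, d in zip(reference[:n], decoded[:n]))
    if noise == 0.0:
        return math.inf   # identical signals
    if signal == 0.0:
        return -math.inf  # silent reference, non-silent difference
    return 10.0 * math.log10(signal / noise)

# Synthetic stand-in for "same audio, two formats": a float sine wave
# and an int16-quantized copy of it.
reference = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
as_int16 = [int(round(s * 32768)) for s in reference]
decoded = to_float(as_int16, "int16")

print(f"SNR of int16 copy vs. float original: {snr_db(reference, decoded):.1f} dB")
```

With real files, `reference` and `decoded` would be the sample arrays loaded from each format. A high SNR suggests the difference is mostly quantization or mild codec noise, while a very low one often points to misalignment or a scaling mismatch rather than the codec itself.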
Some experiments comparing the loading of the same audio with different formats: experimentation
PS: the original audio was in .ogg; I converted it to the other formats using pydub’s AudioSegment.