@kathyreid, FYI:
I finished coding the audio analysis. I made it an optional part of my import process, where I also (optionally) transcode the audio into 16 kHz mono mp3 format. I use pytorch[audio] with GPU, pyarrow, dask distributed with futures, and parquet for processing, and I save only the deltas between versions. I started with v3 and will work my way up.
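For the curious, the transcode step boils down to something like this (not my exact code, just the gist; the paths are placeholders and mp3 output needs a torchaudio backend built with an mp3 encoder, e.g. ffmpeg):

```python
import torchaudio
from dask.distributed import Client

TARGET_SR = 16000  # target sample rate for the transcoded clips


def transcode_clip(src_path, dst_path):
    """Load a clip, downmix to mono, resample to 16 kHz and save as mp3.
    Returns the clip duration in seconds."""
    waveform, sr = torchaudio.load(src_path)           # (channels, samples)
    mono = waveform.mean(dim=0, keepdim=True)          # downmix to mono
    if sr != TARGET_SR:
        mono = torchaudio.functional.resample(mono, sr, TARGET_SR)
    # mp3 writing depends on the installed torchaudio backend
    torchaudio.save(dst_path, mono, TARGET_SR, format="mp3")
    return mono.shape[1] / TARGET_SR


if __name__ == "__main__":
    client = Client()  # local dask cluster, one worker per core by default
    pairs = [("clips/a.mp3", "out/a.mp3"), ("clips/b.mp3", "out/b.mp3")]  # placeholder paths
    futures = [client.submit(transcode_clip, src, dst) for src, dst in pairs]
    durations = client.gather(futures)
```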
On my 6-core / 12-thread notebook (I’m at my summer house right now) I can import (with transcoding & analysis) 40-50 clips per second, depending on audio length. The whole process will take 140-160 hours of runtime on this machine (CPU at 100%). SileroVAD is the most costly part, as expected: about 60% of the processing time goes to it, but at least I can run 12 instances of it in parallel.
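Basic SileroVAD usage for the duration metrics looks roughly like this (simplified from the torch.hub example; the clip is assumed to already be 16 kHz mono from the transcode step):

```python
import torch
import torchaudio

# Load Silero VAD from torch.hub (downloads the model on first use)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, _, _, _) = utils

SR = 16000
wav, sr = torchaudio.load("out/a.mp3")   # placeholder path, already 16 kHz mono
wav = wav.squeeze(0)                     # the VAD expects a 1-D tensor

# List of {'start': ..., 'end': ...} sample indices for detected speech
speech_ts = get_speech_timestamps(wav, model, sampling_rate=SR)

duration = wav.shape[0] / SR
vad_duration = sum(seg["end"] - seg["start"] for seg in speech_ts) / SR
print(f"duration={duration:.2f}s vad_duration={vad_duration:.2f}s")
```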
I calculate duration, VAD duration, speech power, silence power, and an estimated SNR from the log10 of speech power / silence power (not exact science, but it works as a relative value). I reverted to the default SileroVAD settings for the sake of the SNR calculation.
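Given the VAD timestamps from above, the power/SNR estimate is essentially the following sketch (“power” here is just the mean squared amplitude over VAD vs. non-VAD samples; the epsilon and the empty-segment guard are illustrative details):

```python
import math
import torch


def estimate_snr(wav, speech_ts, eps=1e-10):
    """Estimate a relative SNR from Silero VAD speech timestamps.

    wav       : 1-D waveform tensor (16 kHz mono)
    speech_ts : list of {'start': int, 'end': int} sample indices from the VAD
    Returns (speech_power, silence_power, snr_estimate).
    """
    speech_mask = torch.zeros_like(wav, dtype=torch.bool)
    for seg in speech_ts:
        speech_mask[seg["start"]:seg["end"]] = True

    speech = wav[speech_mask]
    silence = wav[~speech_mask]
    if speech.numel() == 0 or silence.numel() == 0:
        return 0.0, 0.0, float("nan")   # empty or all-speech clips get no estimate

    speech_power = float((speech ** 2).mean())
    silence_power = float((silence ** 2).mean())
    # relative SNR estimate: log10 of the power ratio (not a calibrated dB figure)
    snr_est = math.log10(speech_power / (silence_power + eps))
    return speech_power, silence_power, snr_est
```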
When I checked the results, I found some recordings with 0 vad_duration, which are either completely empty or very faint whispering and thus not suitable for model training, as well as recordings with negative SNR, i.e. too much background noise.
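Since everything ends up in parquet, flagging those clips is a simple query, something like this (the file path and the snr_est column name are illustrative):

```python
import pandas as pd

df = pd.read_parquet("analysis/v3.parquet")       # illustrative path

empty_or_whisper = df[df["vad_duration"] == 0.0]  # no detected speech at all
noisy = df[df["snr_est"] < 0.0]                   # more "silence" power than speech power

print(len(empty_or_whisper), "clips with zero VAD duration")
print(len(noisy), "clips with negative estimated SNR")
```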
I also extract a full list of corrupt files in the datasets.
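Corrupt files are simply the ones that fail to decode, roughly:

```python
import torchaudio


def is_corrupt(path):
    """Return True if the clip cannot be decoded at all."""
    try:
        torchaudio.load(path)
        return False
    except Exception:
        return True


corrupt_files = [p for p in clip_paths if is_corrupt(p)]  # clip_paths: your list of clips
```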
Thank you for mentioning SileroVAD.
PS: I won’t release the clip-wise SNR values, as they might be used for TTS. I’ll release the bad recordings / clip list, with some statistics & visualization. I hope I can (i.e. my machine can) finish the analysis through v19.0…