Audio Data Issue: Empty Audio File in Common Voice Dataset

Hello,

I would like to report an issue with one of the audio files in the English dataset of the Common Voice project. The file is completely silent and contains no audio.

  • client_id : b71e4c5ed06f1dcf24e26ad42cfb93e8c4907ce587651b6b4003b3242ca7c1227dddf27bbec00de87c006f44b802db93c5e687f2ee4f9d8d48caa94cc61b603a
  • path : common_voice_en_37663663.mp3
  • sentence_id : d0942841f92f613ad99a0a7511d3b8bf9e238f2867c1d496b579f10311876f81

Please let me know if you need any additional information or if there is anything else I can do to assist in resolving this issue.

Thank you for your attention to this matter.

Best regards,
Wangzhenghang

Further to this, I suspect there are many empty audio files in the CV English dataset - and indeed in other languages. I have a dataset, which I’ll be releasing shortly, that contains SileroVAD voice activity detection timings (there are no timings if no speech is detected), and I will be able to report them - but it will need a few weeks.


@kathyreid, I also started coding audio analysis last week (VAD-based actual speech duration, SNR, energy/power, etc.) and researching the algorithms. I saw SileroVAD, but being CPU-only and limited to 1 core is quite a limitation.

Are there any other algorithms you considered, especially torchaudio- or signal-processing-based ones?

Ah, no, I have access to a GPU cluster so that’s what I was running it on …

@kathyreid, I mean, I read that SileroVAD runs on only 1 core on the CPU, with no GPU support. Thus I eliminated it from my list - maybe a mistake. I did see a num-threads setting, though.

Does it work on a GPU, or is there a GPU version that I missed? If so, that would be my best option.

They say it analyzes a 100 ms chunk in about 1 ms (provided you can feed it - e.g. no disk/RAM/CPU-GPU bandwidth bottlenecks). At that ~100x real-time rate, analyzing ~32k hours of MCV recordings would mean a 320-hour non-stop run, i.e. >13 days - for VAD only, and excluding the mp3 => wav 16 kHz transcoding overhead.
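
As a back-of-the-envelope check of that estimate (the ~32k-hour figure and the ~100x real-time throughput are taken from above; this is just the arithmetic):

audio_hours = 32_000   # assumed total MCV audio to analyze
speedup = 100          # 100 ms of audio processed per 1 ms, i.e. ~100x real-time
vad_hours = audio_hours / speedup
print(f"{vad_hours:.0f} hours = {vad_hours / 24:.1f} days")  # 320 hours = 13.3 days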

Yes, it is multi-threaded

import torch
torch.set_num_threads(230)

That’s the setting I used - my understanding is that SileroVAD uses torchaudio under the hood.
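
For anyone following along, a minimal sketch of that setup, assuming the torch.hub entry point from the SileroVAD README (the clip name is the one reported above; the utils tuple ordering and mp3 support depend on your SileroVAD/torchaudio versions):

import torch

torch.set_num_threads(4)  # intra-op CPU threads for this process; tune for your machine

# Load the model and helper functions published by the SileroVAD project (cached after first download).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio('common_voice_en_37663663.mp3', sampling_rate=16000)  # resampled to 16 kHz mono
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# An empty list means no speech was detected, i.e. a silent/empty clip.
print(timestamps)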

Yeah, but that multi-threading still uses only one core on the CPU. SileroVAD also seems to be incompatible with the GPU version of torchaudio. It’s perfect for some applications - interactive ones running on edge devices (e.g. mobile phones, which don’t have CUDA) - but IMO not ideal for bulk processing…

I think I can make it multi-processing + multi-threaded to boost performance - but I use the GPU version of torch, so I need a remedy for that.
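
Not my final pipeline, but one hedged way to get that multi-processing behaviour on CPU is one worker process per core, each loading its own SileroVAD instance (worker count and the file list below are placeholders):

from concurrent.futures import ProcessPoolExecutor
import torch

_model = None
_utils = None

def _init_worker():
    # Runs once per worker process: pin it to one thread and load its own model copy.
    global _model, _utils
    torch.set_num_threads(1)
    _model, _utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')

def vad_duration(path):
    # Return the detected speech duration (in seconds) for one clip.
    get_speech_timestamps, _, read_audio, _, _ = _utils
    wav = read_audio(path, sampling_rate=16000)
    ts = get_speech_timestamps(wav, _model, sampling_rate=16000)
    return path, sum(t['end'] - t['start'] for t in ts) / 16000

if __name__ == '__main__':
    clips = ['clip_0001.wav', 'clip_0002.wav']  # placeholder file list
    with ProcessPoolExecutor(max_workers=12, initializer=_init_worker) as ex:
        for path, seconds in ex.map(vad_duration, clips):
            print(path, seconds)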

Thank you @kathyreid @bozden. @84305424, thank you for bringing this issue up; I will also raise it with the team.


@kathyreid, I implemented it by tweaking the threshold (because of low-energy/whispering voices) and some duration parameters. I also tested the Vad transform in torchaudio, but it was not as good and needed two passes - flipping the Tensor - to find the silence at the end.
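
The two-pass torchaudio approach mentioned there looks roughly like this (file name is a placeholder; torchaudio’s Vad transform only strips leading silence, hence the flip):

import torch
import torchaudio

waveform, sr = torchaudio.load('clip.wav')          # shape: (channels, samples)
vad = torchaudio.transforms.Vad(sample_rate=sr)

trimmed_front = vad(waveform)                        # pass 1: remove leading silence
flipped = torch.flip(trimmed_front, dims=[1])        # reverse along the time axis
trimmed_both = torch.flip(vad(flipped), dims=[1])    # pass 2: remove (former) trailing silence

speech_seconds = trimmed_both.shape[-1] / sr
print(speech_seconds)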

I have two questions you might answer:

I’m testing/tuning it on the Fleurs dataset, as it is more controlled/predictable, and I tuned down the threshold a lot, but even then there are some non-detections due to low-amplitude voices.

  • Do you normalize the audio amplitude-wise before detection (e.g. a simple peak normalization, as sketched below)?
  • How did you tune/test it? By analyzing the waveforms?
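
On the first question, the kind of peak normalization I have in mind would be something like this (purely illustrative; the function name and eps guard are mine):

import torch

def peak_normalize(wav: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # Scale so the loudest sample reaches +/-1.0, to help low-amplitude/whispered clips
    # clear the VAD threshold; an all-zero (empty) clip is returned unchanged.
    peak = wav.abs().max()
    return wav / (peak + eps) if peak > 0 else wav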

Second question: As you know, SileroVAD returns a list of detections, so any silence during speech (e.g. breathing) splits the result into multiple segments, even though I increased the minimum silence duration from 250 ms to 1000 ms. I now have two options:

  1. Only take the first segment’s start and the last segment’s end to calculate the duration (which only removes the silences at the start and at the end)
  2. Sum the segment durations (which gives only the actual speech)

I implemented the second option, but maybe the first option is more logical. What did you use? What do you propose?
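
For concreteness, the two options map to something like this (assuming the default sample-based timestamps that get_speech_timestamps returns):

def span_duration(timestamps, sr=16000):
    # Option 1: first start to last end - only removes leading/trailing silence.
    return (timestamps[-1]['end'] - timestamps[0]['start']) / sr if timestamps else 0.0

def speech_duration(timestamps, sr=16000):
    # Option 2: sum of the detected segments - actual speech only.
    return sum(t['end'] - t['start'] for t in timestamps) / sr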

@kathyreid, FYI:

I finished coding the audio analysis. I made it an optional part of my import process, where I also (optionally) transcode the audio into 16 kHz mono mp3 format. I use pytorch[audio] with GPU, pyarrow, dask distributed with futures, and parquet for processing, and I save only the deltas between versions. I started with v3 now and will work upwards.
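
The transcode step is essentially a resample-to-16 kHz-mono conversion; a hypothetical stand-alone version of it with ffmpeg would look like this (paths and bitrate are placeholders, not my actual pipeline settings):

import subprocess

def transcode_to_16k_mono_mp3(src: str, dst: str) -> None:
    # -ac 1: mono, -ar 16000: 16 kHz sample rate, -b:a 48k: placeholder bitrate
    subprocess.run(
        ['ffmpeg', '-y', '-i', src, '-ac', '1', '-ar', '16000', '-b:a', '48k', dst],
        check=True, capture_output=True,
    )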

On my 6-core / 12-thread notebook (I’m at my summer house now) I can import (with transcoding & analysis) 40-50 clips per second (depending on audio length). The whole process will take 140-160 hours of runtime on this machine (CPU at 100%). SileroVAD is the most costly part, as expected - about 60% of the time is spent on it - but at least I run 12 instances of it in parallel.

I calculate duration, VAD duration, speech power, silence power and an estimated SNR from log10 of speech power/silence power (not exact science but it is a relative value). I reverted to default settings in SileroVAD for the sake of SNR calculation.
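
Roughly, that SNR estimate is the power over the VAD speech regions against the power over the rest of the clip; a sketch of it (the 10·log10 dB scaling and the eps guard are my additions):

import torch

def estimate_snr_db(wav: torch.Tensor, timestamps, eps: float = 1e-12) -> float:
    # wav: 1-D float tensor at 16 kHz; timestamps: SileroVAD dicts with 'start'/'end' in samples.
    mask = torch.zeros_like(wav, dtype=torch.bool)
    for t in timestamps:
        mask[t['start']:t['end']] = True
    speech_power = (wav[mask] ** 2).mean() if mask.any() else torch.tensor(0.0)
    silence_power = (wav[~mask] ** 2).mean() if (~mask).any() else torch.tensor(eps)
    return float(10 * torch.log10((speech_power + eps) / (silence_power + eps)))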

When I checked the results, I found some recordings with 0 vad_duration - either completely empty or very low whispering - which are not suitable for model training, as well as recordings with negative SNR, i.e. too much background noise.

I also extract a full list of files which are corrupt in the datasets.

Thank you for mentioning SileroVAD.

PS: I’ll not release the clip-wise SNR values, as they might be used for TTS. I’ll release the bad-recording and clip lists, with some statistics & visualizations. I hope I can (i.e. my machine can) finish the analysis up to v19.0…


Thank you so much, @bozden for all your hard work on this.

I have a dataset with just the VAD durations in it; it’s not publicly released yet, but I should be able to confirm / validate your findings, at least on the empty / high-SNR files.


That would be great, thank you!

I hope your SileroVAD settings are not very different from the defaults, so that we can compare.


Thank you so much for this, I’ll be so excited to review this once it’s complete and we massively appreciate you taking the time to run this.


Hey @kathyreid, @jesslynnrose, I released the statistics and “bad-audio” (subjective) listings. You can read more on it here.

@kathyreid, I can send you detailed records to compare with your results if you PM me your dataset details (language/version). I would need to extract them; the .tsv file is 4 GB in size for all languages.