Audio Data Issue: Empty Audio File in Common Voice Dataset

Hello,

I would like to report an issue with one of the audio files in the English dataset of the Common Voice project. The file is completely silent and contains no audio.

  • client_id : b71e4c5ed06f1dcf24e26ad42cfb93e8c4907ce587651b6b4003b3242ca7c1227dddf27bbec00de87c006f44b802db93c5e687f2ee4f9d8d48caa94cc61b603a
  • path : common_voice_en_37663663.mp3
  • sentence_id : d0942841f92f613ad99a0a7511d3b8bf9e238f2867c1d496b579f10311876f81

Please let me know if you need any additional information or if there is anything else I can do to assist in resolving this issue.

Thank you for your attention to this matter.

Best regards,
Wangzhenghang

Further to this, I suspect there are many empty audio files in the CV English dataset, and indeed in other languages. I have a dataset, which I'll be releasing shortly, that includes SileroVAD voice activity detection timings (there are no timings if no speech is detected), so I will be able to report them, but I will need a few weeks.

@kathyreid, I also started coding audio analysis last week (VAD-based actual speaking speed, SNR, energy/power, etc.) and researching the algorithms. I saw SileroVAD, but being CPU-only and limited to 1 core is quite a limitation.
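For concreteness, a minimal sketch of the energy/power part (the helper name is mine; SNR would additionally need a noise estimate, e.g. from the non-speech regions a VAD gives you):

import torch

def energy_and_power(wav: torch.Tensor):
    # energy = sum of squared samples, power = mean squared sample
    squared = wav.float() ** 2
    return squared.sum().item(), squared.mean().item()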

Are there any other algorithms you considered, especially torchaudio and signal-processing based ones?

Ah, no, I have access to a GPU cluster so that’s what I was running it on …

@kathyreid, I mean, I read that SileroVAD works only on 1 core on a CPU, with no GPU support. Thus I eliminated it from my list; maybe that was a mistake. I did see a num-threads setting, though.

Does it work on a GPU, or is there a version of it that I missed? If so, that would be my best option.

They say it analyzes a 100 ms chunk in about 1 ms (provided that you can feed it fast enough, i.e. no disk/RAM/CPU-GPU bandwidth bottlenecks). To analyze ~32k hours of MCV recordings, that would mean a 320-hour non-stop run, >13 days, for VAD alone and excluding the mp3 => wav 16 kHz transcoding overhead.
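A quick sanity check of that arithmetic (numbers taken from the estimate above):

# ~100 ms of audio analyzed per ~1 ms of compute => ~100x real-time on 1 core
audio_hours = 32_000           # rough total hours of MCV recordings
run_hours = audio_hours / 100  # -> 320.0
print(run_hours, run_hours / 24)  # 320 hours, ~13.3 days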

Yes, it is multi-threaded:

import torch

# allow PyTorch to use up to 230 intra-op CPU threads
torch.set_num_threads(230)

That’s the setting I used - my understanding is that SileroVAD uses torchaudio under the hood.
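Roughly, running it on a single file looks like this (a sketch using the torch.hub entry point from the Silero VAD repo; an empty timestamp list would flag a silent clip like the one reported above):

import torch

torch.set_num_threads(4)  # adjust to your machine
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio('common_voice_en_37663663.mp3', sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
if not timestamps:
    print('no speech detected')  # candidate empty recording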

Yeah, but even multi-threaded it uses only one core on the CPU. SileroVAD also seems to be incompatible with the GPU version of torchaudio. Perfect for some applications, i.e. interactive ones working on edge devices (e.g. mobile phones do not have CUDA), but IMO not ideal for bulk processing…

I think I can make it multi-processing + multi-threaded to boost performance; though since I use the GPU version of torch, I need a workaround for that.
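A rough sketch of what I have in mind, one worker process per core, each loading its own model instance (the clips/ path is just a placeholder):

import glob
import multiprocessing as mp
import torch

def init_worker():
    global model, get_speech_timestamps, read_audio
    torch.set_num_threads(1)  # parallelism comes from processes, not threads
    model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
    get_speech_timestamps, _, read_audio, _, _ = utils

def vad_file(path):
    wav = read_audio(path, sampling_rate=16000)
    return path, get_speech_timestamps(wav, model, sampling_rate=16000)

if __name__ == '__main__':
    files = sorted(glob.glob('clips/*.mp3'))
    with mp.Pool(mp.cpu_count(), initializer=init_worker) as pool:
        for path, timestamps in pool.imap_unordered(vad_file, files):
            if not timestamps:
                print(path, 'no speech detected')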

Thank you @kathyreid @bozden. @84305424, thank you for bringing this issue up; I will also bring it up with the team.

@kathyreid, I implemented it by tweaking the threshold (due to low-energy/whispering voices) and some duration parameters. I tested the Vad transform in torchaudio, but it was not as good and needed two passes, flipping the Tensor, to find the silence at the end.
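The two-pass torchaudio trim I mean looks roughly like this (torchaudio's Vad only trims leading silence, hence the flips):

import torch
import torchaudio

def trim_both_ends(wav: torch.Tensor, sample_rate: int) -> torch.Tensor:
    vad = torchaudio.transforms.Vad(sample_rate=sample_rate)
    wav = vad(wav)                    # pass 1: trim leading silence
    wav = vad(torch.flip(wav, [-1]))  # pass 2: flip and trim the former tail
    return torch.flip(wav, [-1])      # restore original orientation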

I have two questions you might answer:

I'm testing/tuning it on the Fleurs dataset as it is more controlled/predictable. I tuned the threshold down a lot, but even then there are some missed detections due to low-amplitude voices.

  • Do you normalize the audio amplitude-wise before detection? (see the sketch after this list)
  • How did you tune/test it? By analyzing the waveforms?
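By normalization I mean something like this simple peak-normalization step applied before the VAD (the helper name is mine):

import torch

def peak_normalize(wav: torch.Tensor, target_peak: float = 0.95) -> torch.Tensor:
    peak = wav.abs().max()
    return wav * (target_peak / peak) if peak > 0 else wav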

Second question: As you know, SileroVAD returns a list of detections, so silences during speech (e.g. breathing) split the result into multiple segments, even after I increased the minimum silence duration from 250 ms to 1000 ms. I now have two options:

  1. Only take the first segment's start and the last segment's end to calculate the duration (which removes only the silences at the start and at the end)
  2. Sum the segment durations (which gives only the actual speech)

I implemented the second option, but maybe the first option is more logical. What did you use? What do you propose?
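For clarity, on SileroVAD-style timestamps (a list of {'start': ..., 'end': ...} dicts in samples, here at 16 kHz) the two options would be roughly:

SR = 16000

def trimmed_duration(timestamps):
    # Option 1: first start to last end (strips only leading/trailing silence)
    return (timestamps[-1]['end'] - timestamps[0]['start']) / SR if timestamps else 0.0

def speech_duration(timestamps):
    # Option 2: sum of segment durations (actual speech only)
    return sum(t['end'] - t['start'] for t in timestamps) / SR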