2 Channel wav files in CV data

  • Training or just running inference - Training
  • Mozilla STT branch/version - 0.8.2
  • OS Platform and Distribution - Red Hat Enterprise Linux Server 7.8 (Maipo)
  • Python version - 3.6.8
  • TensorFlow version - 1.15.2

I’ve been working with the Russian Common Voice data, imported with import_cv2.py, and during validation DeepSpeech just stopped. No errors, no new checkpoints; it just hung. After a lot of digging I discovered it was because some of the wav files had 2 channels, which failed the assert check when converting PCM to NumPy in audio.py. The really weird part is that all of them were in the dev set, none in train or test. I wrote a bash script to identify the wav files with multiple channels; happy to provide it if you’d like.
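For reference, the same check the bash script performs can be done with Python's standard wave module. This is a stand-in sketch, not the actual script mentioned above, and it assumes the wav files sit somewhere under one directory tree:

```python
import os
import wave


def find_multichannel_wavs(root):
    """Walk a directory tree and return paths of wav files with more
    than one channel (the files that trip the assert in audio.py)."""
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            with wave.open(path, "rb") as wav:
                if wav.getnchannels() > 1:
                    bad.append(path)
    return sorted(bad)
```

Running it over the import output directory before training would flag the offending files up front instead of letting validation hang.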

Looking through import_cv2 I didn’t see any channel conversion, unless I missed it. So if that script is supposed to be converting 2-channel audio files into 1, something is going wrong in there. The biggest issue, though, imo, is that it fails silently. While I’ve figured it out for my own use, I can see others running into this, especially since I’m using Common Voice data, so I thought you’d want to know.

Let me know if you need any more info or if I can help in any other way.

Thanks for pointing that out and yes, there are things to improve.

  1. If it is not long, simply post the list as preformatted text here, so others may use it in the future. Due to the changes at Mozilla I am unsure whether the Common Voice team will improve their systems. Audiomate has a list of invalid German CV files; maybe open a PR there if you use several data sources?

  2. Do you have an idea how to fix the import script to avoid that in the future?
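One possible shape for such a fix, sketched here with the standard wave module and NumPy, would be to downmix any multi-channel file to mono at import time. This assumes 16-bit PCM (what DeepSpeech expects) and is only an illustration; the real import_cv2.py pipeline might instead do the downmix in the sox transcoding step:

```python
import wave

import numpy as np


def downmix_to_mono(src_path, dst_path):
    """Rewrite a wav file as mono by averaging its channels.

    Assumes 16-bit PCM samples; mono input is copied through unchanged.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    if params.nchannels == 1:
        data = frames
    else:
        # Interleaved samples -> (nframes, nchannels), then average
        # across channels and cast back to 16-bit.
        samples = np.frombuffer(frames, dtype=np.int16)
        samples = samples.reshape(-1, params.nchannels)
        data = samples.mean(axis=1).astype(np.int16).tobytes()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(data)
```

Calling this (or an equivalent sox invocation) on every file the importer writes would guarantee the training code never sees stereo data.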

This should not happen; Common Voice is not supposed to release stereo data. It happened on some datasets they shipped in the latest release, please file a bug with them.

On their git? Can do.


Yes, please. With as many details as you can, like the filenames involved.

Done. Besides the bad data getting through CV, the other issue I was having was the error not being raised and the process hanging, but maybe that’s just the server I was on? Hope so. As a potential (non-priority) feature, it would be nice for error logging to report which audio file causes issues when they happen; right now the best you get is what step you were on.
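As a sketch of that feature request, a thin wrapper could attach the filename to any failure before re-raising. The names here (`load_fn`, `load_with_context`) are hypothetical, standing in for whatever routine in audio.py does the actual pcm-to-numpy conversion, not the real DeepSpeech API:

```python
import logging


def load_with_context(load_fn, wav_path):
    """Call an audio-loading function; on failure, log which file
    was being processed before re-raising the original exception."""
    try:
        return load_fn(wav_path)
    except Exception:
        logging.error("Failed while processing audio file: %s", wav_path)
        raise
```

With something like this around each per-file load, a bad wav would at least name itself in the log instead of leaving only the training step number.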

this is weird and does not concur with our experience of when bad audio actually breaks the import

That should already be the case, but patches to improve are always welcome.

this is weird and does not concur with our experience of when bad audio actually breaks the import

Sorry, I wasn’t clear: this was during validation (training with the bad data to reach the problem faster), not during import. The import went through just fine with the bad data.

Then the importer should be fixed to detect those buggy cases. Making the training and validation code handle them would produce messy, unmaintainable code. Dataset validation should happen at the import level.