Recommendations for recording at busy events

We seem to have had a large number of recordings in English from what sounds like a booth at some kind of trade show or event.

I am rejecting >99% of the clips because of background voices. I am letting a few clips through where there is a general hubbub of voices but where no individual background speech can be discerned. The problem is that in a lot of clips you can hear people explaining to other people what Common Voice is, so it seems like people are recording right next to where people are asking questions.

So if anyone is thinking of collecting contributions at a busy event, background noise needs to be considered if you want those contributions to make a difference. Make some effort to physically separate the recording area from the rest of the booth if possible, and try to filter out background noise as much as you can.

One way to improve this is to choose a microphone with a narrower pickup pattern. An omnidirectional mic records in all directions, so it is more likely to pick up background noise, whereas a cardioid mic mainly picks up sound from the front, with sensitivity tapering off toward the sides and back (ideally choose a supercardioid or hypercardioid, which has a tighter pickup pattern).
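To put rough numbers on the difference between patterns, here is a minimal sketch using the standard idealized first-order model, gain(θ) = a + (1 − a)·cos(θ). The coefficients are the usual textbook values, not measurements of any particular microphone, and the function names are made up for illustration.

```python
import math

# Idealized first-order polar patterns: gain(theta) = a + (1 - a) * cos(theta).
# Textbook coefficients (assumed, not measured from any real microphone).
PATTERNS = {
    "omnidirectional": 1.0,
    "cardioid": 0.5,
    "supercardioid": 0.37,
    "hypercardioid": 0.25,
}


def relative_gain(pattern: str, angle_deg: float) -> float:
    """Magnitude of the sensitivity relative to on-axis (0 degrees = front)."""
    a = PATTERNS[pattern]
    return abs(a + (1.0 - a) * math.cos(math.radians(angle_deg)))


def gain_db(pattern: str, angle_deg: float, floor_db: float = -60.0) -> float:
    """Relative sensitivity in dB, clamped to floor_db near a null."""
    g = relative_gain(pattern, angle_deg)
    if g <= 0.0:
        return floor_db
    return max(20.0 * math.log10(g), floor_db)


if __name__ == "__main__":
    for name in PATTERNS:
        row = ", ".join(
            f"{angle} deg: {gain_db(name, angle):6.1f} dB" for angle in (0, 90, 180)
        )
        print(f"{name:16s} {row}")
```

In this model a cardioid is about 6 dB down at the sides and a hypercardioid about 12 dB down, but the hypercardioid keeps a small lobe directly behind it, so the direction the back of the mic faces still matters.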

Another thing to consider would be some kind of sound-proofing. The best solution would be an isolation booth, but this would be expensive (although they can be rented instead of purchased).

A more realistic solution would be to use shielding just around the microphone itself; such shields are pretty inexpensive.

@dabinat While validating some Spanish clips, I could hear some ladies speaking English in the background. I voted yes for those clips. Thanks for the clarification; if I encounter more of these, I’ll have to think a little harder when validating.

If someone else is talking at the beginning or end of a clip (while the main speaker is not talking) it’s an instant rejection for me. There’s no way for DeepSpeech to know which voice it should transcribe and which it should ignore.

If they’re both talking at the same time it’s a much harder call to make. I tend to base it on how clearly the background voice can be heard and how much it conflicts with the main voice.

The speaker was reading Spanish while other people were speaking English in the background. I get that it might be harder to tell whether a clip is valid when the background speech is in the same language, but what about this case? I think it’s better to reject right away then, just to avoid confusion. Thanks for the reply.

I have no experience outside of English, but DeepSpeech is character-based, not word-based, and both languages share a lot of characters. Plus, English has lots of words that are based on other languages, and I’m assuming Spanish speakers probably use some English words for things like technology terms.
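As a rough illustration of how much the two character sets overlap, here is a small sketch. The alphabets below are hypothetical stand-ins chosen for the example; the character list actually used to train any given model may differ.

```python
import string

# Hypothetical alphabets for illustration only; a real model's
# character list may include or exclude other symbols.
ENGLISH = set(string.ascii_lowercase) | {" ", "'"}
SPANISH = set(string.ascii_lowercase) | set("áéíóúüñ") | {" "}

shared = ENGLISH & SPANISH
overlap = len(shared) / len(ENGLISH | SPANISH)


def covered(text: str, alphabet: set) -> float:
    """Fraction of the characters in text that the alphabet can represent."""
    chars = text.lower()
    if not chars:
        return 1.0
    return sum(ch in alphabet for ch in chars) / len(chars)


if __name__ == "__main__":
    print(f"shared characters: {len(shared)}, overlap: {overlap:.0%}")
    for phrase in ("hola mundo", "mañana", "don't stop"):
        print(
            f"{phrase!r}: English {covered(phrase, ENGLISH):.0%}, "
            f"Spanish {covered(phrase, SPANISH):.0%}"
        )
```

With these assumed alphabets, most Spanish text can be spelled out with English output characters, which fits the idea that a character-based model will try to transcribe background speech in the other language.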

I accidentally included a foreign language source in my English dataset and DeepSpeech did make a valiant attempt to transcribe it, even though it was the wrong language.

So it’s probably best to err on the side of rejecting these clips, unless the English dialogue cannot be clearly heard.

@dabinat The fact that DeepSpeech is char-based is interesting. Is there a plan or option to train a BPE-based model? I haven’t looked into the DeepSpeech model in detail.

Thanks for starting this. My understanding is that some background noise is ok and good for certain training as long as you are able to understand the person speaking, but I suspect that other people speaking at the same time is not good.

Maybe @kdavis can comment on which kind of noises are better or worse.

Cheers.

I often hear a lot of scraping or “knocking” when someone is recording, and I will immediately click “No” on that recording. The same thing goes for a generally low-quality recording where there is a lot of hissing sibilance or echo.

My understanding is that scraping, knocking, hissing, low quality are all useful for the dataset in order to have a diversity of recording environments, which mirrors the use cases. If these are the only problems, they should be voted “Yes”.

During the validation, the clips that should be “No” are those where what was said does not match what is written. That might be because of mistakes, incoherence, dropped audio, etc.

You’ve raised an interesting question above about background voices, and it would be good to hear from @kdavis on that.

Background noise is generally ok, provided that it does not drown out the speaker’s voice. Echoes, low quality recordings, etc, should be fine too, provided that they are not too extreme.
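For anyone who wants a rough number while setting up a booth, one way to think about "drowning out" is a signal-to-noise ratio. The sketch below is only an illustration: it assumes you already have one list of samples recorded while the speaker is talking and another recorded during a pause, and it does not suggest any particular threshold. It is no substitute for listening to the clip.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Speech-to-noise level ratio in dB.

    speech: samples captured while the speaker is talking (assumed input)
    noise:  samples captured during a pause, background only (assumed input)
    """
    speech_rms = rms(speech)
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return math.inf
    if speech_rms == 0.0:
        return -math.inf
    return 20.0 * math.log10(speech_rms / noise_rms)


if __name__ == "__main__":
    # Synthetic square-wave style data, just to exercise the function.
    speech = [0.5, -0.5] * 100
    noise = [0.05, -0.05] * 100
    print(f"estimated SNR: {snr_db(speech, noise):.1f} dB")
```

A higher value means the speaker stands out more clearly from the room; comparing the figure before and after moving the mic or adding a shield gives a quick sanity check on the setup.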

Generally what you want to base decisions on is if you could understand what the person is saying just by listening, as if you didn’t have the transcript in front of you.

I second what @dabinat said!