Privacy concerns about dataset metadata

I have a question about the intent here.

Say we collect cities with each donated voice, and someone uses that data to build this:

Take a piece of audio data, and geo-tag each participant in that conversation.

Is that a feature we want for the common-voice dataset?

Participants are anonymized in the dataset so nobody can link a voice clip to a specific person/user.

Similar question here. The proposal states that:

Which broadly makes sense to me (not a speech recognition expert): I understand that in general, the more data, the better. But (again, as a non-expert) it’s difficult to imagine how location will be used when people build tools using this data. What are some examples of applications in which this information is useful? I think having this kind of context in the proposal could better inform the type of feedback someone gives.

Let’s say you want to do speech recognition in English and you use this with users in the US. If your model is good enough, you will be able to recognize some voices but will have big problems recognizing strong accents.

We got a few emails from people living in the South of the US complaining that current commercial applications are not able to understand their accent very well, or at all.

By capturing accents and encouraging a diverse set of them, we can train a strong model that is able to recognize more accents than commercial applications.

People can also create models for very specific cases. Let’s say your target audience is a region in a specific country and you want your model to be specifically optimized for accents in this region.

You’re missing the point.

There’s audio data matched to locations in our data set. You can train an AI system to predict where people are coming from based on their pronunciation.

You can probably detect quite easily, from any other recording, whether someone has participated in Common Voice (and thus point to all the metadata associated with their training data). But you can also just try to find out where people are who are not in the training data.

Can you elaborate a bit more on how you can easily detect whether someone has been participating? (Since we are currently already asking for language and accent, we might want to split the conversation just to talk about this concern.)

From a quick Google search, there seems to be quite successful research on speaker recognition.

The location data isn’t where the user is, it’s where they’re from. So I don’t know if it’s all that useful in trying to track down a person. In my case you’d end up thousands of miles away from where I live now.
