Accent doesn’t just vary by region; it also depends on sex, age, social class, ethnicity, native tongue and even sexuality, all of which are quite personal to the speaker.
Rather than cataloguing every accent and either putting people into buckets or forcing them to self-identify, I think it’d make more sense to detect accents automatically and not care about human classification at all.
Maybe use a single sentence in each language that every contributor reads from time to time, and compute accent markers from that data. The discovered clusters and the distances from them become the accent detection / calibration data, and end-users read that same phrase to get their own accent calibration.
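Roughly what I have in mind, as a very hand-wavy sketch: it assumes some embedding model exists that can turn a recording of the calibration sentence into a fixed-length vector (the `extract_embedding` function below is a placeholder for that), and it uses scikit-learn’s KMeans as just one possible way to form the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_embedding(wav_path: str) -> np.ndarray:
    """Placeholder: map one recording of the shared calibration
    sentence to a fixed-length accent-marker vector."""
    raise NotImplementedError

def build_accent_space(wav_paths: list[str], n_clusters: int = 8) -> KMeans:
    # One embedding per contributor reading the shared sentence.
    X = np.stack([extract_embedding(p) for p in wav_paths])
    # The discovered clusters become the accent calibration data.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(X)

def calibrate(model: KMeans, wav_path: str) -> tuple[int, float]:
    # An end-user reads the same phrase; return the nearest cluster
    # and the distance to its centre.
    x = extract_embedding(wav_path).reshape(1, -1)
    cluster = int(model.predict(x)[0])
    return cluster, float(model.transform(x)[0, cluster])
```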
Is this how it actually works, with labels just used to detect skew in the data sets?
It’s an interesting idea, but it’s very hard to develop something that automatically detects accents without any labeled data. You talk about clustering by distance, but I have never seen a distance metric that actually produces natural clustering by accent. There is just too much other noise: vocal tract length, microphone, speaking rate, background sounds… Even if you could get such natural clusters, you wouldn’t be able to put a name to them or compare them with any ground truth without some human labeling or another outside source.
As you say, there are some difficulties with putting people into buckets or forcing them to self-identify, but it still provides a place to start from, even if it isn’t perfect. Can I ask why you think it would make more sense not to care about human classification? I can certainly see why people would be concerned about privacy, for example.
When you ask “is this how it actually works”, bear in mind that this dataset can be used for many different purposes by many different developers and groups. One is training models for speech recognition, in which case you might, for example, train a model for just one accent, or one for each accent, or provide the accent label as an input to the model. Another could be training accent identification models. Other projects present the CommonVoice data to human learners of a foreign language as a pronunciation model. Within the CommonVoice project itself, I’ve proposed using the accent labels to present reviewers with only the accents they’re familiar with.
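To make the “accent label as input” option concrete, here is a hedged, PyTorch-flavoured sketch; the class name, layer sizes and accent vocabulary are all invented for illustration, not anything from a real CommonVoice pipeline.

```python
import torch
import torch.nn as nn

class AccentConditionedASR(nn.Module):
    """Toy acoustic model that takes the accent label as an extra input."""
    def __init__(self, n_accents: int, feat_dim: int = 80,
                 accent_dim: int = 16, hidden: int = 256, n_tokens: int = 32):
        super().__init__()
        # Learned embedding for the dataset's accent label.
        self.accent_emb = nn.Embedding(n_accents, accent_dim)
        self.encoder = nn.GRU(feat_dim + accent_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, feats: torch.Tensor, accent_id: torch.Tensor):
        # feats: (batch, time, feat_dim); accent_id: (batch,)
        a = self.accent_emb(accent_id).unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.encoder(torch.cat([feats, a], dim=-1))
        return self.out(h)  # per-frame token logits

# e.g. 4 utterances, 100 frames of features, 20 possible accent labels:
model = AccentConditionedASR(n_accents=20)
logits = model(torch.randn(4, 100, 80), torch.randint(0, 20, (4,)))
```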
Yeah, my point was that they wouldn’t matter from the point of view of recognition, only for human categorisation. A “speaker profile” could be completely independent of that.
Yes, mostly because it depends on knowing a hell of a lot about each speaker and their background, which violates the privacy requirements of the project. That, and after thinking about it for a bit I realised that accent is a broad and deep problem that is very difficult to apply categories to, filled with social, cultural and cognitive biases that invite decades of bikeshedding. Machine learning seems like an obvious way to sidestep all that.
I meant in the context of recognition. But the other points you raise are good ones.
Maybe there’s some value in finding a few sentences that work for everyone and offer the most variation, and inviting university researchers to build a database of tagged recordings that could be used to detect accent?
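One way to pick such sentences would be greedily maximising phoneme coverage. A rough sketch, where `phonemize()` is a hypothetical grapheme-to-phoneme step (real tools for this exist, e.g. eSpeak-based phonemizers):

```python
def phonemize(sentence: str) -> set[str]:
    """Hypothetical grapheme-to-phoneme step returning the
    set of phonemes the sentence exercises."""
    raise NotImplementedError

def pick_calibration_sentences(candidates: list[str], k: int = 3) -> list[str]:
    # Greedy set cover: repeatedly take the sentence that adds the
    # most phonemes not yet covered by the sentences chosen so far.
    pools = {s: phonemize(s) for s in candidates}
    covered: set[str] = set()
    chosen: list[str] = []
    while pools and len(chosen) < k:
        best = max(pools, key=lambda s: len(pools[s] - covered))
        chosen.append(best)
        covered |= pools.pop(best)
    return chosen
```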
And what about copying a fully trained, working standard English model (10,000 h, just as an example) together with a fully trained 10,000 h British Isles dialect model (Scottish, say) and training both together? If this works, you add more (Irish) and so on.
But the more you add, the further you go beyond terabyte volumes and end up at petabytes, or storage farms, just to try this. Processing and storage of such huge amounts of data would be the challenge.
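In other words, something like ordinary fine-tuning: copy the weights of the trained standard English model and continue training on the combined standard + dialect data. A minimal sketch, assuming PyTorch-style datasets and a model whose output fits a cross-entropy loss; all the names here are placeholders, not real CommonVoice tooling.

```python
import copy
import torch
from torch.utils.data import ConcatDataset, DataLoader

def fine_tune(base_model: torch.nn.Module, standard_ds, dialect_ds,
              epochs: int = 1, lr: float = 1e-4) -> torch.nn.Module:
    model = copy.deepcopy(base_model)  # keep the original model intact
    # "Train both together": one loader over the combined data.
    loader = DataLoader(ConcatDataset([standard_ds, dialect_ds]),
                        batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, targets in loader:
            opt.zero_grad()
            loss_fn(model(feats), targets).backward()
            opt.step()
    return model  # repeat with the next dialect (Irish, ...) in the same way
```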