I think it’s generally a good idea to collect as much data as possible (and as much as people agree to share), because based on that raw data it might be possible to optimize the model in the future, and raw, accurate data is always a good base for any work of this kind.
I largely agree with using birthplace data as accent data, which is what we currently use in the accent menu for the zh-CN and zh-TW locales. Many language researchers have told me that it’s easier and simpler for people to choose, and for them to process the data.
The first problem with the dataset is that it’s in English. People are not used to selecting the name of their city in English, and an auto-complete system that offers no suggestions for the local name would be a problem, especially when we encourage people who are not used to technology to contribute their voice.
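To illustrate the auto-complete point, here’s a minimal sketch of a lookup that matches the local name as well as the English one. The city records and field names are hypothetical examples, not the actual voice-web data model:

```python
# Hypothetical city records carrying both an English and a local name.
CITIES = [
    {"en": "Taipei", "local": "臺北"},
    {"en": "Kaohsiung", "local": "高雄"},
    {"en": "Beijing", "local": "北京"},
]

def suggest(query: str) -> list[dict]:
    """Return cities whose English or local name starts with the query."""
    q = query.strip()
    return [
        city for city in CITIES
        if city["en"].lower().startswith(q.lower()) or city["local"].startswith(q)
    ]

print(suggest("臺"))   # matches the local name -> [{'en': 'Taipei', 'local': '臺北'}]
print(suggest("tai"))  # matches the English name the same way
```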
The second problem is that I checked the list: it’s only about 20% accurate for Taiwan compared to the current city list*1 we use, and it has only about 800 cities for China, where there should be more than 2.5K*2. We would need to fork it and maintain it ourselves.
*1 https://github.com/mozilla/voice-web/pull/1876/files#diff-a5837602d8f8b2f77869552f3896bbd2
*2 http://big5.www.gov.cn/gate/big5/www.gov.cn/test/2011-08/22/content_1930111.htm
Third, when we discussed which accent list is better for zh-CN, the local community had a long discussion and decided to use the province, not the city, as the birthplace level. There are too many cities, and the accent differences across cities are not obvious. Of course, more detail is better in principle, but we should still keep the list to a reasonable size for easier maintenance and to avoid the frequent changes that a city-level list would require. For China I think the second administrative level is more suitable than cities, which would give us about 350 prefectures.
We also want to use an official list with the numbering IDs from China’s PID system*3 (in which the first 2 digits represent the birthplace) to make it easier for people to choose. Everyone can easily find the option they should select because the system is shared nationwide; even older people or those with little education will know how to identify themselves. And the voice data is easier to process with the official ID.
*3 https://github.com/mozilla/voice-web/pull/1929/files#diff-a5837602d8f8b2f77869552f3896bbd2
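As an illustration of how the ID prefix maps to a birthplace, here’s a minimal sketch. The table is a tiny subset of the official two-digit division codes, for illustration only, not the full list:

```python
# The first two digits of a China resident ID encode the province-level
# division (GB/T 2260 codes). This table is a small illustrative subset.
PROVINCE_BY_PREFIX = {
    "11": "Beijing",
    "31": "Shanghai",
    "44": "Guangdong",
}

def birthplace_province(pid: str) -> str:
    """Return the province encoded in the first two digits of the ID."""
    return PROVINCE_BY_PREFIX.get(pid[:2], "unknown")

print(birthplace_province("440000000000000000"))  # -> "Guangdong"
```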
Considering these problems, I think it’s better to ask contributors to propose the best places list for their locale. We only have 28 languages, and I don’t think it would be difficult for each locale to come up with the most suitable list.
Please note that the proposal is not about “birth places” but about “where is your accent coming from”, and it allows people to self-evaluate this.
You can have been born in one place but now live in another, and you may or may not recognize that your accent now comes from the new place; that’s why we should allow people to decide for themselves.
Sure, I’m giving feedback on how to come up with the places list; it works whether it’s self-identified or not.
Perhaps we should say “select the most appropriate place; if you don’t know which to pick, select your birthplace” to make it easier for some people.
Please also consider accents, or a different sound, that come from a cultural background rather than from a different location. For example, immigrants who use a different tongue, e.g. Turkish-German, Russian-German, …
The same goes for non-native speakers, e.g. German-English, French-English, …, which might result in a very different pronunciation (I already pointed this out at Region or dialect).
Please also ask local translators to find the right/common accents for a language; otherwise users might not be willing to select the right accent if they are unfamiliar with the (too generic) labels or can’t find the right entry.
If I understood your request correctly, the current proposal also captures this, since the accent for non-native languages can be “my native language”. For example, if French is your native language and English your second language, you can select your English accent as “coming from my native language” if you consider that you have a French accent when speaking English.
Hi Rubén, thanks for soliciting this feedback! Would you be open to including an accent category (or categories) for people with impaired enunciation from dysarthria (see e.g. https://www.asha.org/Practice-Portal/Clinical-Topics/Dysarthria-in-Adults/)?
I’m developing a speech-to-text model for someone with spastic dysarthria, and we’re planning to retrain a DeepSpeech model on our audio/transcription data, so we need to record and transcribe a bunch of audio clips for our training data anyway. When I came across the Common Voice project (via DeepSpeech), it occurred to me that if we could collect/store our transcriptions on Common Voice, this voice data would be available to everyone, and maybe someone else could come up with an even better model than we do.
I think this population would be a great match for Common Voice: people who can be understood by others who know them, but who are not understood by Google/Amazon voice APIs or by humans unfamiliar with their voice.
Hi, I don’t really know how this would work. My understanding is that in order to train a model we need at least 1,000 different voices, and we should figure out with a linguist whether this can be considered one accent, or whether each person with impaired enunciation would have a different, unique one, which would prevent us from creating a general model for it.
Hi Rubén, my idea was just to create an accent category or categories so that the audio/transcription data would be publicly available through the Common Voice dataset. I wouldn’t rely on Mozilla to make a model; I’m training our own voice model anyway (which I would also be happy to share if desired).
On your 1,000-different-voices concern, I have two responses. First, I think we could definitely find that many people or more with these voice disorders who are interested in contributing their voice to Common Voice, because this is a population for whom existing voice transcription models do not work adequately. Second, suppose even a general “spastic dysarthria” model does not work well for most specific individuals, so that a useful model would generally need additional training data from each individual user (I don’t know yet whether or not this is the case). Even in this case, a person-specific model would still clearly be useful far sooner, and need far less person-specific training data, if retrained from a pretrained “spastic dysarthria” model rather than from a model previously trained on people with typical enunciation. So any application built on top of a voice transcription model aiming to serve individuals with voice disorders would become useful far more quickly for each user, with less training time required.
I appreciate your consideration, and if you still feel this is outside the scope of Common Voice, I understand. But I do think that including a dysarthria accent category (ideally with subcategories by the types listed in my previous link) would help the community create really useful voice transcription models.
There is also a correlation between accent and level of education, and an influence from bilingualism.
We could also ask people to give their level of education and other spoken languages for better segmentation.
Hi, do you have a link where I can read more about this? It would be interesting to consider if it’s backed by linguistic experts.
Hi, nice topic.
I agree with the first question about “native or not”, but I would prefer it in second position. We could begin with something like “Do you think you have a particular regional or foreign accent?”. If the answer is no, nothing more is needed; otherwise we ask whether the language is native or not.
Next, I think there is absolutely no need to have three levels of detail; we would complicate the process for insignificant differences (if there are any between cities).
I propose that non-native-speaker contributors declare their native language and country, and that native-speaker contributors declare their country and region. The region should be as large as possible; for example, in France we could propose one of the 13 metropolitan regions instead of the departments (there are 97).
@achraf.khelil thanks for your feedback and welcome to the Common Voice community!
It seems to me that your message is centered on implementation and user experience. I would say that’s something our UX experts will figure out how to make easy when we get there. In my opinion we shouldn’t worry much about it now, and should focus on agreeing on the basic needs and strategy.
Cheers.
Thank you for your response, but I’m not focused on UX. I suggested these changes to:
- Respond to the confusing situations raised by @dschridde: non-native speakers without an accent and native speakers with an accent.
- Reduce the dataset and improve segmentation performance by bringing it closer to reality. The reality is that accents do not differ between cities, but between regions. And for non-native speakers, their regions and cities are not important; it’s their native language and country that influence their accent.
I would say that’s a very broad statement. I have examples in my country where accents change between cities within the same region, so this can differ from one country to another. It’s also what we heard from our linguistic experts.
The good thing about this proposal is that people can decide whether to self-identify with a country, region, city, or none. This shouldn’t be an issue, because we have data about where each city is located (region and country), so it won’t affect people who just select regions in countries where there is no difference between cities.
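As a sketch of what I mean by rolling city-level answers up, assuming each city record is stored with its enclosing region and country (the table and field names here are hypothetical):

```python
# Hypothetical lookup: each known city carries its region and country,
# so a city-level answer can always be rolled up to coarser levels.
LOCATIONS = {
    "Lyon":      {"region": "Auvergne-Rhône-Alpes", "country": "France"},
    "Marseille": {"region": "Provence-Alpes-Côte d'Azur", "country": "France"},
}

def roll_up(answer: str) -> dict:
    """Normalize a city-level self-identification to region/country."""
    if answer in LOCATIONS:
        return {"city": answer, **LOCATIONS[answer]}
    # A contributor who picked a region or country directly just stays
    # at that coarser level.
    return {"region_or_country": answer}

print(roll_up("Lyon"))
# -> {'city': 'Lyon', 'region': 'Auvergne-Rhône-Alpes', 'country': 'France'}
```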
In my humble opinion and experience, there are lots of nuances to “accent” in a voice and in the words used in communication. It will be difficult to capture all of these nuances with just one location.
However, given that we are dealing with this data primarily for research and development, including machine learning, I think there is value in categorizing contributions by their apparent characteristics.
I believe this is best supported by “tags” which augment the primary category. For example, a speaker of English could have their category (“English”) described by multiple tags. #us_southern might appropriately apply to a large group of speakers in the United States, but even people in the same city might sound dramatically different. In Kansas City I think I would apply #us_midwestern to most speakers, but additionally some might be described with #urban and/or #latino even though they grew up within a few miles of each other. There are likely subtleties that would apply to someone #latino from #newyork_bronx, or someone #jewish from #newyork_bronx.
This sort of tagging is a little more complex, but I think the accent differences between geographically close locations in New York are much more pronounced than those between Indiana and southern Kansas, which are hundreds of miles apart and would both fall under a general #us_midwestern tag.
In less widely spoken languages I can imagine that location defines much of this. But I also think a tagging system would serve just as well.
The challenge with this system is keeping the tags under control (avoiding both #southern and #us_southern, for example). But if you allow a dynamic, easily searchable list that admins can curate to merge similar tags (including automatic updates for users who had the merged tags), and if you keep the data associated with the individual so that later refinements of their tags are retroactively applied to earlier contributions, then I think you would have a usable mechanism.
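To sketch the merge idea (all names and structures here are hypothetical, not an actual implementation):

```python
# Admin-curated aliases map duplicate tags to one canonical form. Because
# tags are stored per contributor, re-resolving them retroactively updates
# all of that person's earlier contributions.
TAG_ALIASES = {"#southern": "#us_southern"}

contributors = {
    "alice": {"#southern", "#urban"},
    "bob": {"#us_southern"},
}

def resolve(tags: set[str]) -> set[str]:
    """Collapse aliased tags to their canonical form."""
    return {TAG_ALIASES.get(tag, tag) for tag in tags}

for name in list(contributors):
    contributors[name] = resolve(contributors[name])

print(contributors)
# e.g. {'alice': {'#us_southern', '#urban'}, 'bob': {'#us_southern'}}
```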
I’m also worried about the complexity of this proposal; we don’t have the bandwidth to maintain a curated list of tags, or the time to have admins patrolling duplicates in all languages.
Being more granular than city is probably better, yes, but at some point we need to make a compromise between utility, complexity and resources.
Hi! I’m working on a fully accent-independent way to help everyone with pronunciation remediation for free. Instead of examining whether an accented utterance is or is not “correct” according to a pronunciation expert or a panel of judges, we use Nakagawa (2011) and his grad students’ method of trying to predict whether a listener, whether a native English speaker or not, would transcribe the utterance as the speech that was supposed to have been said. Please join the Spoken Language Interest Group of the IEEE Learning Technologies Standards Committee at http://bit.ly/slig and follow the main Discourse topic: Intelligibility remediation.
Thank you!
One small point: in your definition of a language, in order to avoid many variants of the same language, you make reference to a common writing system. This means that for a language without a stable writing system (and they are numerous) your definition will not be operative; you will not be able to say whether it’s a variant of the language with a variant writing system, or a new language. If you want to be normative, you could add “if a writing system exists which allows you to express the same words with the same grammar…”