I am new to the forum. I run an open source foundation in the medical space. We are currently focusing on training AI models for automatic speech recognition. We noticed that current speech recognition systems show biases, with lower accuracy for women, non-native speakers, and regional dialects. This is what we want to change, and we are starting in Germany. We want to create a comprehensive, open voice dataset with Mozilla’s Common Voice platform that represents the diversity of German speakers.
I have the following questions:
Does it make sense to host a separate instance of the Common Voice Platform? The reason for this is that we want to make it as easy as possible for users to donate their voice.
Are there people within the Mozilla Foundation who can provide initial guidance or support?
Are there other people in different geographies focusing on healthcare?
Are there known foundations that could help us financially to run campaigns?
I’m a volunteer here, mainly working on the Turkish dataset, but I also work on analyses of all CV datasets and on training models. What you are doing is great, and I hope everything goes well.
Before the team members reply, let me try to give my views on your first two questions with some extra info:
One can use their own server to collect data, but that would go against the ideals presented on your homepage. That dataset would be your own, creating another “asymmetry”, and it would not be part of CV. CV does not accept external datasets for import.
There may be some volunteers who could help you set up such a system, and in case of issues you can also get some help from the Matrix channel. But beware: the system is actively developed, and if you want to keep it updated (and bugs fixed) you would need fairly constant support, as this would be a multi-year project.
More:
CV had/has a rather active team leading work on the German dataset, and there is also a sub-Discourse here.
CV recently introduced the Domain concept, and healthcare is one of the options. So one can add domain-specific sentences, and contributors can choose to record sentences from their preferred domains.
There are also the Accent and Variant concepts, and many German accents are already defined (no variants yet, but I think you can still suggest them here). For these to be of use, contributors should create an account and fill in that information in their profile. At the end of this post you can find the currently pre-defined German accents on file.
Except for some closely monitored/curated ones, nearly every dataset, including those in CV, has biases, and the German dataset is also biased in terms of gender. As this is a crowd-sourced project, that bias will not go away without some directed campaigns. This is something we all try to correct. But as the German dataset is one of the largest, one could also use a curated/balanced subset of it to fine-tune an existing model, for example. For detailed analytic info on German v18.0 you can check here.
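To make the “balanced subset” idea concrete, here is a minimal sketch using pandas. It assumes the standard validated.tsv layout of a Common Voice release; the file path, column names, and gender labels are placeholders that can differ between dataset versions, so treat this as a starting point rather than a recipe.

```python
# Minimal sketch: build a gender-balanced subset of a German Common Voice
# release for fine-tuning. Assumes the usual validated.tsv layout (columns
# such as "path", "sentence", "gender"); adjust names for your version.
import pandas as pd

df = pd.read_csv("cv-corpus-de/validated.tsv", sep="\t", low_memory=False)

# Keep only clips with a self-reported gender.
df = df[df["gender"].notna() & (df["gender"] != "")]

# Downsample every gender group to the size of the smallest one.
min_count = df["gender"].value_counts().min()
balanced = (
    df.groupby("gender", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
)

balanced[["path", "sentence", "gender"]].to_csv(
    "validated_balanced.tsv", sep="\t", index=False
)
print(balanced["gender"].value_counts())
```

Downsampling to the smallest group is the crudest form of balancing; stratifying by age or accent as well would follow the same pattern.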
One limitation of CV: the released datasets are CC-0, thus the text corpus must also be CC-0. So you cannot just copy text from books and paste it here. You should either use public-domain books (mostly old ones), or people should write their own sentences and donate them. I can see you have a nice network of interested people; if each one writes 100 sentences, you will have 300,000 domain-specific healthcare sentences for your text corpus.
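If you collect sentences from many volunteers, a small pre-check script can save reviewers a lot of time. The limits below (roughly 14 words maximum, no digits, no obvious abbreviations) are my assumptions based on the usual sentence guidelines; the platform’s own validation and the published guidelines are authoritative.

```python
# Rough pre-check for donated healthcare sentences before submitting them to
# Common Voice. The thresholds are assumptions, not the official rules.
import re

MAX_WORDS = 14  # assumed limit; check the current sentence guidelines

def check_sentence(sentence: str) -> list[str]:
    """Return a list of problems found; an empty list means it looks OK."""
    problems = []
    if len(sentence.split()) > MAX_WORDS:
        problems.append("too long")
    if re.search(r"\d", sentence):
        problems.append("contains digits (write numbers out as words)")
    if re.search(r"\b[A-ZÄÖÜ]{2,}\b", sentence):
        problems.append("looks like an abbreviation or acronym")
    if not sentence.strip().endswith((".", "!", "?")):
        problems.append("missing end punctuation")
    return problems

candidates = [
    "Die Pflegekraft misst zweimal täglich den Blutdruck.",
    "Der Patient bekam 500 mg Ibuprofen.",
]
for s in candidates:
    print(s, "->", check_sentence(s) or "ok")
```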
So:
I would suggest you use the existing structure
Make contact with the German team here
Analyze the dataset to pinpoint the deficiencies and plan how to correct them (see the sketch after this list)
Create a campaign
To collect domain-specific sentences
To record them (you should reach women, German speakers outside Germany, and people with accents like me)
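For the “analyze the dataset” step, a quick pass over the release metadata already reveals the gaps. Here is a sketch, again assuming the standard validated.tsv layout; column names such as “accents” vary slightly between versions.

```python
# Minimal sketch: count validated clips per gender, age group, and accent to
# pinpoint demographic gaps in a Common Voice release. Path and column names
# are assumptions; adjust them for the actual German release you download.
import pandas as pd

df = pd.read_csv("cv-corpus-de/validated.tsv", sep="\t", low_memory=False)

for column in ("gender", "age", "accents"):
    counts = df[column].fillna("(not specified)").value_counts()
    print(f"\n=== clips per {column} ===")
    print(counts.head(20))
    missing = counts.get("(not specified)", 0) / len(df)
    print(f"share with no {column} given: {missing:.1%}")
```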
Ask here, on Matrix, or via DM anytime.
Best regards
German Accents (columns: ID, locale, type, token, display name)
268 de preset germany Deutschland Deutsch
269 de preset netherlands Niederländisch Deutsch
270 de preset austria Österreichisches Deutsch
271 de preset poland Polnisch Deutsch
272 de preset switzerland Schweizerdeutsch
273 de preset united_kingdom Britisches Deutsch
274 de preset france Französisch Deutsch
275 de preset denmark Dänisch Deutsch
276 de preset belgium Belgisches Deutsch
277 de preset hungary Ungarisch Deutsch
278 de preset brazil Brasilianisches Deutsch
279 de preset czechia Tschechisch Deutsch
280 de preset united_states Amerikanisches Deutsch
281 de preset slovakia Slowakisch Deutsch
282 de preset russia Russisch Deutsch
283 de preset kazakhstan Kasachisch Deutsch
284 de preset italy Italienisch Deutsch
285 de preset finland Finnisch Deutsch
286 de preset slovenia Slowenisch Deutsch
287 de preset canada Kanadisches Deutsch
288 de preset bulgaria Bulgarisch Deutsch
289 de preset greece Griechisch Deutsch
290 de preset lithuania Litauisch Deutsch
291 de preset luxembourg Luxemburgisches Deutsch
292 de preset paraguay Paraguayisch Deutsch
293 de preset romania Rumänisch Deutsch
294 de preset liechtenstein Liechtensteinisches Deutsch
295 de preset namibia Namibisch Deutsch
296 de preset turkey Türkisch Deutsch
Thank you for your prompt answer. It’s wonderful to meet you!
Let me first correct your statement, as it might lead to misunderstandings. As a non-profit, we are fully focused on open data, meaning that even if we ran the platform ourselves, we would release the database under an open license. The asymmetries we are fighting relate to data as a common good versus data as a financial asset. We are 100% committed to the common good.
The reason we are considering running our own instance is our medical focus and the wish to make things as simple as possible for donating users. I am in contact with large nursing organisations. Many of these people want to support us but have low digital literacy. If the Mozilla Foundation and the team allow others to build a separate user interface branded for the specific campaign we are launching, we would not need to host our own instance.
I have no trouble creating content that is published as CC-0 either. I do have some questions about the voice data being licensed that way. CC-0 works may be used without the donors being able to influence how and for what purpose.
What if the voice data is used for ethically questionable purposes?
Did you ever consider using the Responsible AI Licenses (RAIL-D)?
Apart from non-native speakers of German, I am missing all the German dialects (Germany, Austria, Luxembourg, Belgium). I see that Swiss German is classified as one single accent. This missing classification makes it difficult to create a well-distributed dataset.
Most of the points you make can be answered by the project team (cc: @gina, @jesslynnrose). I was merely welcoming you with some technical data and directions.
I didn’t say otherwise; sorry if it was misunderstood. All I am saying is that you can accomplish your goals right here. Your initiative would also be very inspiring for other language communities.
Except for re-branding, of course; I have no idea about that. One idea that comes to mind: create a campaign page and promote it, with a button on that page that opens the real CV web app. I personally find the UI/UX quite satisfactory. This is what we did in the past.
One major obstacle in such campaigns is the use of social media on mobile devices, where webview-style in-app browsers are involved. You cannot tap a link in the Facebook app on mobile and record; you need to copy and paste the link into a regular browser and grant recording permission.
In the case of elderly people who can read and speak, a caretaker can prepare a laptop or mobile device specifically for that person/session. “Specifically” meaning: if you want to fight gender/age/accent-dialect bias, you need to capture these classifiers, which can only be done by creating a profile. That means a separate e-mail address is needed for each volunteer, which can sometimes be problematic. In my own experience, if the procedure is not repeated regularly, elderly people get confused by the login steps. Repurposing an old laptop for this purpose in each nursing home would help.
In any case, you might need to set some per-person recording limits and make the sessions supervised. Train a caretaker for this purpose so that they can decide whether a recording is satisfactory.
Indeed. As a voice donor, I’m also not happy with CC-0. Here is a related discussion from last year:
At least some of us, if not many, are waiting for some change in this area.
Those accents were defined at the start of the project, before the introduction of the variant concept. I also think they could be re-designed for further recordings, moving some to variants and keeping some as accents within those variants. That would require working with the German community here, with linguists, and with the project team, especially @ftyers.
You can write in the sub-Discourse or on Matrix, or maybe @mkohler can help.
Unfortunately, I can’t. I am not active anymore (okay, considering myself active when it comes to the speaking/listening parts is a stretch… I only ever really did the sentence collector/extractor parts, not actual German-related tasks). I have absolutely no idea who is currently contributing in German.
Well, regardless of dataset quality (as in the bias issues), many people tend to move away once enough data has accumulated. The release of multi-modal large language models by the big players, and the huge processing requirements of newer models, also played a role, I think…
@bozden, thank you for your extensive response; you have addressed most of the inquiries. To summarize and clarify:
Hosting an Instance of CV: CV already includes the functionalities you mentioned, such as accents, variants, and domain-specific data collection. Communities can contribute data for specific domains, accents, and variants. If the variant you want to contribute to isn’t currently listed, please let us know and we’ll add it.
Campaigns: As mentioned, you can create a campaign page separately and link it to the CV page for contributions. We’re here to help with campaign designs, but unfortunately, we don’t have the budget to run campaigns.
CC0 Licensing: We do have community guidelines in place. While our control is limited due to the CC0 licensing, we are exploring alternative ways to promote responsible use practices.