Common Voice for Healthcare (Edge Cases)

Hi,

I am new to the forum. I run an open source foundation in the medical space. We are currently focusing on training AI models for automated speech recognition. We noticed that current speech recognition systems show biases, with lower accuracy for women, non-native speakers and regional dialects. This is what we want to change, and we are starting in Germany. We want to create a comprehensive, open voice dataset with Mozilla’s Common Voice Platform that represents the diversity of German speakers.

I have the following questions:

  • Does it make sense to host a separate instance of the Common Voice Platform? The reason for this is that we want to make it as easy as possible for users to donate their voice.
  • Are there people within the Mozilla Foundation who can provide initial guidance or support?
  • Are people in other regions also focusing on healthcare?
  • Are there known foundations that could help us financially to run campaigns?

Best Regards,
Bart, Hippo AI Foundation, Berlin

Hallo Bart, welcome!

I’m a volunteer here, mainly working on the Turkish dataset, but I also work on analysis of all CV datasets and on training models. What you are doing is great, I hope everything goes well.

Before the team members reply, let me try to give my views on your first two questions with some extra info:

  • One can use their own server to collect data, but that would go against the ideals presented on your homepage. That dataset would be your own, creating another “asymmetry”, and it would not be part of CV. CV does not accept imports of external datasets.
  • There may be some volunteers who can help you set up such a system, and in case of issues you can also get some help from the Matrix channel. But beware: the system is actively developed, and if you want to keep it updated (including bug fixes) you would need fairly constant support, as this would be a multi-year project.

More:

  • CV had/has a rather active leading team working on the German dataset; there is also a sub-Discourse here.
  • CV recently introduced the Domain concept, and healthcare is one of the options. So one can add domain-specific sentences, and people will also be able to choose to record from their preferred domains.
  • There are also the Accent and Variant concepts, and there are a lot of German accents already defined (no variant yet, but I think you can still suggest one here). For them to be of use, contributors should create an account and fill in that info in their profile. At the end of this post you can find the current pre-defined German accents on file.
  • Except for some closely monitored/curated ones, nearly every dataset has biases, including the datasets in CV, and German is also biased in terms of gender. As this is a crowd-sourced project, that bias cannot go away without some directed campaigns. This is something we all try to correct. But as the German dataset is one of the largest ones, one could also use a curated/balanced subset of it to fine-tune an existing model, for example. For detailed analytic info on German v18.0 you can check here.
  • One limitation of CV: The released datasets are CC-0, thus the text corpus must also be CC-0. So, you cannot just copy text from books and paste it here. You should either use public domain books (mostly old), or people should write their own sentences and donate them. I can see you have a nice network of interested people; if each one writes 100 sentences, you will have 300,000 domain-specific healthcare sentences for your text corpus.
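To make the curated/balanced subset idea from the list above a bit more concrete, here is a minimal sketch. It assumes the clip metadata has already been loaded as a list of dicts with a `gender` key (as far as I know the released TSV files carry such a column, but please verify against the release you download). The function name `balanced_subset` and the sample rows are made up for illustration. It simply downsamples every group to the size of the smallest one; it balances clips, not speakers, so a real pipeline would probably also want to cap recordings per speaker.

```python
import random
from collections import defaultdict


def balanced_subset(rows, column, seed=0):
    """Downsample so every value of `column` has the same number of clips.

    `rows` is a list of dicts (one per clip). Rows with an empty or
    missing label are skipped, because they cannot be balanced.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row.get(column)
        if value:
            groups[value].append(row)
    if not groups:
        return []

    # Every group is cut down to the size of the smallest group.
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    subset = []
    for value in sorted(groups):
        subset.extend(rng.sample(groups[value], target))
    return subset


if __name__ == "__main__":
    # Hypothetical sample: 5 male clips, 2 female clips, 1 unlabeled clip.
    sample = [{"path": f"m{i}.mp3", "gender": "male"} for i in range(5)]
    sample += [{"path": f"f{i}.mp3", "gender": "female"} for i in range(2)]
    sample += [{"path": "x.mp3", "gender": ""}]
    for row in balanced_subset(sample, "gender"):
        print(row["path"], row["gender"])
```

Downsampling throws data away, so for a small minority group it may be better to collect more recordings first and only then balance.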

So:

  • I would suggest using the existing structure
  • Make contact with the German team here
  • Analyze the dataset to pinpoint the deficiencies and plan to correct them
  • Create a campaign
    • To collect domain-specific sentences
    • To record them (you should reach out to women, non-Germany locales, and people with accents, like me)
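For the "analyze the dataset" step above, a rough sketch of how one might measure the shares per gender or accent. The file name `validated.tsv` and the column names `gender` and `accents` are assumptions based on how the released datasets are usually laid out, and the label values in the sample are only illustrative; the helper names are mine, not part of any CV tooling.

```python
import csv
from collections import Counter


def load_rows(tsv_path):
    """Read a tab-separated metadata file into a list of dicts."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: sentences may contain stray quote characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def share_by(rows, column):
    """Return {label: fraction of clips} for one metadata column.

    Empty or missing labels are counted as "unspecified", which shows
    how many recordings come from contributors without profile info.
    """
    counts = Counter((row.get(column) or "unspecified") for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical rows; with real data use: rows = load_rows("validated.tsv")
    rows = [
        {"gender": "male", "accents": "Deutschland Deutsch"},
        {"gender": "male", "accents": ""},
        {"gender": "female", "accents": "Schweizerdeutsch"},
        {"gender": "", "accents": ""},
    ]
    for column in ("gender", "accents"):
        print(column, share_by(rows, column))
```

A large "unspecified" share is itself a finding: it suggests the campaign should push contributors to create a profile before recording.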

Ask here, on Matrix, or on DM anytime.

Viele Grüße

German Accents

268	de	preset	germany	Deutschland Deutsch
269	de	preset	netherlands	Niederländisch Deutsch
270	de	preset	austria	Österreichisches Deutsch
271	de	preset	poland	Polnisch Deutsch
272	de	preset	switzerland	Schweizerdeutsch
273	de	preset	united_kingdom	Britisches Deutsch
274	de	preset	france	Französisch Deutsch
275	de	preset	denmark	Dänisch Deutsch
276	de	preset	belgium	Belgisches Deutsch
277	de	preset	hungary	Ungarisch Deutsch
278	de	preset	brazil	Brasilianisches Deutsch
279	de	preset	czechia	Tschechisch Deutsch
280	de	preset	united_states	Amerikanisches Deutsch
281	de	preset	slovakia	Slowakisch Deutsch
282	de	preset	russia	Russisch Deutsch
283	de	preset	kazakhstan	Kasachisch Deutsch
284	de	preset	italy	Italienisch Deutsch
285	de	preset	finland	Finnisch Deutsch
286	de	preset	slovenia	Slowenisch Deutsch
287	de	preset	canada	Kanadisches Deutsch
288	de	preset	bulgaria	Bulgarisch Deutsch
289	de	preset	greece	Griechisch Deutsch
290	de	preset	lithuania	Litauisch Deutsch
291	de	preset	luxembourg	Luxemburgisches Deutsch
292	de	preset	paraguay	Paraguayisch Deutsch
293	de	preset	romania	Rumänisch Deutsch
294	de	preset	liechtenstein	liechtensteinisches Deutscher
295	de	preset	namibia	Namibisch Deutsch
296	de	preset	turkey	Türkisch Deutsch

Merhaba Bülent,

Thank you for your prompt answer. It’s wonderful to meet you!

Let me first correct your statement, as it might lead to misunderstandings. We, as a non-profit, are fully focused on open data, meaning that even if we ran the platform ourselves, we would release that database under an open license. The asymmetries we are fighting relate to data as a common good versus data as a financial asset. We are 100% committed to the common good.

The reason we are considering running our own instance is the focus on medical content and simplification for the donating users. I am in contact with large nursing organisations. Many of these people want to support us but have low digital literacy skills. If the Mozilla Foundation and the team allow others to build a separate user interface that is branded with the specific campaign we are launching, we would not need to host our own instance.

I have no trouble creating content that is published as CC-0 either. I do have some questions about the voice recordings being licensed that way. CC-0 work may be used without the donors being able to influence how and for what purpose.

  • What if the voice data is used for ethically questionable cases?
  • Did you ever consider using the Responsible AI Licenses (RAIL-D)?

Next to foreigners speaking German, I am missing all the German dialects (Germany, Austria, Luxembourg, Belgium). I see that Swiss German is classified as one single accent. This missing classification makes it difficult to create a well-distributed dataset.

How do I contact the local team?

Most of the points you make can be answered by the project team (cc: @gina, @jesslynnrose). I was merely welcoming you with some technical data and directions.

I didn’t say otherwise, sorry if it was misunderstood. All I am saying is that you can accomplish your goals right here. Your initiative would also be very inspiring for other language communities.

Except for re-branding, of course; I have no idea about that. One idea that comes to mind is that you create a campaign page and promote it, and a button on that page opens the real CV webapp. This is what we did in the past. I personally find the UI/UX quite satisfactory.

One major obstacle in such campaigns is the use of social media on mobile devices, where webview-style browsers are involved. You cannot click on a link inside the Facebook app on mobile and record. You need to copy and paste the link into a regular browser and give permission for recording.

In the case of elderly people who can read and speak, a caretaker can prepare a laptop/mobile device specifically for that person/session. “Specifically” meaning: if you want to fight the gender/age/accent-dialect bias, you need to capture these classifiers, which can only be done by creating a profile. That requires a separate e-mail address for each volunteer, which can be problematic sometimes. In my own experience, if it is not repeated regularly, elderly people get confused by the login procedure. Repurposing an old laptop for this use in each nursing home would help.

In any case, you might need to set some limits on recordings per person and make the sessions supervised. Train a caretaker for this purpose so that he/she can decide whether a recording is satisfactory.

Indeed. As a voice donor, I’m also not happy with CC-0. Here is a related discussion from last year:


At least some of us, if not many, are waiting for some change in this area.

Those accents were defined at the start of the project, before the introduction of the variant concept. I also think they can be redesigned for further recordings, moving some to variants and keeping some as accents within a variant. That would need working with the German community here, with linguists and the project team, especially with @ftyers.

You can write to the sub-Discourse, Matrix or maybe @mkohler can help.

I unfortunately can’t. I am not active anymore (okay, considering myself active when it comes to the speaking/listening parts is a stretch… I only ever really did the sentence collector/extractor parts and not actual German related tasks). I have absolutely no idea who is currently contributing in German.


Well, regardless of dataset quality (as with the bias issues), many people tend to move away once enough data has accumulated. Also, the release of multi-modal large language models by the big players and the huge processing requirements of newer models played a role, I think…

BTW, there is also a Matrix group for CV German; I’m not sure if it is active, though:
https://chat.mozilla.org/#/room/#common-voice-de:mozilla.org

Hi @BarttheHippo

@bozden thank you for your extensive response. @bozden has addressed most of the inquiries. To summarize and clarify:

Hosting an Instance of CV: CV already includes the functionalities you mentioned, such as accent, variants, and domain-specific data collection. Communities can contribute data for specific domains, accents, and variants. If the variant you want to contribute to isn’t currently listed, please let us know, and we’ll add it.

Campaigns: As mentioned, you can create a campaign page separately and link it to the CV page for contributions. We’re here to help with campaign designs, but unfortunately, we don’t have the budget to run campaigns.

CC0 Licensing: We do have community guidelines in place. While our control is limited due to the CC0 licensing, we are exploring alternative ways to promote responsible use practices.

If you need specific assistance, feel free to email me at gina@mozillafoundation.org.

Thanks
Gina
