Common Voice for Healthcare (Edge Cases)

Hallo Bart, welcome!

I’m a volunteer here, mainly working on the Turkish dataset, but I also work on analysis of all CV datasets and on training models. What you are doing is great, and I hope everything goes well.

Before the team members reply, let me try to give my views on your first two questions with some extra info:

  • You can use your own server to collect data, but that would go against the ideals stated on your homepage. The dataset would be your own, creating another “asymmetry”, and it would not be part of CV: CV does not accept imports of external datasets.
  • There may be volunteers who can help you set up such a system, and if you run into issues you can also get help from the Matrix channel. But beware: the system is under active development, and keeping it updated (with bug fixes applied) would require fairly constant support, as this would be a multi-year project.

More:

  • CV has (or at least had) a rather active lead team working on the German dataset; there is also a sub-Discourse for it here.
  • CV recently introduced the Domain concept, and healthcare is one of the options. So you can add domain-specific sentences, and people will be able to choose to record sentences from their preferred domains.
  • There are also the Accent and Variant concepts, and many German accents are already defined (no variants yet, but I think you can still suggest some here). For these to be useful, contributors need to create an account and fill in that information in their profile. At the end of this post you can find the currently pre-defined German accents on file.
  • Except for some closely monitored/curated ones, nearly every dataset has biases, including those in CV, and the German dataset is biased in terms of gender. As this is a crowd-sourced project, that bias will not go away without directed campaigns. This is something we all try to correct. But as the German dataset is one of the largest, one could also use a curated/balanced subset of it, for example to fine-tune an existing model. For detailed analytics on German v18.0 you can check here.
  • One limitation of CV: the released datasets are CC-0, so the text corpus must also be CC-0. You cannot just copy text from books and paste it here. You should either use public domain books (mostly old ones), or people should write their own sentences and donate them. I can see you have a nice network of interested people; if each one writes 100 sentences, you will have 300,000 domain-specific healthcare sentences for your text corpus.
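To make the "curated/balanced subset" idea a bit more concrete, here is a minimal sketch, not an official CV tool. It assumes you have already loaded the rows of a release's `validated.tsv` as dictionaries (for example with `csv.DictReader(f, delimiter="\t")`) and that the file has a `gender` column; the exact label values differ between releases, so check your version. The function name `balanced_subset` is my own invention for illustration.

```python
import random
from collections import defaultdict


def balanced_subset(rows, key="gender", seed=0):
    """Downsample every labelled group to the size of the smallest one.

    `rows` is a list of dicts (one per clip); `key` is the metadata
    column to balance on. Rows with an empty label are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        label = (row.get(key) or "").strip()
        if label:
            groups[label].append(row)
    if not groups:
        return []

    # Every group is cut down to the smallest group's size.
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    subset = []
    for label in sorted(groups):
        subset.extend(rng.sample(groups[label], target))
    return subset
```

Note that this balances by clip count only. A small group of very active speakers can still dominate, so for real fine-tuning you would probably also want to cap the number of clips per `client_id`.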

So:

  • I would suggest using the existing structure
  • Make contact with the German team here
  • Analyze the dataset to pinpoint the deficiencies and plan to correct them
  • Create a campaign
    • To collect domain-specific sentences
    • To record them (you should reach out to women, to non-Germany locales, and to people with accents, like me)
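For the "analyze the dataset" step, a first pass can be as simple as counting how clips are distributed over a metadata column. The sketch below is only an illustration: it assumes the tab-separated layout of the release metadata files with columns such as `gender` and `accents` (verify the column names in the release you download), and the helper name `label_shares` is made up.

```python
import csv
import io
from collections import Counter


def label_shares(tsv_text, column):
    """Return the share of rows per label in one column of a TSV string.

    Empty or missing values are reported as "(unspecified)".
    """
    # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = Counter(
        (row.get(column) or "").strip() or "(unspecified)" for row in reader
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

In practice you would read the file with `open(path, encoding="utf-8").read()` and call `label_shares(text, "gender")` and `label_shares(text, "accents")`. A large "(unspecified)" share is itself a finding: it means many contributors recorded without filling in their profile.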

Ask here, on Matrix, or via DM anytime.

Viele Grüße

German Accents

268	de	preset	germany	Deutschland Deutsch
269	de	preset	netherlands	Niederländisch Deutsch
270	de	preset	austria	Österreichisches Deutsch
271	de	preset	poland	Polnisch Deutsch
272	de	preset	switzerland	Schweizerdeutsch
273	de	preset	united_kingdom	Britisches Deutsch
274	de	preset	france	Französisch Deutsch
275	de	preset	denmark	Dänisch Deutsch
276	de	preset	belgium	Belgisches Deutsch
277	de	preset	hungary	Ungarisch Deutsch
278	de	preset	brazil	Brasilianisches Deutsch
279	de	preset	czechia	Tschechisch Deutsch
280	de	preset	united_states	Amerikanisches Deutsch
281	de	preset	slovakia	Slowakisch Deutsch
282	de	preset	russia	Russisch Deutsch
283	de	preset	kazakhstan	Kasachisch Deutsch
284	de	preset	italy	Italienisch Deutsch
285	de	preset	finland	Finnisch Deutsch
286	de	preset	slovenia	Slowenisch Deutsch
287	de	preset	canada	Kanadisches Deutsch
288	de	preset	bulgaria	Bulgarisch Deutsch
289	de	preset	greece	Griechisch Deutsch
290	de	preset	lithuania	Litauisch Deutsch
291	de	preset	luxembourg	Luxemburgisches Deutsch
292	de	preset	paraguay	Paraguayisch Deutsch
293	de	preset	romania	Rumänisch Deutsch
294	de	preset	liechtenstein	liechtensteinisches Deutscher
295	de	preset	namibia	Namibisch Deutsch
296	de	preset	turkey	Türkisch Deutsch