Hallo Bart, welcome!
I’m a volunteer here, mainly working on the Turkish dataset, but I also work on analysis of all CV datasets and on training models. What you are doing is great, and I hope everything goes well.
Before the team members reply, let me try to give my views on your first two questions with some extra info:
- You can use your own server to collect data, but that would go against the ideals stated on your homepage. That dataset would be your own, creating another “asymmetry”, and it would not be part of CV. CV does not accept imports of external datasets.
- Some volunteers might help you set up such a system, and in case of issues you can also get help from the Matrix channel. But beware: the system is actively developed, and if you want to keep it updated (with bug fixes too) you would need fairly constant support, as this would be a multi-year project.
More:
- CV had/has a rather active lead team working on the German dataset, and there is also a sub-Discourse here.
- CV recently introduced the Domain concept, and healthcare is one of the options. So one can add domain-specific sentences, and people will also be able to choose to record from their preferred domains.
- There are also the Accent and Variant concepts, and a lot of German accents are already defined (no variants yet, but I think you can still suggest them here). For these to be of use, contributors should create an account and fill in that info in their profile. At the end of this post you can find the current pre-defined German accents on file.
- Apart from a few closely monitored or curated ones, nearly every dataset has biases, including those in CV, and the German dataset is also biased in terms of gender. Because this is a crowd-sourced project, that bias will not go away without directed campaigns. This is something we all try to correct. But since the German dataset is one of the largest, one could also use a curated, balanced subset of it, for example to fine-tune an existing model. For detailed analytics on German v18.0, you can check here.
- One limitation of CV: the released datasets are CC-0, so the text corpus must also be CC-0. You cannot just copy text from books and paste it here. You should either use public domain books (mostly old ones), or people should write their own sentences and donate them. I can see you have a nice network of interested people; if each one writes 100 sentences, you will have 300,000 domain-specific healthcare sentences for your text corpus.
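To make the "balanced subset" idea above a bit more concrete, here is a minimal sketch of downsampling clip metadata so that each gender label is equally represented. It assumes the metadata rows are dictionaries with a `gender` key (for example, rows read from a release's TSV file with `csv.DictReader(f, delimiter="\t")`). The function name `balanced_subset`, the column name, and the label values in the demo are my assumptions, so please check them against the release you actually use.

```python
import random
from collections import defaultdict


def balanced_subset(rows, key="gender", seed=42):
    """Downsample rows so every non-empty value of `key` has equally many rows.

    `rows` is any iterable of dicts; rows with a missing or empty `key`
    are dropped because they cannot be assigned to a group.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row.get(key, "")
        if value:  # skip clips without demographic info
            groups[value].append(row)
    if not groups:
        return []

    # The smallest group decides how many rows each group may keep.
    target = min(len(group) for group in groups.values())

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    subset = []
    for value in sorted(groups):
        subset.extend(rng.sample(groups[value], target))
    return subset


# Tiny in-memory demo with made-up rows (hypothetical label values).
demo_rows = (
    [{"path": f"m{i}.mp3", "gender": "male"} for i in range(6)]
    + [{"path": f"f{i}.mp3", "gender": "female"} for i in range(2)]
    + [{"path": "u0.mp3", "gender": ""}]
)
demo_subset = balanced_subset(demo_rows)
```

Note that this balances the number of clips, not the number of speakers or the recorded duration. A real curation step would probably also want to cap the clips per speaker.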
So:
- I would suggest using the existing structure
- Make contact with the German team here
- Analyze the dataset to pinpoint the deficiencies and plan to correct them
- Create a campaign
  - To collect domain-specific sentences
  - To record them (you should reach out to women, to locales outside Germany, and to people with accents, like me)
Ask here, on Matrix, or via DM anytime.
Viele Grüße
German Accents
268 de preset germany Deutschland Deutsch
269 de preset netherlands Niederländisch Deutsch
270 de preset austria Österreichisches Deutsch
271 de preset poland Polnisch Deutsch
272 de preset switzerland Schweizerdeutsch
273 de preset united_kingdom Britisches Deutsch
274 de preset france Französisch Deutsch
275 de preset denmark Dänisch Deutsch
276 de preset belgium Belgisches Deutsch
277 de preset hungary Ungarisch Deutsch
278 de preset brazil Brasilianisches Deutsch
279 de preset czechia Tschechisch Deutsch
280 de preset united_states Amerikanisches Deutsch
281 de preset slovakia Slowakisch Deutsch
282 de preset russia Russisch Deutsch
283 de preset kazakhstan Kasachisch Deutsch
284 de preset italy Italienisch Deutsch
285 de preset finland Finnisch Deutsch
286 de preset slovenia Slowenisch Deutsch
287 de preset canada Kanadisches Deutsch
288 de preset bulgaria Bulgarisch Deutsch
289 de preset greece Griechisch Deutsch
290 de preset lithuania Litauisch Deutsch
291 de preset luxembourg Luxemburgisches Deutsch
292 de preset paraguay Paraguayisch Deutsch
293 de preset romania Rumänisch Deutsch
294 de preset liechtenstein liechtensteinisches Deutscher
295 de preset namibia Namibisch Deutsch
296 de preset turkey Türkisch Deutsch
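If you want to work with the list above programmatically, for example to check which accents are already covered before suggesting new ones, a small parser could look like the sketch below. The field names (`accent_id`, `locale`, `kind`, `token`, `name`) are my own labels for the five columns as they appear in this post, not official names from the CV codebase.

```python
from typing import NamedTuple


class Accent(NamedTuple):
    accent_id: int  # numeric id in the first column
    locale: str     # language code, "de" in every row above
    kind: str       # "preset" in every row above
    token: str      # short identifier such as "austria"
    name: str       # display name, may contain spaces


def parse_accent_line(line: str) -> Accent:
    """Split one whitespace-separated row into its five fields."""
    # maxsplit=4 keeps the display name in one piece even if it has spaces.
    accent_id, locale, kind, token, name = line.split(maxsplit=4)
    return Accent(int(accent_id), locale, kind, token, name.strip())


sample = [
    "268 de preset germany Deutschland Deutsch",
    "270 de preset austria Österreichisches Deutsch",
    "272 de preset switzerland Schweizerdeutsch",
]
accents = [parse_accent_line(line) for line in sample]
tokens = {accent.token for accent in accents}
```

With the full list parsed this way, checking whether a given accent already exists becomes a simple membership test on `tokens`.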