For now, random sentences are given to the user to read aloud. The user is doing so just to help the community, and I may feel that I’m wasting my time a little because these sentences don’t mean anything to me. I suggest giving the user the option to select the domain of the sentences they are going to read. For example, maybe I’m interested in geography, and reading geography-related sentences would be more beneficial for me, so I might spend more time reading sentences.
Adding categories to the sentences could be a good thing IMO, mainly to distinguish between local variants of a language, e.g. British English and American English.
But I think adding categories like “Geography” to every one of the millions of sentences would be an absurd amount of work that simply isn’t worth the outcome. It just isn’t practical.
I agree. But the problem that I see is especially acute for low-resource languages like Arabic. People don’t get the importance of participating in the community, so if they can get something in return they might be interested, at least if they can read things that are interesting to them. I know that making this a win-win thing may be difficult. But it will increase participation rates.
Hey @J-Mourad, thanks for sharing your idea.
As part of the product roadmap, we would like to build functionality to highlight the domain of a text corpus (a collection of sentences), e.g. medical or agriculture, to support a variety of text.
We hope to consult with contributors regarding this in the next few months.
I think I’ve shared this before, but I’ll do it again anyway: when I was doing some work for Microsoft about a decade ago, they gave each permanent member of staff a fancy new Windows phone, which they got to keep if they could capture a certain number of people’s voices (25 IIRC) using an app that was preinstalled. So naturally contractors were fought over by permies as a route to the free device.
The sentences it asked me to read looked like they were lifted right out of MSN Messenger and Bing searches: terrible spelling and grammar, txt spk, abuse of the numbers 2 and 4, missing punctuation, and not a capital letter in sight. But because Microsoft are global and the devices had location enabled, I assume it automagically tuned for regional accents in major cities all over the world (accents have a shape across age, space and socioeconomic status).
So IMO Common Voice needs sentence collectors that work with chat apps, because that’s how people actually talk. Tagging the sentence sources with geo-location data might help for localisation too, if people were willing to share that along with the selected sentences. Grading sources on formality and tagging them by subject would also be useful, as per OP’s suggestion, but I think most of the sources we do have are not really “for voice”; they are tuned for internal monologue rather than being utterances.
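To make the tagging idea a bit more concrete, here is a rough sketch of what a tagged sentence and a simple filter could look like. Everything in it (the `TaggedSentence` record, its `domain`, `variant` and `formality` fields, and the `select_sentences` helper) is made up for illustration; it is not Common Voice's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedSentence:
    """Hypothetical sentence record; not a real Common Voice schema."""
    text: str
    domain: Optional[str] = None      # assumed tag, e.g. "geography", "medical"
    variant: Optional[str] = None     # assumed tag, e.g. "en-GB", "en-US"
    formality: Optional[str] = None   # assumed tag, e.g. "formal", "chat"


def select_sentences(sentences, domain=None, variant=None, formality=None):
    """Return the sentences matching every requested tag (None means 'any')."""
    wanted = {"domain": domain, "variant": variant, "formality": formality}
    return [
        s for s in sentences
        if all(value is None or getattr(s, key) == value
               for key, value in wanted.items())
    ]


corpus = [
    TaggedSentence("The Nile flows north into the Mediterranean.",
                   "geography", "en-GB", "formal"),
    TaggedSentence("r u coming 2 the match l8r", None, "en-GB", "chat"),
    TaggedSentence("Take the tablet twice a day.", "medical", "en-US", "formal"),
]

# A contributor interested in geography only sees geography sentences.
geography = select_sentences(corpus, domain="geography")

# Chat-style sentences for one regional variant.
chat_gb = select_sentences(corpus, variant="en-GB", formality="chat")
```

Untagged sentences would still be served when no filter is chosen, so tagging could be added gradually per source rather than for every sentence at once.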