Has Mozilla / Common Voices been approached by public bodies/agencies with strategies to increase voluntary voice gathering and validation?

Alberto_Cunha · December 17, 2021, 8:56pm

Hello There!!

I work as a state prosecutor in Brazil. I was wondering whether there have been any strategies to grow the language corpus. One possible strategy I see is to accept voluntary voice gathering and validation as community service.

I can imagine a separate text corpus that, after internal validation, could be merged into the Brazilian Portuguese database for further validation.

Has anyone tried this?

daniel.abzakh · December 17, 2021, 9:53pm

What do you mean by “internal validation”?

Alberto_Cunha · December 17, 2021, 11:12pm

Those who are doing this as community service may not be as thorough on validation as the volunteers on Common Voice. Also, we may need better control to account for those hours for judicial purposes, so we may need to somehow replicate the Common Voice system.

daniel.abzakh · December 18, 2021, 11:35am

It sounds like a great idea to include that in a public service program.
To have an efficient voice recognition system you need different voices.

With that in mind, you could set up a community service program as follows:

The person in question should record 250 sentences, and validate 700 clips.
Next, the person in question should invite 10 people to contribute similar amounts each.
The person in question should gather screenshots of all these contributions and send them to you for confirmation.
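To get a feel for the scale, here is a minimal sketch of the arithmetic behind the quotas above. The function and constant names are hypothetical and only for illustration; it assumes every participant and every invitee meets the full quota.

```python
# Quotas taken from the proposal above; all names here are hypothetical.
RECORDINGS_PER_PERSON = 250
VALIDATIONS_PER_PERSON = 700
INVITEES_PER_PARTICIPANT = 10


def program_totals(participants: int) -> dict:
    """Totals if each participant and each of their invitees meets the quota."""
    # Each participant counts once, plus the people they invite.
    contributors = participants * (1 + INVITEES_PER_PARTICIPANT)
    return {
        "contributors": contributors,
        "recordings": contributors * RECORDINGS_PER_PERSON,
        "validations": contributors * VALIDATIONS_PER_PERSON,
    }


print(program_totals(1))
# {'contributors': 11, 'recordings': 2750, 'validations': 7700}
```

So a single person completing the program would account for 11 different voices, 2,750 recorded sentences and 7,700 validated clips.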

Alberto_Cunha · December 18, 2021, 6:20pm

I'm afraid this won't work, as we may need to inspect the voice (the person may ask someone else to transcribe/speak/validate under his/her username). Also, he or she may simply copy the screenshots and pass them on to the next person without doing what he or she was supposed to do.

daniel.abzakh · December 18, 2021, 7:11pm

A new feature has been implemented in Common Voice where you can download your own data and profile information; that way you can inspect the voice.

You can ask them to download their data and profile, then pass it to you on a flash drive.

Here is a screenshot:

robovoice · December 18, 2021, 8:02pm

Hey!

Also worth mentioning:

github.com/common-voice/common-voice: docs/LANGUAGE.md (main)

# Language

Common Voice is always growing, and we welcome all new languages. There are two components to adding a new language to Common Voice:

- Make sure it is localized
- Make sure there are sufficient sentences to read

These two things can occur simultaneously

## Localization

In order for a new language to be activated on Common Voice, it must be at least 75% localized in that given language.

We use the [Mozilla localization platform Pontoon](https://pontoon.mozilla.org/projects/common-voice/) to handle translations of the web interface. Use the project page to find your language community and help submit new translations. If your language is not available for translation on Pontoon, you can request for it to be added by submitting a new issue using the [language requests template](https://github.com/mozilla/common-voice/issues/new?assignees=&labels=&template=language_request.md&title=).

For more information on how Common Voice approaches language and accents, please refer to our [language and accent strategy](https://discourse.mozilla.org/t/common-voice-languages-and-accent-strategy-v5/56555).


## Sentences

This file has been truncated.

Portuguese is up and running:

I guess this is European Portuguese?

Brazilian Portuguese, spoken by about 200 million people, has different grammar and pronunciation compared to European Portuguese.

I do not understand Portuguese, so please also check the link above.

heyhillary · December 20, 2021, 11:06am

Hey Alberto,

Welcome to the community and thanks so much for your suggestion.

Some communities (Kinyarwanda and Frisian) have worked with local government to support community campaigns and activities in Common Voice.

Currently, all language corpora that are part of Common Voice are under CC0. Some language communities have previously worked with copyright owners in media to grow the sentences that are part of their language's corpus. They have followed up with community mobilisation, encouraging people to contribute and validate voice clips by reading out these sentences.

I would like to suggest the following steps currently available to you:

Ensure the text you want to contribute is under CC0 or is dedicated to it (check out the CC0 waiver process)
You can do a bulk submission for Portuguese and validate the sentences before submitting them to the website, ahead of voice clip collection.
Mobilise colleagues and community members to contribute and validate voice clips

You might be interested in reading this blog post regarding the introduction of variants for languages.

If you have any other questions please let me know.

Alberto_Cunha · December 20, 2021, 7:29pm

Thank you for all the information. I think the best solution would be to use a client on top of Common Voice. I saw the Common Voice Android project by Savio Morelli, and one idea that came to me was an extra API where we can gather the additional data needed to oversee these volunteers' activities.

In the Kinyarwanda/Frisian case, did Mozilla help those communities oversee their activities? For us, it would be great to have some feedback on some decisions and, if the initial project is successful, institutions from other states may join us.
