Has Mozilla / Common Voice been approached by public bodies/agencies with strategies to increase voluntary voice gathering and validation?

Hello There!!

I work as a state prosecutor in Brazil. I was wondering whether there have been any strategies to grow the language corpus. One possible strategy I see is to accept voluntary voice gathering and validation as community service.

I can imagine a separate text corpus that, after internal validation, could be merged into the Brazilian Portuguese database for further validation.

Has someone tried this?

What do you mean by “internal validation”?

Those doing this as community service may not be as thorough in validation as the volunteers on Common Voice. Also, we may need tighter control to account for those hours for judicial purposes, so we may need to somehow replicate the Common Voice system.

It sounds like a great idea to include that in a public service program.
To have an efficient voice recognition system you need many different voices.

With that in mind, you could set up a community service program as follows:

  1. The person in question should record 250 sentences, and validate 700 clips.
  2. Next, the person in question should invite 10 people to contribute similar amounts each.
  3. The person in question should gather screenshots of all these contributions and send them to you for confirmation.
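To get a feel for the numbers, here is a minimal sketch of the quotas above. The names (`Contribution`, `meets_quota`, `program_totals`) are made up for illustration and are not part of Common Voice; it also assumes "similar amounts" means the same 250/700 quota for each invited person.

```python
from dataclasses import dataclass

# Quotas taken from the three steps above.
SENTENCES_TO_RECORD = 250
CLIPS_TO_VALIDATE = 700
PEOPLE_TO_INVITE = 10


@dataclass
class Contribution:
    """Hypothetical per-person tally, e.g. read off a profile screenshot."""
    recorded: int
    validated: int


def meets_quota(c: Contribution) -> bool:
    """True if one person has reached both the recording and validation quotas."""
    return c.recorded >= SENTENCES_TO_RECORD and c.validated >= CLIPS_TO_VALIDATE


def program_totals(participants: int = 1) -> tuple[int, int]:
    """Total (recordings, validations) if every participant and each of
    their invitees meets the same quota (an assumption of this sketch)."""
    people = participants * (1 + PEOPLE_TO_INVITE)
    return people * SENTENCES_TO_RECORD, people * CLIPS_TO_VALIDATE
```

Under those assumptions, a single participant plus ten invitees would account for 2,750 recorded sentences and 7,700 validated clips.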

I'm afraid this won't work, as we may need to inspect the voice (the person may ask someone else to record or validate under his or her username). Also, the person may just print the images and pass them on to the next one without doing what he or she was supposed to do.

A new feature has been implemented in Common Voice that lets you download your own data and profile information, so that you can inspect the voice.

You can ask them to download their data and profile, then hand it to you on a flash drive.

Here is a screenshot:

Hey!

Also worth mentioning:

Portuguese is up and running:


I guess this is European Portuguese?

Brazilian Portuguese, spoken by 200 million people, has different grammar and pronunciation compared to European Portuguese.

I do not understand Portuguese, so please also check the link above.

Hey Alberto,

Welcome to the community and thanks so much for your suggestion.

Some communities (Kinyarwanda and Frisian) have worked with local government to support community campaigns and activities in Common Voice.

Currently, all language corpora that are part of Common Voice are released under CC0. Some language communities have previously worked with copyright owners in media to grow the sentences that are part of their language's corpus. They followed up with community mobilisation, encouraging people to contribute and validate voice clips by reading out these sentences.

I would like to suggest the following steps currently available to you:

  • Ensure the text you want to contribute is under CC0 or is dedicated to it (check out the CC0 waiver process)

  • You can do a bulk submission for Portuguese and validate the sentences before submitting them to the website, ahead of voice clip collection.

  • Mobilise colleagues and community members to contribute and validate voice clips

You might be interested in reading this blog post regarding the introduction of variants for languages.

If you have any other questions please let me know.

Thank you for all the information. I think the best solution would be to use a client on top of Common Voice. I saw the Common Voice Android project by Savio Morelli, and one idea that came to me was an extra API where we can gather the additional data needed to oversee these volunteers' activities.
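As a rough sketch of what such an extra layer might record, assuming each session is logged with a start and end time so that hours can be accounted for judicial purposes. Everything here (`ActivitySession`, `hours_served`, the field names) is hypothetical and does not correspond to any existing Common Voice or Android client API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class ActivitySession:
    """One supervised session; all fields are assumptions of this sketch."""
    participant_id: str      # identifier assigned by the supervising body
    started_at: datetime
    ended_at: datetime
    clips_recorded: int
    clips_validated: int


def hours_served(sessions: Iterable[ActivitySession], participant_id: str) -> float:
    """Sum the duration of one participant's sessions, in hours.

    Sessions whose end is not after their start are skipped as invalid.
    """
    total = timedelta()
    for s in sessions:
        if s.participant_id == participant_id and s.ended_at > s.started_at:
            total += s.ended_at - s.started_at
    return total.total_seconds() / 3600
```

A real system would also need some way to confirm who is actually speaking, which this sketch does not attempt.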

In the Kinyarwanda/Frisian case, did Mozilla help those communities oversee their activities? For us, it would be great to have some feedback on some decisions and, if the initial project is successful, institutions from other states may join us.