Evolution of Community Support Desk: Share your views

:wave:t6: Hey Common Voice Community

A few weeks ago we launched the Community Support Desk as a drop-in session for community members to raise or work through friction points they have. The drop-ins happen bi-weekly and are hosted on Zoom.

During these sessions, Nart/Daniel from the Abkhazian community shared really valuable feedback on how we could evolve the Community Support Desk.

I really want to thank @daniel.abzakh for their creativity and passion in shaping the evolution of the Community Support Desk. Thank you also to Alex and Francis for taking part in reviewing this idea.

TLDR: Nart and I would like to hear the community's views on the evolution of the Community Support Desk. Please feel free to add your comments.


New Community Support Desk

Overall Aim: The community support layer of Common Voice is open to all languages to help break the barrier. Breaking the barrier means creating a usable voice recognition system that is adopted in a community.

We want to focus on solutions that fit low-resource languages* and can be applied widely. Keeping low resources in mind leads to unique solutions that are effective and robust.

Expectations: knowledge transfer, robust solutions, recommendations, project booster.

Graphic 1: The Language journey process in the community support desk

Alternative Text for graphic: The image is a three-circle Venn diagram describing the journey for a language on the Community Support Desk. This is described in full in the rest of the post.


Stage 1: Information gathering

Aims: Contact current language teams in Common Voice to get a better overall understanding.

Questions that should be asked:

  • What are the goals of the language community?
  • What are the challenges that are unique to the community?
  • What solutions did they come up with?
  • Do they have a product for their language?

Expected result: Provide documented friction points via communication platforms (e.g. Mozilla Pulse, Mozilla Wiki, Playbook) so that robust, up-to-date solutions can be shared.

Alpha Stage: Once the team members are happy to continue with the Community Support Desk they enter the alpha stage.


Stage 2: Knowledge Transfer

Aims: Transfer knowledge to stagnating teams about how other low-resource languages have been able to break the barrier and build with the Common Voice dataset, including:

  • Arrange Zoom call sessions with language communities who can share their experiences with breaking the barrier
    • The Community Support Desk team reaches out to the language communities
    • Demo community case studies, explain concepts, and show what’s possible
  • Asynchronous and synchronous online workshops on topics such as AI training, shell scripting, open source concepts, and community building

Beta Stage is a reflection point for understanding if people have the energy to continue the process.


Stage 3: Activation stage

Aims: Activating language teams to break the barrier (creating an adopted usable voice recognition system in a community) by providing constant support in the following:

  • Resource exploration and allocation, for example CC0 sentence collection, support from local organisations, and volunteers
  • Technical support on web localization and on text processing and clean-up
  • Encouraging people to use the dataset for their community and with their community
  • When a language breaks the barrier, the team has the option to contact the major platforms that people interact with daily to activate the language in their voice services (democratise big tech). The aim is to push for transparency with the dataset and to develop beyond purely commercial motives by giving back to the community.

Release Stage: an opportunity to reflect and evaluate the impact of interventions and hopefully inspire future evangelists to take part in the support desk.


Thanks so much for reading this post. If you have a moment, please comment with your feedback below and/or respond to our anonymous poll.

Question 1: My language community(s) can be described as

  • Endangered or Vulnerable low-resource Community
  • Low-resource community
  • High-resource community


Question 2: Would you find value in taking part in the Community Support Desk, as a mentor or mentee?

  • Yes
  • Unsure
  • Not relevant or don’t have the time


Question 3: Which stage of the Community Support Desk would help your community the most right now?

  • Information Gathering
  • Knowledge Transfer
  • Activation



4 Likes

A general idea for stage 3:

  • Resource exploration and allocation, for example CC0 sentence collection, support from local organisations, and volunteers

We are already sitting on a huge pile of sentences created by Mozilla (in the past) and saved in the Mozilla cloud!

For example: Discourse, internet reports, Mozilla Festival reports, FAQs and release notes of up-and-running Mozilla projects (VPN, Firefox browser(s), Firefox mobile browser(s), …, as well as discontinued Mozilla projects), and publicly released online material from the Mozilla Foundation, just to name a few possibilities.

The first task would be extracting those sentences from the Mozilla projects and cleaning them: removing smileys and abbreviations, completing punctuation (full stops in headlines), checking the number of characters per sentence, rewriting gendered text into a version that can be read aloud in only one way, and so on.
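To make the clean-up steps above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `clean_sentence`, the `MAX_CHARS` limit, and the regular expressions are hypothetical and not part of any existing Common Voice tool. Abbreviations are only flagged for a human to expand, and gendered text is not handled at all because that needs language-specific rules.

```python
import re

# Hypothetical limit; the real limit for a given language may differ.
MAX_CHARS = 100

# A few common ASCII smileys such as :-) ;) :D, only when they stand alone
# as a token (illustrative, not exhaustive).
ASCII_SMILEY = re.compile(r"(?<!\S)[:;=8][\-o']?[)(\]\[DPp/\\|](?!\S)")
# Characters in the main Unicode emoji and symbol blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Tokens that look like abbreviations: all-caps words or dotted forms like "e.g.".
ABBREVIATION = re.compile(r"\b(?:[A-Z]{2,}|(?:[A-Za-z]\.){2,})")


def clean_sentence(text):
    """Return a cleaned sentence, or None if it should be skipped."""
    text = EMOJI.sub("", text)
    text = ASCII_SMILEY.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return None
    if ABBREVIATION.search(text):
        return None  # leave abbreviations for a human to expand
    if text[-1] not in ".!?":
        text += "."  # headlines often lack a final full stop
    if len(text) > MAX_CHARS:
        return None
    return text
```

For example, `clean_sentence("Release notes for version 96")` gets a full stop appended, while `clean_sentence("Get the VPN today")` is skipped because of the abbreviation. Sentences that pass would still need human validation.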

At first only the number of English sentences would increase, but later, after the verified English sentences have been translated on Pontoon, they could be used for newly started languages and also added to the languages that are already up and running.

If this is not possible because of copyright conditions attached in the past, here is another idea:

Create and include a process within Mozilla (projects and foundation) where the creator(s) of (future) texts, reports, FAQs and so on are asked to donate their text to Common Voice. This text would then be translated via Pontoon and included via the CV Sentence Collector or a bulk submission.

Also to think of:
Are there any freely available (open source!) automatic text generators that the public could use to generate new public domain/CC0 sentences?



https://pile.eleuther.ai/

Are the “major platforms” interested in donating auto-generated sentences to CV as public domain CC0 sentences? Or in donating normal (not artificially generated) sentences?

2 Likes

I’m not sure I understand this, could you give some examples?

I think solutions should be tailored to low-resource languages first and then generalized.
The question to ask is: is this solution applicable to a low-resource language?

I think it’s interesting to use Pontoon as a CAT (Computer-Assisted Translation) platform and build parallel text corpora for low-resource languages. But in that case the original text should be in the low-resource language and then be translated into English or other high-resource languages.

It would be a cool feature in Common Voice to allow contributors to translate the sentences and to record their voices. That way you could build both ASR and NMT with MCV.

1 Like

Thanks for the reply.

I’m not sure I understand this, could you give some examples?

Here in Germany there is an ongoing discussion about whether public texts should be gendered or not. Some town and state officials, ministries, and German news websites do this; others refuse to.

An example:
The English term “user” translates to German as “Benutzer” (grammatically masculine, and normally usable for both male and female users).
In gendered texts the male and female forms are written together in one word, like “Benutzer:innen”, “Benutzer(innen)”, or “BenutzerInnen”. The challenge now is how to read this aloud. Multiple ways of speaking it are possible.
The flow of reading and speaking is interrupted, and that is the main reason people reject this.

1 Like

I think it’s interesting to use Pontoon as a CAT (Computer-Assisted Translation) platform and build parallel text corpora for low-resource languages. But in that case the original text should be in the low-resource language and then be translated into English or other high-resource languages.

It would be a cool feature in Common Voice to allow contributors to translate the sentences and to record their voices. That way you could build both ASR and NMT with MCV.

My initial thought was to take English sentences and translate them into the new language as a starting point. The contributors’ own sentences in the new language would have higher priority for recording (also a motivation factor), and the remaining translated English sentences would be used as a “fallback” to record.

With your suggestion, both high- and low-resource languages would benefit 😄
This could open the door to standardising the different language models! (All the different languages, and later the trained speech models, would use the same sentences.) :star_struck:

1 Like

In general I think it is a very bad idea to use translated text for training ASR systems, especially automatically translated texts. These are not natural and would need to be checked by a human, who could just as easily come up with their own sentences.

A better option would be to come up with a way of prompting for collecting dialogue-like sentences, for example a chatbot might work or just a game or chat room where multiple people can talk to each other in the knowledge that what they write will be under CC-0.

1 Like

That is also an additional way of getting sentences.
As long as everyone uses full sentences with punctuation, everything is fine. But when
internet slang (rtfm, afk, lol…),
fOOl language (cu l8er…),
sc3n3 t4lk,
or writing in emoticons is used, most of the sentences must also be validated, corrected, or removed,
especially in public chat rooms and games.
Also, in most games keyboard chat is disabled because some spammers/trolls cannot behave on public servers. Voice chat is mostly used in games on private servers.
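
As a rough illustration of how such chat sentences could be pre-sorted before a human looks at them, here is a small Python sketch. The function name `needs_review`, the tiny `SLANG` set, and the patterns are all hypothetical and only cover the examples mentioned above; a real list would have to be built per language, and the mixed-case check will also flag legitimate names such as "iPhone".

```python
import re

# Hypothetical, deliberately tiny slang list; a real one would be per language.
SLANG = {"rtfm", "afk", "lol", "cu"}

# A token mixing letters and digits, such as "l8er" or "sc3n3".
LEET_TOKEN = re.compile(r"(?=\w*[A-Za-z])(?=\w*\d)\w+")
# A lower-case letter followed by an upper-case one inside a word, as in "fOOl".
MIXED_CASE = re.compile(r"[a-z][A-Z]")


def needs_review(sentence):
    """Return True if the sentence probably needs manual validation."""
    for token in re.findall(r"\w+", sentence):
        if token.lower() in SLANG:
            return True
        if LEET_TOKEN.fullmatch(token):
            return True
        if MIXED_CASE.search(token):
            return True
    # Sentences without final punctuation are also sent to review.
    return not sentence.rstrip().endswith((".", "!", "?"))
```

With this sketch, `needs_review("cu l8er")` is flagged, while `needs_review("See you later.")` passes through. It only sorts sentences into "probably fine" and "look at this"; it does not replace validation by contributors.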