New Language Workflow: 5K Sentences Requirement?

heyhillary · October 22, 2021, 4:32pm

TL;DR: How could we change the 5,000-sentence collection requirement to be more inclusive of a language community's needs (e.g. low-resourced languages, or languages with few speakers)?

Hey everyone,

At the contribute-athon sessions, we discussed some of the ideas the Common Voice team has for the New Language Workflow for Common Voice.

I want to ensure as many people as possible can be involved in this discussion, so I have created this topic.

The language workflow is the process by which a language joins the Common Voice website for voice data collection. See this comment to understand how it works.

We want to improve the language workflow. This includes, but is not limited to, centralising documentation by bringing the Community Playbook onto the Common Voice website and evolving it there.

To help us, we would like to hear the community's thoughts on two questions:

How could we change the 5,000-sentence collection requirement to be more inclusive of a language community's needs (e.g. low-resourced languages, or languages with few speakers)?

What documentation do you wish you had had, or do you still need, to support your language's journey to being launched on Common Voice for voice data contributions?

We look forward to hearing your thoughts!

heyhillary · October 22, 2021, 4:34pm

The current Language Workflow

One or more people request a language via the website, GitHub or community platforms.
The person then begins to localise the platform via Pontoon.
The Common Voice platform text is 90%+ localised via Pontoon by the person and contributing translators.
Roughly 5,000 public domain (CC0) sentences are collected via the Sentence Extractor and the Sentence Collector.
The language is live.
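
As a rough illustration, the two measurable conditions in the workflow above (90%+ localisation and roughly 5,000 CC0 sentences) can be written as a simple readiness check. This is only a sketch: the class, field and function names are hypothetical and do not reflect how Common Voice actually implements the workflow.

```python
from dataclasses import dataclass

# Thresholds taken from the workflow described in this post.
MIN_LOCALISED_FRACTION = 0.90
MIN_SENTENCES = 5000


@dataclass
class LanguageStatus:
    """Hypothetical snapshot of a requested language's progress."""
    code: str
    localised_fraction: float  # share of platform text localised, 0.0 to 1.0
    cc0_sentences: int         # CC0 sentences collected so far


def missing_requirements(status: LanguageStatus) -> list[str]:
    """Return the unmet launch conditions; an empty list means ready to go live."""
    missing = []
    if status.localised_fraction < MIN_LOCALISED_FRACTION:
        missing.append("localisation below 90%")
    if status.cc0_sentences < MIN_SENTENCES:
        missing.append("fewer than 5,000 sentences")
    return missing
```

A check like this could also tell a new community exactly which step is still blocking their language.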

robovoice · October 22, 2021, 6:39pm

Hi!
I am answering from the perspective of a long-time contributor to CV (so I am not involved with newly started languages at all).

It took me a long time to find out how to contribute new sentences and which tool to use (the Common Voice Sentence Collector).
I only found out after crawling through Discourse.

I would not change the 5,000-sentence requirement (it has worked for many languages so far), but I would optimize the way to reach it.

The CV main page currently has two sections, Speak and Listen. A third section (for example, New Sentences) could be added to show contributors all the ways of contributing to the CV project right from the start.

The logic behind this: without new sentences there are no new recordings, and the language section comes to a stop.
Or recording cannot even start because not enough sentences have been collected yet.

The new third section would explain what the Common Voice Sentence Collector is, link to it, and state that CC0 sentences can be contributed for both existing and newly started languages.

It would also explain how to upload larger text files and what criteria they must meet.

bozden · October 22, 2021, 6:46pm

I want to repeat the suggestion I made during the meeting(s), for the general public:

Some of the major difficulties that new and under-resourced languages face are:

Finding 5000 CC-0 sentences
Finding the first group of volunteers to record these.
Translating the UI through Pontoon (90%)

I learned from the meeting that the “5000” sentences requirement is based on an AI/ML-related technical minimum: you need at least that many sentences to be able to build an initial model.

I don’t think that would apply here…

I suggest that the requirement be relaxed for languages with a small number of native speakers (based on census data, for example). Few people speak the language, so there are few sentence resources, few volunteers, etc.

I suggest setting a speaker threshold for the full 5,000 sentences, such as 1,000,000 native speakers (just an example), and requiring half of that (2,500) if the number of speakers is 500,000, for example. This is very important, especially for dying languages, IMHO…

If you adopt such a process, CV will be the anchor point for new volunteers… These volunteers will keep working on completing the UI, sentences and recordings. Otherwise the 5,000 sentences will be produced by very few people, making the first dataset very biased anyway.
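
One possible reading of the scaling suggested in this post is a linear rule: the full 5,000 sentences at or above a speaker threshold, and a proportional share below it. The sketch below assumes that interpretation. The function name, the linear rule and the `floor` parameter are illustrative assumptions, and 1,000,000 is only the example figure from the post.

```python
def required_sentences(native_speakers: int,
                       full_requirement: int = 5000,
                       speaker_threshold: int = 1_000_000,
                       floor: int = 0) -> int:
    """Scale the sentence requirement linearly with the number of native speakers.

    All defaults are example values, not official Common Voice numbers.
    `floor` is a hypothetical lower bound for very small communities.
    """
    if native_speakers < 0:
        raise ValueError("native_speakers must be non-negative")
    if native_speakers >= speaker_threshold:
        return full_requirement
    # Integer arithmetic keeps the result exact for round speaker counts.
    scaled = full_requirement * native_speakers // speaker_threshold
    return max(scaled, floor)
```

With these example defaults, `required_sentences(500_000)` returns 2500, matching the "half of it" case above. Whether a floor is needed, and where it should sit, depends on the technical minimum mentioned in the meeting.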

bozden · October 22, 2021, 6:48pm

This IS of the utmost importance, as I have mentioned everywhere… Like this one:

robovoice · October 22, 2021, 7:08pm

Yes, and many contributors do not know that they can contribute (small numbers of) their own sentences or larger text files, and… leave.
What do you think of the idea of implementing this on the CV main page in a new third section? (Creating your own sentences - every sentence counts for existing and newly started languages - and how to submit larger files.)

For preserving languages, lowering the requirements could be an option, with the danger of running out of sentences. So CV must find a way to show this to all contributors.

robovoice · October 23, 2021, 7:34pm

I have also filed a request for a more straightforward process for this on GitHub:

github.com/common-voice/common-voice

[req] third section for contributor on CV main page

opened 07:23PM - 23 Oct 21 UTC

closed 03:07PM - 26 Mar 22 UTC

robovoice1

**Is your feature request related to a problem? Please describe.**
Contributors can choose between two processes, and both are straightforward. The other ways of participating (contributing new sentences/text to the CV project for existing and newly started languages, translating the UI for newly started languages) are missing. A new third section (e.g. "new text" or "new sentences") on the CV main page could be the starting point for showing contributors a straightforward process for adding new sentences/text. Basically the three-click (or three-tap) rule: in three clicks or taps you reach what you want to do.

**Describe the solution you'd like**
The contributor chooses the new third section on the CV main page. A new page opens with the following options:
- Add new sentences - link to the Common Voice Sentence Collector "add" section
- Review sentences - link to the Common Voice Sentence Collector "review" section
- Add larger text files - link for adding larger text files, with an explanation of the criteria for doing so
- Translating UI - direct link to languages in progress, to translate the CV page (see picture)
- Sentence extractor - ? not sure about this one!

**Additional context**
With this new section, a constant flow of new sentences/text is supported. The CV project shows all ways of participating from the start.

![Screenshot_20211023-220148](https://user-images.githubusercontent.com/92784377/138570085-a0fdfade-c357-4d78-8f05-6537dcbb43f7.png)

All credit for this one goes to @bozden!

robovoice · October 25, 2021, 3:58pm

Finding 5000 CC-0 sentences

In the worst case, there is no material online, or it is behind paywalls.
The rare printed material is copyrighted; you would have to pay authors to switch to CC0.
Knowledge of the language/dialect is passed on by word of mouth.

Finding the first group of volunteers to record these.
Are there any (state) institutions that preserve this language/dialect? Does a national archive exist for it?
Are there private interest groups for this language/dialect? Online and offline! You will not reach them with social media campaigns if they only have GSM phones.
Are there offline courses for it, such as evening classes for interested people?

bozden · October 25, 2021, 4:13pm

For those who are short on sentences, there are two methods we have started using or will be using: (1) volunteers write down utterances from their everyday speech; (2) we will open a CC-0 chatroom and invite everybody to write something or converse there…

You can easily reach 5000 using methods like these… It is not mandatory to have CC-0 literary works. After that point, you can research more, once you have an anchor here.

robovoice · October 25, 2021, 4:25pm

Creating (your own) sentences is the key function.

irvin · October 25, 2021, 4:56pm

A good one that we can add to the best practices! We took the same approach and opened a Chinese CC0 chatroom in our local civic hacking community's Slack over the past two years; we now have tens of thousands of sentences available for use.

irvin · October 25, 2021, 4:59pm

This restriction is not necessary.

At the very beginning of Common Voice (probably in the first year), we did not enforce a “record each sentence only once” rule. That rule was requested by DeepSpeech, so that its model would consider each sentence only once.

We can definitely ease that restriction, since Common Voice no longer serves a single specific project.

robovoice · October 27, 2021, 4:34pm

One more way to get CC0 sentences:
Take 5,000 verified CC0 English sentences (or however many are needed) and translate them via Pontoon into the newly started language.

Automated translation results are mostly not good (classic example: user manuals).

ftyers · September 26, 2022, 2:50pm

In my opinion this is not a good idea. Translations will suffer from translationese and not be idiomatic. In this case, it is better to just come up with sentences from scratch. It will be around the same amount of work as translating, but the sentences will be more idiomatic.
