Sentence Approval Upvotes

KOmondi · April 19, 2021, 3:22pm

How many upvotes do we need for a sentence to be approved? I have a small Swahili community and I have done almost 1K approvals, yet fewer than 200 sentences have been published.

mkohler · April 19, 2021, 6:21pm

A sentence needs 2 upvotes out of 3 votes to be approved:

2 votes in total, both of them upvotes -> approved, as a 3rd vote wouldn’t change anything
2 votes in total, both of them downvotes -> rejected, as a 3rd vote wouldn’t change anything
2 votes in total, 1 upvote and 1 downvote -> the next vote decides
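
The rule above can be written as a small function. This is only an illustrative sketch, not the actual Sentence Collector code; the function name and the status strings are assumptions made for the example.

```python
def sentence_status(upvotes: int, downvotes: int) -> str:
    """Return the review status under the 2-out-of-3 rule described above.

    Hypothetical helper: the name and return values are illustrative only.
    """
    if upvotes >= 2:
        return "approved"   # a 3rd vote could not change the outcome
    if downvotes >= 2:
        return "rejected"   # a 3rd vote could not change the outcome
    return "pending"        # 0 or 1 votes, or a 1-1 tie: more votes needed
```

For instance, `sentence_status(1, 1)` is still pending, and the third vote settles it either way.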

KOmondi · April 19, 2021, 6:23pm

Thanks. This is helpful

sdenton4 · April 22, 2021, 8:46pm

Hi, Michael; thanks for the feedback on how the system works.

I’m curious how new sentences are selected for review; is it uniformly random amongst the available sentences, or is there a preference for sentences that already have a vote?

We’re trying to get to 5k approved sentences so we can launch the Swahili voice collection. We uploaded 35k sentences from the Swahili Wikipedia. A back-of-the-IPython-notebook simulation indicates that reaching 5k approved sentences would require ~22k total votes if sentences are selected uniformly at random from the set of 35k, vs ~10k if sentences that already have a vote are preferred. (So uniform selection would roughly double the amount of work needed to get to launch.)
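
For anyone who wants to reproduce this kind of estimate, here is a minimal sketch of such a simulation. It is not the original notebook. It assumes the 2-out-of-3 rule described earlier in this thread, that each vote is an upvote with probability `p_up` (1.0 by default, i.e. every sentence is good), and that "preferring voted sentences" means always finishing the sentence that already has votes before starting a fresh one. The function and parameter names are made up for the example.

```python
import random

def votes_needed(num_sentences, target, prioritize_voted, p_up=1.0, seed=0):
    """Count the votes cast until `target` sentences are approved.

    Assumed rule: 2 upvotes approve, 2 downvotes reject.
    prioritize_voted=False: each vote goes to a uniformly random undecided sentence.
    prioritize_voted=True: keep voting on the sentence that already has votes.
    """
    rng = random.Random(seed)
    ups = [0] * num_sentences
    downs = [0] * num_sentences
    pending = list(range(num_sentences))  # ids of undecided sentences
    current = None   # position in `pending` of the sentence being finished
    approved = 0
    total_votes = 0
    while approved < target and pending:
        if prioritize_voted and current is not None:
            pos = current
        else:
            pos = rng.randrange(len(pending))
        sid = pending[pos]
        if rng.random() < p_up:
            ups[sid] += 1
        else:
            downs[sid] += 1
        total_votes += 1
        if ups[sid] == 2 or downs[sid] == 2:
            if ups[sid] == 2:
                approved += 1
            # Remove the decided sentence in O(1) by swapping in the last one.
            pending[pos] = pending[-1]
            pending.pop()
            current = None
        else:
            current = pos
    return total_votes
```

Calling `votes_needed(35_000, 5_000, prioritize_voted=False)` and the same with `prioritize_voted=True` gives the two numbers to compare. With `p_up=1.0` the prioritized case is exactly two votes per approved sentence; lowering `p_up` raises both counts.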

mkohler · April 23, 2021, 3:57pm

Sentences in the review queue are sorted by most votes. This makes it most likely for sentences to get approved quickly instead of spreading out votes all over the review queue.
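
As a rough illustration of that ordering (not the actual Sentence Collector query), the sketch below sorts a queue by vote count, highest first. The dictionary fields `id` and `votes` are hypothetical.

```python
def order_review_queue(sentences):
    """Return sentences sorted by number of votes, most-voted first.

    `sentences` is a list of dicts with hypothetical keys "id" and "votes".
    sorted() is stable, so sentences with equal votes keep their input order.
    """
    return sorted(sentences, key=lambda s: s["votes"], reverse=True)
```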

Could you share a link to where those sentences are from? I see them marked with the source “Wikipedia Public Domain Dump files”. If these are indeed Public Domain, then all good. However, if these are coming from normal Wikipedia articles, they might not actually be Public Domain. We have a process for these cases using the Sentence Extractor, which makes sure that we only extract a maximum of 3 sentences per article (legal requirement).

I see what you did there and this deserves its own shoutout, well done!

Thanks!

KOmondi · April 23, 2021, 4:33pm

Ooh… This is awesome then. I was discussing this with @sdenton4 yesterday, hence the question. I must have gotten the wrong information from an online post.

KOmondi · April 23, 2021, 4:43pm

In this case, we got the sentences from the dump files as illustrated in the guide at cv-sentence-extractor.
First, since the “English Dump Files” link in the guide points to the English dump files, we just changed the “en” part of the link to “sw”, resulting in https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles-multistream.xml.bz2, and that’s how we are sure the sentences are from public domain files.
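
The substitution described above amounts to the following sketch. The helper name `dump_url` is made up for the example, and the URL pattern is simply the one quoted above with the language code swapped in.

```python
def dump_url(lang_code: str) -> str:
    """Build the 'latest' Wikipedia dump URL for a language code such as "sw".

    Hypothetical helper: it only reproduces the URL pattern quoted above.
    """
    wiki = f"{lang_code}wiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/latest/"
        f"{wiki}-latest-pages-articles-multistream.xml.bz2"
    )
```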

mkohler · April 23, 2021, 6:34pm

Wikipedia articles are not Public Domain. They are shared under the Creative Commons Attribution-ShareAlike License, which makes them unsuitable for inclusion in Common Voice by default. However, we are allowed to include a maximum of 3 sentences per article as per a legal agreement. This process needs to be run by the Common Voice project to guarantee that no more sentences than allowed are extracted.

This is the reason why we have the Sentence Extractor mentioned above. Here’s what the process looks like:

Rules get developed and verified by the Community
Once the error rate is low enough, the rules get merged
This triggers an automatic sentence extraction which guarantees that only 3 sentences per article are used
The resulting output is added to Common Voice without going through the Sentence Collector
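
To make the per-article cap concrete, here is a minimal sketch of the idea. It is not the Sentence Extractor itself, which applies language-specific rules before selecting anything; the function name, the input shape (a dict mapping article title to candidate sentences) and the random choice are all assumptions for illustration.

```python
import random

def pick_sentences(articles, max_per_article=3, seed=0):
    """Pick at most `max_per_article` sentences from each article.

    `articles` maps an article title to a list of candidate sentences
    (hypothetical input shape). Selection is random but reproducible via `seed`.
    """
    rng = random.Random(seed)
    picked = []
    for title, sentences in articles.items():
        k = min(max_per_article, len(sentences))
        picked.extend(rng.sample(sentences, k))
    return picked
```

An article with five candidate sentences contributes three, and an article with only one contributes that one.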

This has a few advantages that will also benefit you, including not needing to review every single sentence but just a sample. Happy to answer any questions you might have either here or in the #common-voice-sentence-extractor:mozilla.org room on Matrix.

As we can’t guarantee that this legal requirement is met in the Sentence Collector, I had to remove these sentences.

KOmondi · July 22, 2021, 4:10pm

Hello @mkohler, we have now addressed the comments you mentioned here, and I have opened a pull request for it. Please check and let us know what you think. Thanks