We want your feedback: Improving the sentence collection

nukeador · January 9, 2019, 11:23am

Hello everyone,

I’m Rubén and I work on the Mozilla Open Innovation Team. This quarter I’ll be investing more of my time working with @mhenretty to help the Common Voice project, especially around community strategy.

As we know, collecting sentences in different languages is an important step in advancing the project: without valid sentences, we won’t have anything to offer people to read when they want to donate their voice.

In the past we have taken different approaches to solve this issue, including a great sprint in May, but the experience and workflow are not yet great. We are still doing big clean-up efforts to collect existing sentences from very different sources (20-30K unreviewed sentences). Don’t worry, they won’t be lost.

Goals

During this quarter we have a few goals on this front:

Fix and automate the sentence collection workflow.
Provide the community with guidance on the most productive ways to collect sentences.
Grow the number of sentences in top languages.
Get community engaged and influencing the strategy, ideation, and problem solving.

Current problems

We have identified some key problems:

The existing workflow to gather, review and validate sentences is too complex and involves a lot of manual labor (context)
We don’t have good coverage for popular languages: there aren’t enough sentences for them.
We don’t have clear guidance on success to give contributors, and the guidance that does exist is not documented.
Contribution paths are spread across different places (sprint site, Discourse and GitHub), so the contribution experience is not good.

Your input

We would like you to get involved in this conversation, where we want to identify other issues you think the sentence collection has. As a result we want to create a detailed document with all the things we need to solve and the requirements.

I would like to avoid jumping into suggesting solutions and keep the conversation in the problem space, so I have a few questions for you all (no matter your previous involvement with this project, your voice is important):

What are the issues you think the sentence collection phase currently has?
How would you envision a perfect sentence collection workflow in the future? (imagine no constraints)

Thanks for your opinion!

We’ll keep an eye on this topic this week in order to produce a draft requirements doc by Friday.

sixtease · July 23, 2018, 11:48am

It’s hard to suggest an improvement to something that’s not described. Could you please at least outline one of the current ways to collect the sentences?

nukeador · July 23, 2018, 11:57am

You are right, I’ve added some links in the first message.

Currently we have three different paths for collecting sentences:

The review process is currently on hold. We used Crowdin during the sprint, but this also involved running a few manual steps through an intermediate spreadsheet to check for correct length, duplicates… This drained a lot of time and relied on just one person.
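To make the manual spreadsheet steps concrete, here is a minimal sketch of how a length check and duplicate removal could be automated. The function names and the `max_chars=125` limit are assumptions for illustration, not the project's actual rules.

```python
import unicodedata


def normalize_key(sentence: str) -> str:
    """Build a comparison key: Unicode-normalized, lowercased, whitespace-collapsed."""
    text = unicodedata.normalize("NFKC", sentence)
    return " ".join(text.lower().split())


def clean_batch(sentences, max_chars=125):
    """Drop empty lines, overlong sentences and duplicates, keeping first occurrences.

    max_chars is a hypothetical limit chosen for this example.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        stripped = sentence.strip()
        key = normalize_key(stripped)
        if not key or len(stripped) > max_chars or key in seen:
            continue
        seen.add(key)
        kept.append(stripped)
    return kept
```

Duplicates are compared on the normalized key, so sentences that differ only in case or spacing count as the same sentence, while the first spelling encountered is the one that is kept.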

lissyx · July 23, 2018, 12:17pm

Having worked on tooling for French, I can say that some of the issues we had to tackle were:

sourcing valuable CC0 data for mass-import
proper output text normalization

Text normalization involves:

making sure every abbreviation is expanded (Mr, Ms, etc.)
numbers being converted to text, whatever the form of the number (Arabic, Roman, …)
ensuring good balance of the sourcing material
avoiding repeating words in the dataset

IMHO, any workflow that is described and implemented should ensure all those properties are respected before anything is injected into the Common Voice crowdsourcing dataset.
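As a rough illustration of the first two normalization steps, here is a minimal sketch that expands abbreviations from a lookup table and flags sentences that still contain digits. The table and function names are hypothetical; a real table would be per-locale and much larger, and Roman numerals are not handled here.

```python
import re

# Hypothetical, deliberately tiny table used only for this example.
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "Dr.": "Doctor",
}


def expand_abbreviations(sentence: str) -> str:
    """Replace each known abbreviation with its spoken form."""
    for short, full in ABBREVIATIONS.items():
        # Match the abbreviation only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(short) + r"(?!\w)"
        sentence = re.sub(pattern, full, sentence)
    return sentence


def needs_number_spelling(sentence: str) -> bool:
    """Return True if the sentence still contains digits that must be written out."""
    return any(ch.isdigit() for ch in sentence)
```

Flagging digits instead of converting them automatically leaves the choice of spoken form to a human or to locale-specific tooling.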

The current codebase is still a bit tailored toward French, but we’d be happy to expand to more locales: https://github.com/lissyx/commonvoice-fr

nukeador · July 23, 2018, 12:23pm

What do you mean by this? Can you elaborate a bit more? Thanks!

lissyx · July 23, 2018, 12:31pm

Sure. Right now it’s mostly done by hand, but the thing is, we know we need different kinds of material. For example, we can extract 2 million sentences from debates of the French Parliament, but that alone would not make a good enough dataset. We need other kinds of sources.

Currently, we have:

global sprint user-contributed data
French Parliament debates from 2013 to 2018
Project Gutenberg French books
Breton-sourced French material
(upcoming) libretheatre, featuring French plays from the 19th and 20th centuries

And we try to balance the number of lines from each source so that it is roughly the same. Given that Breton has ~7k sentences, we’re re-importing data from the French Parliament and Project Gutenberg to match those 7k (as well as importing ~7k sentences from libretheatre).
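The balancing described above could be sketched as follows: every source is sampled down to the size of the smallest one. This is only an illustration under that assumption; `balance_sources` and the fixed seed are hypothetical and not taken from the commonvoice-fr code.

```python
import random


def balance_sources(sources, seed=0):
    """Sample each source down to the size of the smallest source.

    sources maps a source name to its list of sentences.
    A fixed seed keeps the selection reproducible between runs.
    """
    if not sources:
        return {}
    rng = random.Random(seed)
    target = min(len(sentences) for sentences in sources.values())
    return {
        name: rng.sample(sentences, target)
        for name, sentences in sources.items()
    }
```

With one source of ~7k sentences and another of 2 million, each source would end up contributing ~7k sentences.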

nukeador · July 23, 2018, 1:33pm

fred_trotter · July 23, 2018, 2:20pm

I have been working on a project to backup comments made to federal regulations.

It is my understanding that these comments, since they are part of rule-making, are public domain (IANAL).

This should give me massive amounts of conversational and formal-style English sentences to contribute, and I would like to do so in a programmatic way.

Fred Trotter

Djfe · July 23, 2018, 3:50pm

I’ll take a look at this again later, because I’m short on time.
One thing that comes to my mind though:
sentence translation

this doesn’t work for every sentence obviously and it’s not supposed to be a 100% translation.

just: you get sentence 1 in language a (likely English)
now form a similar sentence in your language.

some sentences in German, for example, are written like commands for AIs like Siri, so they should easily work in other languages, too

also:
what should we do about different grammar/wording in similar languages?

Dutch Dutch vs Belgian Dutch
French French vs Canadian French
Swiss German vs Austrian German vs High German

Max. sentence length
allowed characters
forbidden characters

frequencies of genders, articles and names
pronouns etc.
https://github.com/mozilla/voice-web/issues/902

https://github.com/mozilla/voice-web/issues/756

swear words and collection of sentences with them

Words that we need more of (comparisons to words on/in Wikipedia and their frequency there)/that are missing in our corpus of that language

translation of the word corpus into other languages
maybe import of word corpora from other datasets (than just Wikipedia)
this is likely extremely helpful for languages with fewer contributors (we can give them a starting point)
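The idea of comparing the corpus with word frequencies from a reference such as Wikipedia could look roughly like this. It is a minimal sketch: `missing_words` is a hypothetical helper and `reference_counts` is assumed to be a word-to-frequency mapping prepared elsewhere.

```python
import re


def missing_words(corpus_sentences, reference_counts, top_n=10):
    """Return the most frequent reference words that never occur in the corpus."""
    seen = set()
    for sentence in corpus_sentences:
        seen.update(re.findall(r"\w+", sentence.lower()))
    missing = [
        (word, count)
        for word, count in reference_counts.items()
        if word.lower() not in seen
    ]
    # Most frequent first; ties are broken alphabetically for stable output.
    missing.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in missing[:top_n]]
```

This compares surface forms only, so inflected forms of the same word are treated as different words.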

Sentences with variables which can be replaced with names from cities
https://github.com/mozilla/voice-web/issues/889
https://github.com/mozilla/voice-web/pull/991

if there will be an interface for sentence collection, then it could be helpful to be offered topics/words/names/cities that the dataset lacks while writing.
Maybe another endpoint where people can enter words, cities, names that need to be in the dataset.

how should the different forms of each word be treated in that regard?

awyman · July 23, 2018, 3:59pm

I know you specifically asked for workflow concerns here, and this concern falls under editing as part of your pipeline. If this doesn’t apply then please understand I’m not trying to be inflammatory.
I’ll speak up as a participant for continuing to question the content of the sentences provided. You all were great at responding to the notice of the text missing an equal representation of pronouns. My participation drops when reading or hearing sentences that aren’t neutral: either too moralistic or pulled from religious culture, carrying social-political agendas, slyly sexist or even trite. Plus, parables or homilies or dated phrasing seem a bit suspect, too.
Is content a concern? Are the submissions edited? Are there guidelines to follow or just good judgement used? Are there localization consultants involved?
Feel free to point me to other discussions that may have covered this issue focusing on the project as an artifact. I read over flyingpmonster’s great analysis

r_HN42fyO-WKqJJnvvAS8F-w · July 23, 2018, 4:13pm

I think this is a problem in the case of French. How do you decide how to write and then pronounce ‘1984’? Either you choose one form and exclude all the others, or you need to write out all the ways this number can be pronounced:

mille neuf cent
- quatre vingt quatre
- huitante quatre
dix-neuf cent
- quatre vingt quatre
- huitante quatre

It depends on which country / region the person is from. But Common Voice needs to be trained on these particularities, too.
I expect there are similar issues for Spanish.
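The four readings of ‘1984’ listed above are every combination of two ways to say the century and two ways to say the remainder, which can be enumerated like this. It is a minimal sketch using the spellings from this post; the names are hypothetical and it does not generalize to other numbers.

```python
from itertools import product

# Spoken forms as written in the post above.
CENTURY_FORMS = ["mille neuf cent", "dix-neuf cent"]
REMAINDER_FORMS = ["quatre vingt quatre", "huitante quatre"]


def spoken_variants_of_1984():
    """List every combination of century form and remainder form."""
    return [
        f"{century} {remainder}"
        for century, remainder in product(CENTURY_FORMS, REMAINDER_FORMS)
    ]
```

A sentence containing such a number could then be stored once per variant, or a single variant could be picked per locale.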

Another issue I see is that we can’t correct (or submit a correction for) a blatant typo directly in the tool. It would be great if it at least pointed to the GitHub PR or something like that.

lissyx · July 23, 2018, 4:34pm

Yep, for now I decided to keep the first alternative as fr-FR, but this is clearly one of the issues that should be tackled with more generic tooling. Happy to take any PR to improve that, actually, since while doing validations on French Common Voice I heard some Belgian accents.

r_HN42fyO-WKqJJnvvAS8F-w · July 23, 2018, 4:41pm

I also heard Canadian accents, and we need more Canadian sources too, because there are words that are only used in Canada and not in Europe. Are their public debates available online (at least the ones from Québec)?

geraldobarros · July 23, 2018, 4:50pm

The campaign site does not support more languages, only English;
The campaign is temporary, not continuous, and it also has no gamification to motivate contributors to continue pushing long-term contributions;
There is no way to know the progress of language implementation, as we have pushed something around 2k of words in Brazilian Portuguese and we do not know when Common Voice will receive voices in Brazilian Portuguese - some contributors told me that they feel that the work was lost and not used by Mozilla, and I’m sure that in the next campaign they will not help.

lissyx · July 23, 2018, 5:06pm

This is one element we discussed with @mikehenrty in the past, and it needs to be dealt with properly, I agree. The only thing is, I have not had time to explore those.

luc.salommez · July 23, 2018, 6:17pm

Hi, a lot of proposals have been made and some of them are great.

I’ll just add that I’d like a system where we can either validate a written sentence from someone before it is added to the pool, or report sentences when we are asked to speak them, because sometimes we find very strange sentences (incorrect vocabulary, wrong language, typos, etc.).

Also for French, a lot of sentences are coming from automatically parsed datasets (and I thank the French community for its efforts), where some of them are too long to fit in the allowed recording time or are unreadable.

I guess it is hard to check every sentence, but if we could at least report strange sentences in Common Voice (and maybe remove them after X reports?), that would be great.
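The "remove after X reports" idea could be sketched like this, counting distinct reporters so that one person cannot reach the threshold alone. `ReportTracker` and its default threshold of 3 are hypothetical and only stand in for the X above.

```python
from collections import defaultdict


class ReportTracker:
    """Track which users reported which sentences."""

    def __init__(self, threshold=3):
        # threshold plays the role of "X reports"; 3 is an arbitrary example value.
        self.threshold = threshold
        self._reporters = defaultdict(set)

    def report(self, sentence_id, user_id):
        """Record a report and return True once enough distinct users have reported."""
        self._reporters[sentence_id].add(user_id)
        return len(self._reporters[sentence_id]) >= self.threshold
```

When `report` returns True, the sentence could be hidden from recording and sent to a reviewer instead of being deleted outright.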

Also, as was said in the first post, it is really hard to find a place to contribute sentences, so creating a place on Common Voice to contribute would be a huge improvement to me.

Thanks for your work!
Luc.

lissyx · July 23, 2018, 6:27pm

For French, we filter sentences to keep only those between 3 and 15 words.
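A word-count filter of this kind can be as small as the following. This is a minimal sketch that splits on whitespace; it is not the actual commonvoice-fr implementation, and `within_word_limits` is a hypothetical name.

```python
def within_word_limits(sentence, min_words=3, max_words=15):
    """Return True if the sentence has between min_words and max_words words."""
    return min_words <= len(sentence.split()) <= max_words
```

Counting whitespace-separated tokens says nothing about how long the words themselves are, so a sentence can pass this check and still take a long time to read aloud.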

r_HN42fyO-WKqJJnvvAS8F-w · July 23, 2018, 6:29pm

Yes, but 15 French words, including 4 or 5 with 4+ syllables, take a lot of time to read. I saw some of these, and they are awfully long.
I agree with Luc: having a button to report sentences that we think are not relevant would be nice.

lissyx · July 23, 2018, 6:35pm

As far as I remember, the website allows for at least 10 secs of recording, and I don’t think it’s a hard limit, but I don’t know the details. Any contribution to improve the building of the dataset is welcome.

And a button to report bogus sentences would help improve the dataset.

belkacem77 · July 25, 2018, 11:27am

Hi,

I’m imagining no constraints, as suggested.

I’d like to suggest tagging sentences that contain vulgar or sexual words. We can’t avoid such vocabulary. So when collecting, we can tag them as sexual, and when serving them for recording or listening, an option could be added to the user’s profile to choose whether they want to listen to and record such sentences.
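A minimal sketch of this tagging idea, assuming each sentence carries a set of tags and each profile has an opt-in flag. The `Sentence` class, the `"sensitive"` tag name and the `allow_sensitive` option are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    tags: set = field(default_factory=set)


def visible_sentences(sentences, allow_sensitive=False):
    """Return the sentences a user should see, given their profile option."""
    return [
        sentence
        for sentence in sentences
        if allow_sensitive or "sensitive" not in sentence.tags
    ]
```

Defaulting `allow_sensitive` to False means tagged sentences are only shown to people who explicitly opted in.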