Help create Common Voice's first target segment

dabinat · May 21, 2020, 4:26am

Data tagging is a really great feature that could improve the dataset a lot. I had some questions though:

Will this data be mixed into all the other clips in a dataset release or will target segments be considered separate datasets?
To clarify, this is just to test the feature and there isn’t a specific intended use-case for gathering numbers?

mbranson · May 21, 2020, 9:39pm

Thanks @dabinat, great questions.

We’re aiming to have target segments downloadable as separate dataset segments. So this Single Word target segment would be made available separately from what we refer to as the General Corpus. The intent is for these release cycles to be linked, meaning we’re planning to release the Single Word segment data at the same time as the General Corpus data, with a release target of mid-year. (Note that we’ve never done a release under the bandwidth constraints imposed by the current pandemic situation, and timelines are not concrete.)

Currently spoken digit collection is intended for benchmark testing an application on digit recognition. No other specific use-cases are outlined at the moment.

nukeador · May 22, 2020, 5:49pm

lidyachristina · May 23, 2020, 6:12am

Hello

The pull request for native verification of the Indonesian language was just merged, and I am wondering how long it will take for the Common Voice site to be updated?

nukeador · May 25, 2020, 10:13am

We usually do portal releases every 2 weeks. We have one planned for tomorrow, Tuesday 26th, but I don’t know if the new languages will be merged into the main repo by today.

@Joshua_Meyer how often do you merge the new languages into the voice-web repo? Can you make a PR today before tomorrow’s release?

irvin · May 26, 2020, 7:58am

Hi, what should we do if we have multiple different words corresponding to yes/no? Should we list them all in the PR?

For example, Japanese currently lists two variants for each word. How will we deal with this in Common Voice?

tucan · May 26, 2020, 11:14am

(From another native speaker)
“Hei” is phonetically wrong if you assume Standard German, and lexically awkward.

If the spoken output you want is the word “Hey”, you should write it this way.

“Häi” or “Hej” would be (more) phonetically correct, but they look awkward and unfamiliar to German speakers, and everybody is familiar with anglicisms.
Additionally, “Hey” (in this exact form) is used frequently in everyday speech and writing (e.g. chats), making “Hei” look more like a spelling mistake.

To clarify: “Hei” looks like a bad translation, and is alien to written Standard German.

mbranson · May 26, 2020, 9:03pm

@irvin We’re working with @Joshua_Meyer to understand if multiple words corresponding to one word is an acceptable way forward for this benchmark test set. At the moment, we’ve only merged languages that have a 1-to-1 translation and it’s my understanding that Josh is validating the best corresponding word in each language. If you have multiples, for now it’s okay to list them all in the PR here and work with Josh to determine the best way forward for that language set.

mbranson · May 26, 2020, 11:02pm

@tucan welcome and thanks for that input. Sounds like you and @stergro are aligned re: the German preference for “Hey” over “Hei”. Appreciate that feedback. We’ll be sure the Firefox Voice team gets this info and work with @jofish to determine best way forward prior to our next release (June 9th). Update: This issue has been submitted to Common Voice github for transition of Hei to Hey for German collection. This will go out with the June 9th release.

Also @stergro it’s great that you’re taking on validating more via a private tab. For recording, it’s important that we have as many unique voices as possible, so please avoid recording these target segment words more than once.

dabinat · May 27, 2020, 6:43pm

@mbranson Is there a particular goal or minimum in terms of number of contributors or hours recorded?

Do the progress bars on the Languages page count target segment contributions as part of the total? They shouldn’t if it’s a separate dataset.

iveskins · May 28, 2020, 11:44am

Just a thought. I say digits a little differently when I am speaking them carefully one at a time, compared to reeling them off in a long multi digit number.

nukeador · May 28, 2020, 5:01pm

Are there other languages seeing people reading the footnote?

mbranson · May 29, 2020, 5:41pm

Thanks for the questions and input @dabinat!

For single digits, “yes”, and “no”, we’re aiming to collect 4k validated utterances (clips) from a minimum of 350 unique speakers.
For “hey” and “Firefox”, we’re aiming to collect 2-4k validated utterances (clips) from a minimum of 350 unique speakers.

Yes, they do. This Single Word target segment is a part of our Common Voice Dataset and not a separate dataset itself. Therefore it adds to the overall collection numbers for the Common Voice Dataset.

Part of this work is determining how we represent progress toward segments as part of the dataset whole. Breaking this down by language is indeed another factor. Our current priority is collecting these clips, releasing the data and gathering insights before we make any sweeping changes to how progress is conveyed.

mbranson · May 29, 2020, 5:46pm

Totally agree. To our knowledge, digits have not been collected individually before, and this data, as we’re collecting it individually, doesn’t exist (at least not openly). Creating this segment provides a distinct dimension to test against for benchmarking accuracy. Perhaps we’ll learn that multi-digit number sentences need to be captured as well. Thanks for the input!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · May 29, 2020, 7:33pm

I think the next branch of segments should target large numbers and names of places and people, since I think we could argue that names are CC0.

Okki · May 29, 2020, 9:08pm

We also recognize that listening to people say such short terms repeatedly may get boring and be mentally fatiguing. To avoid burnout, and ensure quality of contribution when listening to clips, each person will only receive a maximum of two sets (or 28 total) of these succinct recordings.

By default, I agree. But it would have been nice to have the opportunity to validate new clips at any time.

shybound · May 31, 2020, 8:09pm

Hello, people in the Common Voice project. I’ve been recording clips for about a week in Esperanto, and I have a few questions: Will you be creating more targeted sentences in the future? Are low-quality recordings considered succinct?

dabinat · May 31, 2020, 11:41pm

The most important factors are that the person says the sentence exactly as written and that it can be reasonably understood by listening (a good way of determining the latter is to close your eyes or look away so you don’t read the sentence before hearing it).

So low quality recordings are fine as long as the person can still be understood.

nukeador · June 1, 2020, 5:28pm

Yes, please see Discussion of new guidelines for recording validation

nukeador · June 3, 2020, 5:35pm