Help create Common Voice's first target segment

@tucan welcome and thanks for that input. Sounds like you and @stergro are aligned re: the German preference for “Hey” over “Hei”. Appreciate that feedback. We’ll make sure the Firefox Voice team gets this info and will work with @jofish to determine the best way forward prior to our next release (June 9th). Update: This issue has been submitted to the Common Voice GitHub for the transition of “Hei” to “Hey” for German collection. This will go out with the June 9th release.

Also @stergro it’s great that you’re taking on validating more via a private tab. For recording, it’s important that we have as many unique voices as possible, so please avoid recording these target segment words more than once.


@mbranson Is there a particular goal or minimum in terms of number of contributors or hours recorded?

Do the progress bars on the Languages page count target segment contributions as part of the total? They shouldn’t if it’s a separate dataset.

Just a thought: I say digits a little differently when I am speaking them carefully one at a time, compared to reeling them off in a long multi-digit number.

Are there other languages seeing people reading the footnote?

Thanks for the questions and input @dabinat!

  • For single digits, yes, and no, we’re aiming to collect 4k validated utterances (clips) from a minimum of 350 unique speakers

  • For hey and Firefox, we’re aiming to collect 2-4k validated utterances (clips) from a minimum of 350 unique speakers

Yes, they do. This Single Word target segment is part of our Common Voice Dataset, not a separate dataset itself. Therefore it adds to the overall collection numbers for the Common Voice Dataset.

Part of this work is determining how we represent progress toward segments as part of the dataset whole. Breaking this down by language is indeed another factor. Our current priority is collecting these clips, releasing the data and gathering insights before we make any sweeping changes to how progress is conveyed.

Totally agree. To our knowledge, digits have not been collected individually before, and this data, as we’re collecting it individually, doesn’t exist (at least not openly). Creating this segment provides a distinct dimension to test against for benchmarking accuracy. Perhaps we’ll learn that multi-digit number sentences need to be captured as well. Thanks for the input!

I think the next branch of segments should target large numbers and names of places and people, since I think we could argue that names are CC0.

We also recognize that listening to people say such short terms repeatedly may get boring and be mentally fatiguing. To avoid burnout, and ensure quality of contribution when listening to clips, each person will only receive a maximum of two sets (or 28 total) of these succinct recordings.

By default, I agree. But it would have been nice to have the opportunity to validate new clips at any time.

Hello, people in the Common Voice project. I’ve been recording clips for about a week under the language Esperanto, and I have a few questions: will you be creating more targeted sentences in the future? Are low-quality recordings considered sufficient?


The most important factors are that the person says the sentence exactly as written and that it can be reasonably understood by listening (a good way of determining the latter is to close your eyes or look away so you don’t read the sentence before hearing it.)

So low quality recordings are fine as long as the person can still be understood.

Yes, please see Discussion of new guidelines for recording validation

I see. Will the Common Voice AI be used in the future to make a text-to-speech sort of thing?

Might not be good for TTS, but it’s helpful for DeepSpeech/STT.

See #tts for the text to speech project :slight_smile:

Sorry if this is an obvious question – does one need to add the words listed here to the sentence collector if they want to see the notification saying " Help create Common Voice’s first target segment in XYZ language" and then speak/record?

Hey, in small languages there often seems to be a situation where two or three people validate most of the sentences, while many others donate 10-20 sentences and never come back to the website.

This creates a situation where the target segment recordings never show up for the active validators and get validated very slowly. At the same time, validators get the message “no more sentences to validate left” while there is still work to do.

I understand why you limited the validation, but maybe it would be good to show the rest to a validator once he or she has reached the end of the validation queue.

It also looks like once the target segment is activated, far fewer “normal” sentences are recorded by new donors.

No need. Sentences captured in Josh’s repo are later incorporated as a separate text corpus from the main one, and the site detects when this corpus is present or not.

This usually takes some time because we only deploy changes to the site every two weeks; today is release day, for example.


to quote Josh himself; “That data doesn’t exist…yet.”

For English at least, there is a very big dataset with 105,829 utterances of 35 words from 2,618 speakers. Every speaker was asked to record the main words 5 times.

Download of latest version: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

You can still contribute at https://aiyprojects.withgoogle.com/open_speech_recording (it takes about 5 minutes; you have to talk fast, because it records just 1 second).

I’d like to see Common Voice do a similar project, but in many languages.

I would like to record every word at once, even multiple times, in a random order.
I would even like to review the same word from different people. That should be very efficient. Can there be a pro mode where I can select such options?

A JS Console command that lets the next item be a random one from this set would help.
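As a rough illustration of what such a helper might do (the word list below is an assumption based on the segment discussed in this thread; Common Voice’s real recording queue is not exposed like this):

```javascript
// Hypothetical sketch only: the single-word target set as discussed here.
// This does not hook into Common Voice's actual site internals.
const words = [
  "zero", "one", "two", "three", "four", "five",
  "six", "seven", "eight", "nine", "yes", "no", "hey", "Firefox"
];

// Return a random prompt from the set, so repeated calls
// walk through the words in random order (with repetition).
function randomPrompt() {
  return words[Math.floor(Math.random() * words.length)];
}

console.log(randomPrompt());
```

Note that the 14 words here match the “two sets (or 28 total)” validation cap mentioned earlier in the thread.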

I've mentioned it in Japanese language: Supplements for Digits, Counters, Popular readings, etc.:

  • Explain that it is a number.
    • For example, annotate the sentence card,
      • This is part of the first target segment [digits 0-9 / Yes / No / Hey / Firefox]
      • Please read the number.
    • The Japanese language is full of homonyms, and even when shown in hiragana, we can't tell it's a number.
      • Why hiragana? (Maybe that's to limit the way of reading.)
      • And does Common Voice want to collect different notations (e.g. hiragana, katakana, kanji)?
    • Even if kanji are used, for example, the “一” is just a bar line, as you can see, and is indistinguishable from a symbol.
    • The speaker doesn't necessarily see this topic.
    • Well, sure, if we can pronounce it, that's fine, maybe. But we want people to be able to pronounce it in the sense of numbers, don't we?
  • Why are Hey (ヘイ) and Firefox (ファイアフォックス) excluded? They can be pronounced in Japanese.
  • There is no most popular reading.
  • Some readings are minor.

Etc.

Despite being verified by native speakers, the current Japanese target segment seems unnatural.

I don't know about voice recognition systems, but wouldn't it be possible, for example, to display 0123456789 (Arabic numerals) to the speaker and link that voice to the text of 零一二三四五六七八九 (kanji) in the dataset?
Arabic numerals will definitely tell us it's a number. (If we can't hope for annotations on the sentence card.)
In fact, if it's a single-digit number, as long as we can link the pronunciation (voice), the notation of the language, and the Arabic numerals, it's not a problem, right?
(Of course, it might not work well in other languages. In any case, I'm not an expert in voice technology.)
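To sketch the suggestion above (a hypothetical mapping only, not how Common Voice actually stores transcripts): the site could display the Arabic numeral to the speaker, while writing the kanji form into the dataset text.

```javascript
// Hypothetical sketch: show Arabic numerals to the speaker while storing
// the kanji transcription in the dataset. Not Common Voice's actual pipeline.
const DIGIT_TO_KANJI = {
  "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"
};

// Convert a displayed prompt like "7" into the stored transcript "七".
function datasetTranscript(prompt) {
  return [...prompt].map(ch => DIGIT_TO_KANJI[ch] ?? ch).join("");
}

console.log(datasetTranscript("7")); // → 七
```

This keeps the pronunciation, the language’s notation, and the Arabic numeral linked for single digits, as proposed.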