Help create Common Voice's first target segment

@tucan welcome and thanks for that input. Sounds like you and @stergro are aligned re: the German preference for “Hey” over “Hei”. Appreciate that feedback. We’ll make sure the Firefox Voice team gets this info and will work with @jofish to determine the best way forward prior to our next release (June 9th). Update: This issue has been submitted to the Common Voice GitHub for the transition of “Hei” to “Hey” for German collection. This will go out with the June 9th release.

Also @stergro it’s great that you’re taking on validating more via a private tab. For recording, it’s important that we have as many unique voices as possible, so please avoid recording these target segment words more than once.


@mbranson Is there a particular goal or minimum in terms of number of contributors or hours recorded?

Do the progress bars on the Languages page count target segment contributions as part of the total? They shouldn’t if it’s a separate dataset.

Just a thought: I say digits a little differently when I am speaking them carefully one at a time, compared to reeling them off in a long multi-digit number.

Are there other languages seeing people reading the footnote?

Thanks for the questions and input @dabinat!

  • For single digits, yes, and no, we’re aiming to collect 4k validated utterances (clips) from a minimum of 350 unique speakers

  • For hey and Firefox, we’re aiming to collect 2-4k validated utterances (clips) from a minimum of 350 unique speakers

Yes, they do. This Single Word target segment is part of our Common Voice Dataset, not a separate dataset itself. Therefore it adds to the overall collection numbers for the Common Voice Dataset.

Part of this work is determining how we represent progress toward segments as part of the dataset whole. Breaking this down by language is indeed another factor. Our current priority is collecting these clips, releasing the data and gathering insights before we make any sweeping changes to how progress is conveyed.

Totally agree. To our knowledge, digits have not been collected individually before, and this data, as we’re collecting it individually, doesn’t exist (at least not openly). Creating this segment provides a distinct dimension to test against for benchmarking accuracy. Perhaps we’ll learn that multi-digit number sentences need to be captured as well. Thanks for the input!

I think the next branch of segments should target large numbers and names of places and people, since I think we could argue that names are CC0.

We also recognize that listening to people say such short terms repeatedly may get boring and be mentally fatiguing. To avoid burnout, and ensure quality of contribution when listening to clips, each person will only receive a maximum of two sets (or 28 total) of these succinct recordings.

By default, I agree. But it would have been nice to have the opportunity to validate new clips at any time.

Hello, people in the Common Voice project. I’ve been recording clips for about a week under the language Esperanto, and I have a few questions: will you be creating more targeted sentences in the future? Are low-quality recordings considered sufficient?


The most important factors are that the person says the sentence exactly as written and that it can be reasonably understood by listening (a good way of determining the latter is to close your eyes or look away so you don’t read the sentence before hearing it.)

So low quality recordings are fine as long as the person can still be understood.

Yes, please see Discussion of new guidelines for recording validation

I see. Will the Common Voice AI be used in the future to make a text-to-speech sort of thing?

Might not be good for TTS, but it’s helpful for DeepSpeech/STT.

See #tts for the text to speech project :slight_smile:

Sorry if this is an obvious question – does one need to add the words listed here to the sentence collector if they want to see the notification saying " Help create Common Voice’s first target segment in XYZ language" and then speak/record?

Hey, in small languages there often seems to be a situation where two or three people validate most of the sentences, while many others donate 10-20 sentences and never come back to the website.

This creates a situation where the target segment recordings never show up for the active validators and get validated very slowly. At the same time, validators get the message “no more sentences to validate left” while there is still work to do.

I understand why you limited the validation, but maybe it would be good to show the rest to a validator once he or she has reached the end of the validation queue.

It also looks like once the target segment is activated, far fewer “normal” sentences are recorded by new donors.

No need. Sentences captured in Josh’s repo are later incorporated as a separate text corpus from the main one, and the site detects when this corpus is present or not.

This usually takes some time because we only deploy changes to the site every two weeks; today is release day, for example.


to quote Josh himself; “That data doesn’t exist…yet.”

For English at least, there is a very big dataset with 105,829 utterances of 35 words from 2,618 speakers. Every speaker was asked to record the main words 5 times.

Download of latest version: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

You can still contribute at https://aiyprojects.withgoogle.com/open_speech_recording (it takes about 5 minutes; you have to talk fast, because it records just 1 second).

I’d like to see Common Voice do a similar project, but in many languages.

I would like to record every word at once, even multiple times, in a random order.
I would even like to review the same word from different people. That should be very efficient. Can there be a pro mode where I can select such options?

A JS Console command that lets the next item be a random one from this set would help.
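As a rough illustration of what such a helper might do (the word list below is an assumption based on the segment discussed in this thread; Common Voice’s real recording queue is not exposed like this):

```javascript
// Hypothetical sketch only: the single-word target set as discussed here.
// This does not hook into Common Voice's actual site internals.
const words = [
  "zero", "one", "two", "three", "four", "five",
  "six", "seven", "eight", "nine", "yes", "no", "hey", "Firefox"
];

// Return a random prompt from the set, so repeated calls
// walk through the words in random order (with repetition).
function randomPrompt() {
  return words[Math.floor(Math.random() * words.length)];
}

console.log(randomPrompt());
```

Note that the 14 words here match the “two sets (or 28 total)” validation cap mentioned earlier in the thread.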

I've mentioned it in Japanese language: Supplements for Digits, Counters, Popular readings, etc.:

  • Explain that it is a number.
    • For example, annotate the sentence card,
      • This is part of the first target segment [digits 0-9 / Yes / No / Hey / Firefox]
      • Please read the number.
    • The Japanese language is full of homonyms, and even when shown in hiragana, we can't tell it's a number.
      • Why hiragana? (Maybe that's to limit the way of reading.)
      • And does Common Voice want to collect different notations (e.g. hiragana, katakana, kanji)?
    • Even if kanji are used, for example, the “一” is just a bar line, as you can see, and is indistinguishable from a symbol.
    • The speaker doesn't necessarily see this topic.
    • Well, sure, if we can pronounce it, that's fine, maybe. But we want people to be able to pronounce it in the sense of numbers, don't we?
  • Why are Hey (ヘイ) and Firefox (ファイアフォックス) excluded? They can be pronounced in Japanese.
  • There is no most popular reading.
  • Some readings are minor.

Etc.

Despite being verified by native speakers, the current Japanese target segment seems unnatural.

I don't know about voice recognition systems, but wouldn't it be possible, for example, to display 0123456789 (Arabic numerals) to the speaker and link that voice to the text of 零一二三四五六七八九 (kanji) in the dataset?
Arabic numerals will definitely tell us it's a number. (If we can't hope for annotations on the sentence card.)
In fact, if it's a single-digit number, as long as we can link the pronunciation (voice), the notation of the language, and the Arabic numerals, it's not a problem, right?
(Of course, it might not work well in other languages. In any case, I'm not an expert in voice technology.)
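To sketch the suggestion above (a hypothetical mapping only, not how Common Voice actually stores transcripts): the site could display the Arabic numeral to the speaker, while writing the kanji form into the dataset text.

```javascript
// Hypothetical sketch: show Arabic numerals to the speaker while storing
// the kanji transcription in the dataset. Not Common Voice's actual pipeline.
const DIGIT_TO_KANJI = {
  "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"
};

// Convert a displayed prompt like "7" into the stored transcript "七".
function datasetTranscript(prompt) {
  return [...prompt].map(ch => DIGIT_TO_KANJI[ch] ?? ch).join("");
}

console.log(datasetTranscript("7")); // → 七
```

This keeps the pronunciation, the language’s notation, and the Arabic numeral linked for single digits, as proposed.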