Help create Common Voice's first target segment

Totally agree. To our knowledge, digits have not been collected individually before, and this data, as we're collecting it, doesn't exist (at least not openly). Creating this segment provides a distinct dimension to benchmark accuracy against. Perhaps we'll learn that multi-digit number sentences need to be captured as well. Thanks for the input!

I think the next branch of segments should target large numbers and names of places and people, since I think we could argue that names are CC0.

We also recognize that listening to people say such short terms repeatedly may get boring and be mentally fatiguing. To avoid burnout and ensure the quality of contributions when listening to clips, each person will only receive a maximum of two sets (or 28 total) of these succinct recordings.

By default, I agree. But it would have been nice to have the opportunity to validate new clips at any time.

Hello, people in the Common Voice project. I've been recording clips for about a week under the language Esperanto, and I have a few questions: will you be creating more targeted sentences in the future? Are low-quality recordings considered sufficient?


The most important factors are that the person says the sentence exactly as written and that it can be reasonably understood by listening (a good way of determining the latter is to close your eyes or look away so you don't read the sentence before hearing it).

So low quality recordings are fine as long as the person can still be understood.

Yes, please see Discussion of new guidelines for recording validation

I see. Will the Common Voice AI be used in the future to make a text-to-speech sort of thing?

Might not be good for TTS, but it's helpful for DeepSpeech/STT.

See #tts for the text-to-speech project 🙂

Sorry if this is an obvious question, but does one need to add the words listed here to the Sentence Collector in order to see the notification saying "Help create Common Voice's first target segment in XYZ language" and then speak/record?

Hey, in small languages there often seems to be a situation where two or three people validate most of the sentences, while many others donate 10-20 sentences and never come back to the website.

This creates a situation where the target segment recordings never show up to the active validators and get validated very slowly. At the same time, validators get the message "no more sentences left to validate" while there is still work to do.

I understand why you limited the validation, but maybe it would be good to show the rest to a validator once he or she has reached the end of the validation queue.

It also looks like once the target segment is activated, far fewer "normal" sentences get recorded by new donors.

No need: sentences captured in Josh's repo are later incorporated as a text corpus separate from the main one, and the site detects whether this corpus is present.

This usually takes some time because we only deploy changes to the site every two weeks; today is release day, for example.


To quote Josh himself: "That data doesn't exist…yet."

At least for English there exists a very big dataset, with 105,829 utterances of 35 words from 2,618 speakers. Each speaker was asked to record the main words 5 times.

Download of latest version: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

You can still contribute at https://aiyprojects.withgoogle.com/open_speech_recording (it takes about 5 minutes; you have to talk fast, because each recording is just 1 second).

I would like to see Common Voice do a similar project, but in many languages.

I would like to record every word at once, even multiple times, in a random order.
I would even like to review the same word from different people. That should be very efficient. Can there be a pro mode where I can select such options?

A JS Console command that let’s the next item be a random one from this set would help.

I've mentioned this in the Japanese-language topic Supplements for Digits, Counters, Popular readings, etc.:

  • Explain that it is a number.
    • For example, annotate the sentence card,
      • This is part of the first target segment [digits 0-9 / Yes / No / Hey / Firefox]
      • Please read the number.
    • The Japanese language is full of homonyms, and even when shown in hiragana, we can't tell it's a number.
      • Why hiragana? (Maybe that's to limit the way of reading.)
      • And does Common Voice want to collect different notations (e.g. hiragana, katakana, kanji)?
    • Even if kanji are used, "一" for example is just a bar line, as you can see, and is indistinguishable from a symbol.
    • The speaker doesn't necessarily see this topic.
    • Well, sure, if we can pronounce it, that's fine, maybe. But we want people to be able to pronounce it in the sense of numbers, don't we?
  • Why are Heyヘイ and Firefoxファイアフォックス excluded? They can be pronounced in Japanese.
  • There is no most popular reading.
  • Some readings are minor.

Etc.

Despite being verified by native speakers, the current Japanese target segment seems unnatural.

I don't know about voice recognition systems, but wouldn't it be possible, for example, to display 0123456789 (Arabic numerals) to the speaker and link that voice to the text of 零一二三四五六七八九 (kanji) in the dataset?
Arabic numerals will definitely tell us it's a number. (If we can't hope for annotations on the sentence card.)
In fact, if it's a single-digit number, as long as we can link the pronunciation (voice), the notation of the language, and the Arabic numerals, it's not a problem, right?
(Of course, it might not work well in other languages. In any case, I'm not an expert in voice technology.)
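The linking idea above can be sketched out concretely. The kanji notation (零一二三四五六七八九) is taken straight from the post; the object shape and function name here are invented purely for illustration, not anything Common Voice actually implements:

```javascript
// Sketch of the proposal: show the speaker an Arabic numeral (unambiguously
// a number), but store the language's own notation alongside it in the dataset.
const DIGIT_TO_KANJI = {
  "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
  "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
};

function makePromptEntry(arabicDigit) {
  return {
    displayed: arabicDigit,               // what the speaker sees on the card
    stored: DIGIT_TO_KANJI[arabicDigit],  // notation recorded in the dataset
  };
}

console.log(makePromptEntry("7").stored); // "七"
```

The same split between "what is displayed" and "what is stored" would also let a language collect several notations (hiragana, katakana, kanji) for one displayed digit, which touches on the notation question raised above.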