Common Voice sentences are the opposite of "common"

We collect Chinese sentences from the log of a designated CC0 chat room on our local OSS community’s Slack; everyone knows that all chat in this room will be de-identified and turned into public-domain material.

Which language are you trying to contribute to? Perhaps you and your local community can do something similar.

2 Likes

@irvin, how many new sentences were you able to collect, and what was the time frame for collecting them?

I want to understand this in order to get a sense of the scale of the problem and figure out the most time-efficient strategies for keeping our sentence dataset growing.

Thanks!

My understanding is that most STT products explicitly forbid you from using them to train other algorithms.

1 Like

This is the log of the channel, which is not very active: about 2,000 sentences in 10 months. Manually cleaning them took me about 4 hours.

Yes, it does need manual cleaning: de-identifying the messages, removing parts that are not public-domain compatible (such as embedded content), and making sure each line is a speakable sentence.
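
If anyone wants to try the same thing, a rough pre-filter can take care of the mechanical part before the manual pass. A minimal Python sketch, assuming a plain-text export with one message per line (the patterns and the length threshold are only illustrative, not exactly what I did):

```python
import re

# Illustrative patterns only; the real pass is still manual and judgement-based.
MENTION = re.compile(r"<@[A-Z0-9]+>|@\w+")   # Slack user mentions (de-identify)
URL = re.compile(r"https?://\S+")            # links and embedded content
EMOJI = re.compile(r":[a-z0-9_+-]+:")        # :emoji: shortcodes

def clean_line(line: str):
    """Return a candidate speakable sentence, or None to drop the line."""
    text = EMOJI.sub("", URL.sub("", MENTION.sub("", line))).strip()
    # Drop empty leftovers and very short fragments; everything that survives
    # still needs a human check for identifying or non-CC0 content.
    if len(text) < 4:
        return None
    return text

with open("channel_log.txt", encoding="utf-8") as src, \
     open("candidate_sentences.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        sentence = clean_line(raw)
        if sentence:
            dst.write(sentence + "\n")
```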

1 Like

Interesting. How do you think the outcome of this effort compares with other ways of collecting sentences manually, like the Sentence Collector events? (Thinking in terms of volume and time invested.)

People just won’t keep submitting sentences to the collector month after month, but they chat every day. So we made it really easy to contribute daily, simply by telling them that “we will collect the sentences from this channel for the public benefit.”

It just works. Participants are happy with it, and we have found a channel for keeping them engaged with Common Voice in their daily lives. (I share monthly stats about corpus collection and voice recording/validation in the chat room.)

It’s just one of the places where I collect sentences. It’s good to have multiple diverse sources.

3 Likes

I like the idea of scraping sentences from Simple English Wikipedia. English is the only language that can draw on two versions of Wikipedia, so we should take advantage of that if we haven’t already. Simple English Wikipedia has around 150,000 articles.
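
Roughly, something like this could pull out candidate sentences (an untested sketch using the public MediaWiki API and its TextExtracts extension; the sentence splitting and length filter are deliberately naive):

```python
import requests

API = "https://simple.wikipedia.org/w/api.php"

# Ask the MediaWiki API for plain-text intro extracts of a few random articles.
params = {
    "action": "query",
    "format": "json",
    "generator": "random",
    "grnnamespace": 0,     # article namespace only
    "grnlimit": 5,
    "prop": "extracts",
    "exintro": 1,          # intro section only
    "explaintext": 1,      # strip wiki markup
    "exlimit": 5,
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
pages = resp.json().get("query", {}).get("pages", {})

for page in pages.values():
    # Very naive sentence splitting; a real pipeline would do better.
    for sentence in page.get("extract", "").split(". "):
        sentence = sentence.strip()
        if 20 <= len(sentence) <= 120:   # arbitrary length filter
            print(sentence)
```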

4 Likes

I’ve had another idea. How about scraping Wiktionary’s usage examples? That would yield a bunch of sentences that are already agreed to be good examples of things people say.

https://en.wiktionary.org/wiki/Wiktionary:Example_sentences
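
Something like this could pull them out (just a sketch: it fetches an entry’s wikitext through the MediaWiki API and keeps the “#:” usage-example lines, with only very simple handling of the {{ux}} templates many entries use):

```python
import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def usage_examples(word: str) -> list:
    """Pull '#:' usage-example lines from a Wiktionary entry's wikitext."""
    params = {"action": "parse", "page": word, "prop": "wikitext", "format": "json"}
    data = requests.get(API, params=params, timeout=30).json()
    wikitext = data["parse"]["wikitext"]["*"]

    examples = []
    for line in wikitext.splitlines():
        if not line.startswith("#:"):
            continue
        body = line[2:].strip()
        # Many entries wrap the example in a {{ux|en|...}} or {{uxi|en|...}} template.
        m = re.match(r"\{\{ux[a-z]*\|en\|([^|}]+)", body)
        text = m.group(1) if m else re.sub(r"\{\{.*?\}\}", "", body)
        text = re.sub(r"'''|''|\[\[|\]\]", "", text).strip()  # drop basic wiki markup
        if text:
            examples.append(text)
    return examples

print(usage_examples("serendipity"))
```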

2 Likes

Let’s put together a list of the sources we are sharing here, together with their licenses.

  • Ideally we need public domain (CC-0) content.
  • If the license is CC-BY (or another more permissive one), we’ll need to check with legal whether we can follow a process similar to the one used for Wikipedia (the analysis needs to be done on a case-by-case basis).
  • If the license is different, I would avoid considering the source.

I’ve had another thought about the original topic of this thread.

Some sentences are more difficult to read than others, and some people have better reading comprehension than others. It ought to be possible to infer both of these by looking at how often a sentence is skipped, and how likely a user is to skip a sentence compared to other users.

Start by giving the user a low comprehension level and feed them sentences that match that level, plus some variance and unknowns. As their comprehension level increases, they get more difficult sentences.

This way, a new user with a poor reading level won’t be immediately put off, and their voice and accent will be trained on sentences that they’re more likely to utter.
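
To make the idea concrete, here is a rough Elo-style sketch of how both estimates could be inferred from skip events alone (all names and constants are invented for illustration; nothing like this exists in the platform today):

```python
import random

K = 0.05  # illustrative learning rate, not a tuned value

class Rating:
    """A score in [0, 1]: comprehension level for users, difficulty for sentences."""
    def __init__(self, value: float = 0.5):
        self.value = value

def expected_skip(user_level: float, difficulty: float) -> float:
    """Rough probability that this user skips this sentence."""
    return min(1.0, max(0.0, 0.5 + difficulty - user_level))

def record_outcome(user: Rating, sentence: Rating, skipped: bool) -> None:
    """Nudge both estimates toward whatever the observed skip implies."""
    surprise = (1.0 if skipped else 0.0) - expected_skip(user.value, sentence.value)
    user.value -= K * surprise      # skipping more than expected -> lower comprehension
    sentence.value += K * surprise  # being skipped more than expected -> harder sentence

def pick_sentence(user: Rating, sentences: dict) -> str:
    """Serve sentences near the user's level, with occasional unknowns mixed in."""
    if random.random() < 0.1:       # exploration: probe sentences we know little about
        return random.choice(list(sentences))
    return min(sentences, key=lambda s: abs(sentences[s].value - user.value))
```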

Re: licensing, from the above post, I guess the collector can set the terms on anything that’s explicitly opt-in, so IRC, WhatsApp, Facebook, Twitter or mailing list posts won’t be a problem. Wiktionary and Simple English Wikipedia are a different issue, though.

2 Likes

I don’t see a problem with using uncommon sentences when the corpus is used to train the acoustic model for DeepSpeech. Since the model doesn’t learn words but how letters sound, it will be generic anyway, regardless of which words it learned them from.

However, if the text were used for a language model, it would be different.

Or am I completely off here?

Currently it’s only collecting the accents of people who have excellent reading comprehension. Anyone who actually uses STT because they are crap at reading isn’t having their voice collected.

This is a huge barrier to collecting the voices of children and the strong accents of the working class; they just won’t record sentences that they can’t read. No hillbillies, nobody from the ghetto, no underclass. Just a bunch of largely university-educated, middle-class people, with the only diversity provided by middle-class ESL students.

So my complaint was that it’s not using common people at all. I don’t know how that will translate into recognition performance, but I doubt we’ll get feedback from those people anyway. Maybe it’ll be easier to get feedback from our children?

1 Like

Children’s voices are currently a legal limitation. As for the rest, I agree; these are some of the open questions we should evaluate when checking the quality of our dataset, as well as how models trained on it perform in the real world.

/cc @rosana because of the insightful comments about quality and diversity.

1 Like

What’s the legal issue with using children’s voices? Is it that you need permission from a third party (a legal guardian)? Would it not be possible to get schools on board and have them use a separate site to submit clips? Or could the recognition models be composed, so that the actual voice data isn’t shared, only the histograms that form the training data? That would remove GDPR/privacy liabilities, if not copyright ones.

In many countries the data of minors is subject to more legal restrictions than that of adults.

By the way, I have an ongoing project to clean up the English wiki sentences, focusing mainly on removing difficult-to-pronounce foreign and scientific terms. Although I’ve really only just scratched the surface, it is slowly improving things.

2 Likes

See this topic for reference

1 Like

Here’s a truly idealistic proposal for collecting Common Voice sentences easily.
Think about sharing your Alexa voice history.
Ask Amazon, through change.org, to add a button saying “I want to donate my voice history to CommonVoice”, or at least “download the archive of my voice history”.
Right now you can only delete or play back the history.
Would it be possible to build a browser add-on that can download the voice data from an Alexa account?

1 Like

Our current data is what’s called a “read speech corpus”, because each sentence is read from a prompt. It can also be very useful to have a “spontaneous speech corpus”. In this case, the speech is produced spontaneously, and the transcription is created later on by listening to it.

I have long thought that the creation of a spontaneous corpus would also be an ideal application for crowd-sourcing; I’m not sure whether it could one day be included in the scope of this project. You would initially contribute a recording of your voice, and then other users would verify and transcribe it. Your Alexa voice history might be good for that, or simply recording your side of (phone) conversations, etc.

4 Likes

You can get lots of casual English sentences from Fandom wikis, which use CC BY-SA. They’re mostly on media/entertainment topics.
https://www.fandom.com/licensing

Hey @twinfrosty, welcome. In Common Voice, only CC-0 sentences are allowed.

But a new Spontaneous Speech dataset creation application is under development, where people give spontaneous answers to prompts and then transcribe them.