Extending our sentence collection capabilities

Hello everyone,

One of the most important components of building a strong voice dataset is always being able to provide people with enough sentences to read in their language. Without this, voice collection is not possible, and as a team we have been putting a lot of work into sentence collection since the launch of Common Voice.

Some background

To make the Common Voice dataset as useful as possible, we have decided to only allow source text that is available under a Creative Commons Zero (CC0) license. Using the CC0 standard means it’s more difficult to find and collect source text, but it allows anyone to use the resulting voice data without usage restrictions or authorization from Mozilla. Ultimately, we want to make the multi-language dataset as useful as possible to everyone, including researchers, universities, startups, governments, social purpose organizations, and hobbyists.

In the early days, sentence collection was an immature process: we accepted sentences through different channels (email, GitHub, Discourse…), which led to a heavy workload for staff, inconsistent quality checks, and additional work to clean up sentences that were not useful for the Deep Speech algorithm.

As a result, late last year we put in place a tool to centralize sentence collection and review, automating a lot of the quality checks and establishing a workflow that ensures peer review. We have already collected 424K sentences in 45 languages, of which 79K have already been validated and incorporated into the main site.

We have been hearing a lot of your feedback over the last few months and we acknowledge the limitations and difficulties this process has for some of you. We will keep working on improving the experience.

New approaches for a big challenge

Over the past months, we have also been exploring alternatives to collect the volume of sentences needed to cover the voice collection demand. If we want to reach an initial milestone of 2,000 validated hours, what we call a “minimum viable dataset”*, the math tells us that we’ll need at least 1.8M unique sentences (4s each on avg.) per language if we don’t want more than one recording of each sentence**. This is needed for Deep Speech model quality, as we explained in our H1 roadmap. Even with all the amazing community contributions we’ve seen, time beats us here.
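For anyone who wants to check the figure, here is a rough back-of-the-envelope calculation (my own illustration, assuming roughly 4 seconds of audio per recorded sentence and at most one recording per sentence):

```python
# Back-of-the-envelope check of the 1.8M-sentence figure,
# assuming ~4 seconds of audio per sentence and at most
# one recording per sentence.
target_hours = 2_000                   # "minimum viable dataset" milestone
avg_clip_seconds = 4                   # assumed average clip length

target_seconds = target_hours * 3600   # 7,200,000 seconds of validated speech
sentences_needed = target_seconds / avg_clip_seconds

print(f"{sentences_needed:,.0f} unique sentences")   # -> 1,800,000
```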

That’s why we’ve been looking into other big sources of sentences out there, and I’m happy to announce that our investigations and legal counsel provided us with a legitimate way to tap into one of the biggest and most important sources of information of our time. We are able to use sentences from Wikipedia as long as we don’t extract more than 3 random sentences per article. In this way we can use these sentences in the project under fair-use copyright provisions. In case you’re wondering, we’ve let Wikimedia know about this.
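To make the rule concrete, here is a minimal sketch of what “at most 3 random sentences per article” could look like; the function and variable names are illustrative only, not the actual extraction tool:

```python
import random

def sample_sentences(article_sentences, max_per_article=3):
    """Pick at most `max_per_article` random sentences from one article.

    Illustrative sketch of the "max. 3 random sentences per article"
    rule described above, not the real extraction script.
    """
    k = min(max_per_article, len(article_sentences))
    return random.sample(article_sentences, k)

# Hypothetical usage: `articles` would hold each article already split
# into candidate sentences by the per-language rules.
# extracted = [s for article in articles for s in sample_sentences(article)]
```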


We know Wikipedia won’t be able to cover our needs for all languages, so we want to expand our investigation and provide communities with the resources to do the same with other sources. Additionally, we will keep supporting the community-driven sentence collection approach as an important complementary way to submit, review and import sentences from smaller sources or manual submissions.


This new approach unlocks a potentially huge source of already reviewed sentences, and over the past weeks we have been working on a way to automate this work for Mandarin Chinese (in Simplified characters) and English, leveraging our previous work on validation rules done in the Sentence Collector through community feedback. We have been generating per-language rules to make sure the extraction has an error rate below 5-10% and that sentences are complete and readable. We are also working on ways for contributors to flag any issue with sentences displayed on the site, and we will provide more details on quality control in upcoming communications.
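To give a flavour of what “per-language rules” means in practice, here is a very rough English-only sketch; the thresholds and checks below are assumptions for illustration, and the real rules differ per language:

```python
import re

# Illustrative sketch of per-language validation rules; the actual
# thresholds and checks used by the extraction work may differ.
MAX_WORDS = 14                         # keep sentences short enough to read in ~4 s
SENTENCE_END = (".", "!", "?")

def looks_readable_en(sentence: str) -> bool:
    """Reject sentences that are obviously incomplete or hard to read aloud."""
    words = sentence.split()
    if not (3 <= len(words) <= MAX_WORDS):
        return False                   # too short or too long to record comfortably
    if re.search(r"\d", sentence):
        return False                   # digits are ambiguous to read out
    if not sentence.endswith(SENTENCE_END):
        return False                   # likely a truncated fragment
    if not sentence[0].isupper():
        return False                   # probably not a complete sentence
    return True
```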

Our plan for the future is to work with communities to help create more per-language rules, so we can extract in one pass the number of sentences needed for voice collection. We also plan to extend the extraction script to support per-source rules, allowing us to plug in other big sources of text where we can legally do the same.

Please keep an eye on this channel; we will be engaging with communities in the coming weeks with more information.

Cheers.

Ruben, on behalf of the Common Voice Team


*The amount of data required to train an algorithm to acceptable quality, so that it can be considered trained and minimally functioning. In the case of DeepSpeech, the minimum viable dataset is 2K hours of validated speech with its accompanying text. To train a production speech-to-text system, we estimate the upper bound of hours required is 10K.

**Noting that English has a head start thanks to other data sources, but this is the goal for the Common Voice data, which has the unique characteristic of a lot of speaker diversity.


Update (July 24th 2019): The wiki extractor tool is now ready for technical testing.


That’s pretty crazy. Massive project. o.0

I would like to provide a quick update here.

Since our dev efforts are currently focused on making sure the Mandarin extraction is high quality (both for vendor and community use), we haven’t been able to invest in improving the script to support additional languages.

I know having way more sentences is a P1 for every language, and that’s why I’ve been supporting in parallel some work to, at least, make sure we document the script so we can share it with the community devs to play with it as soon as possible. I’m working with @fiji and @mkohler on this.

Thanks for your understanding.

I think a possible place to collect sentences would be translations from the Bible and Quran.
Jehovah’s Witnesses are also a good resource; they have translated materials in 976 languages.

I haven’t contacted anyone yet, but that would be my next step.

On another note,

1.8M unique sentences (4s each on avg.) per language
I’m a little confused because I thought we only needed 5000 sentences, could you clarify this?

5K is the minimum to start collecting voice, but I understand that can be confusing from the UI with no context. @mbranson I don’t know if there is a better way to signal the difference between the minimum to start and the minimum for the 2K hours.
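Just to illustrate the gap between the two numbers (my own rough figures, using the same ~4-second-per-clip assumption as the announcement):

```python
# Rough comparison of "minimum to launch" vs. the 2,000-hour goal,
# assuming ~4 seconds of audio per recorded sentence.
launch_minimum_sentences = 5_000
hours_if_each_recorded_once = launch_minimum_sentences * 4 / 3600

print(f"{hours_if_each_recorded_once:.1f} hours")   # ~5.6 hours, versus the 2,000-hour goal
```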

@nukeador thanks for the flag. It is indeed difficult to convey all of the details within the UI of the current language page. This is something we should revisit, though I can’t promise a timeline. A short term recommendation would be to point the UI to a FAQ entry with further info regarding the language launch process and requirements. Longer term would be to explore how we bring this info to the site more clearly. How are you currently logging feedback like this? This is something we should put into our prioritization queue. cc @r_LsdZVv67VKuK6fuHZ_tFpg


Good thinking. I would think that the Koran would be good for CSA (Classical Standard Arabic) and also as a set of additional pronunciations for each of the Arabic dialects. For example, a given word for ar-eg (Egyptian Arabic) should have both the Egyptian Arabic pronunciation (often including, for example, an /o/ phoneme) and the “classic” pronunciation with a /u/ phoneme… and this on top of the various allophones.

The American English equivalent would involve not only “locator” with aspirated /t/, non-aspirated /t/, and flapped /t/, but also variations of “locator”, making a total of six acceptable pronunciations; and the /oU/ phoneme in the second set may be realized as [o], making nine pronunciations total!

As for the “Orthodox” Bible of Christianity (i.e., not Gnostic, Syriac or other) in English, the King James is kind of an old “gold standard.” Mis-translations aside, it can be good, and it is certainly free of copyright and has lots of names, from Abraham to Luke.

However, a lot of words will be skipped by most Anglophones, even Christians (Canaan, Elohim), and shibboleths and guessed or spelling pronunciations abound today (Haggai as haggy-eye, Bethlehem as beth-lee-um).

Additionally, it is barely Early Modern English. It took it time (for “its”). Ye know the answer (you, plural, know the answer). Thou know the answer (You, singular, know the answer). I told thee the answer (I told you, singular, the answer).

My dog. Mine apple. Mine historical novel.
Thy dog. Thine apple. Thine historical novel.

However, there are huge old texts that are free and public domain, fiction and non-fiction.

Karl Marx’s Capital in three volumes. It is harder in concepts and in odd, German-style phrasing and long sentences. Except for the occasional quips and idioms from Latin or French, most words are easily understood.

The works of Karl Marx are easily found online in pure text form for download.

As I commented in other topics, using old books for sentences is not a good idea. Their language doesn’t capture the day-to-day expressions we use in our lives and introduces expressions that are not easy to read for the average person.

The more modern and conversational, the better for our training models.

Ideally, yes, we want things more conversational, but the sentences currently available are often broken English, malapropisms, eggcorns or nonsense: “It is been fifty years” instead of “It has been fifty years.” I hear voice donors stumble or pause over these.

William Labov’s work, corpus and recordings of American accents would be great and conversational, but they are under copyright and the material is very $$$.

Working with what we do have, books may not be that bad, and there are unique benefits. In my case, speech to text would help in essay writing. If the system only knows “none of these are good” but cannot understand “none of the above is adequate for our purposes,” we cannot use it for dictation in an essay.

As it is, there are already many scientific sentences that are skewed strongly toward specific interests: biology, chemistry, Japan, anime and video games.

Wouldn’t book English at least be better than the instruction manual English currently in use?

Maybe; difficult to know without seeing the proposal. I would say yes, as long as it contains modern English.

Hi there, I definitely understand the preference for modern & conversational sentences, as well as CC0 sources.

Is Project Gutenberg in the conversation? They do lean towards older written works: “You will find the world’s great literature here, with focus on older works for which U.S. copyright has expired.” When I think of the project, I think of old fiction: HG Wells, Jules Verne, CS Lewis. Certainly not the most modern sources (for which I blame our monstrous system of copyright…), but certainly a large dataset (“over 59,000 free eBooks”) and public domain-friendly.

Even if the entire dataset is not useful, perhaps a few dozen or few hundred hand-picked sources would be valuable (e.g. the “most modern” ones).

edit: sorry, didn’t see your comment about “book English.”

Hi @nate, welcome to the Common Voice community!

Yes, in fact I know some communities have been extracting sentences from books there for a long time. The next step for the work we are doing today with Wikipedia is to find other sources like this one where we can automate the work.
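For illustration only, here is a rough sketch of what automating extraction from a Project Gutenberg plain-text file could look like (not the actual script; the markers and thresholds are assumptions):

```python
import re

def candidate_sentences(path: str):
    """Hypothetical sketch: yield candidate sentences from a Gutenberg text file."""
    text = open(path, encoding="utf-8").read()

    # Trim the Project Gutenberg header/footer if the usual markers are present.
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        text = text[text.index("\n", start) + 1 : end]

    text = re.sub(r"\s+", " ", text)

    # Naive sentence split; real per-language rules would need to be smarter.
    for sentence in re.split(r"(?<=[.!?]) ", text):
        sentence = sentence.strip()
        if 3 <= len(sentence.split()) <= 14:
            yield sentence
```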

Cheers.


Hi @nate, here’s a list of the Gutenberg books I’ve added so far:

Alice’s Adventures in Wonderland by Lewis Carroll
Through the Looking Glass by Lewis Carroll
The Thirty-Nine Steps by John Buchan
[Titles of some Christmas carols]
Pride and Prejudice by Jane Austen
Emma by Jane Austen
The Story of Doctor Dolittle by Hugh Lofting
The Pickwick Papers by Charles Dickens
Bleak House by Charles Dickens
Wuthering Heights by Emily Bronte
The Wonderful Wizard of Oz by L Frank Baum
Right Ho, Jeeves by PG Wodehouse
At the Villa Rose by AEW Mason
Sons and Lovers by DH Lawrence
Desiderata [poem] by Max Ehrmann
The Tale of Peter Rabbit by Beatrix Potter
A Collection of Beatrix Potter Stories [Gutenberg]
The Chronicles of Clovis by Saki
The Picture of Dorian Gray by Oscar Wilde
Cranford by Elizabeth Gaskell
The Beautiful and Damned by F Scott Fitzgerald
Little Women by Louisa May Alcott
The Invisible Man by HG Wells
When the Sleeper Wakes by HG Wells
Huntingtower by John Buchan
Mike and Psmith by PG Wodehouse
Eight Cousins by Louisa May Alcott

There are many more available! If you have ideas for other books that you’d like me to upload, please say. The most useful ones are not too old, and include lots of informal dialogue.

I want to bring this to your attention; I know some of you have been waiting for a long time. It took longer than we expected :stuck_out_tongue:

As a non-American contributor to Common Voice and Wikipedia, this project worries me.

Has the use of this tool been properly discussed? As fair use does not exist in most parts of the world, the resulting dataset would not be public domain in a lot of countries, for example in my home country the Netherlands. I actually want to use the resulting dataset as public domain. I don’t want to worry that the dataset gets contaminated with sources that are not public domain. Because of this project, I cannot do this anymore in my home country.

Besides that, American fair use law only permits very limited use of copyrighted material. I think copying 3 whole sentences from each and every article is way too much.

Has this project been cleared and analyzed by international copyright law experts? Where can I read the details? I originally posted this issue on GitHub, but they forwarded me to this place.

Hello Robin,
Copyright is complicated in Europe, that’s true. But I think that this will very likely work for these reasons:

  • The legal department of Mozilla has looked over it and said it would be okay
  • The extraction contains less than 1% of Wikipedia. We don’t have fair use here (which we should), but we have quotation and sampling rights almost everywhere, afaik.
  • We can assume that most sentences come from different authors. So it is not like stealing millions of sentences at once, but rather using 1-10 sentences from a lot of people who have published their sentences under an already pretty liberal license.

I think there is little potential for legal problems. Which scenario are you concerned about, concretely? Who could sue Common Voice, and what would be the worst consequences? I believe the worst consequence could be that we have to mention that a sentence comes from Wikipedia when we show it in the frontend.

Hi,

As I commented on GitHub, the extraction process was reviewed and validated with Mozilla’s legal team and also communicated to Wikipedia. Our dataset remains Public Domain worldwide. The process is described in this topic (max. 3 random sentences from each article).

If you have concrete concerns we can add them to a list and consult with our legal team in our next meeting.

Thanks for your feedback!

My main concern is usage outside of the United States. Has the legal team thought about non-American usage? The concept “fair use” does not exist in the Netherlands, my jurisdiction. Is the dataset still public domain in my jurisdiction?


Robin, I’ll raise your specific question again with our legal team at our next meeting. My understanding is that our dataset remains public domain (CC0) worldwide.


Hi again,

@Robin, after consulting again, Mozilla is confident we can offer this as CC0.

If you are not confident in this, we recommend that you contact legal counsel in your country.

Cheers.
