Prompt design

cjbaker · November 22, 2017, 9:50pm

I was wondering how many unique sentence prompt texts are being used, and how the text was chosen. I downloaded the Common Voice source on GitHub and found some prompts in the directory server/data, totaling 10,253 unique prompts. Are these the ones actually being used on the website?

For comparison, the 1000 hour public domain Librispeech corpus (based on audio book data from the Librivox project) has several hundred thousand unique sentences (I think around 300,000).

I’ve heard that Common Voice is supposed to be 10,000 hours, and my understanding is that with modern acoustic models like RNNs, a large diversity of prompts would be required to prevent the model from memorizing the prompts (overfitting).

I’m really looking forward to seeing the results of this project, and I’m especially looking forward to more languages being included!

mhenretty · November 27, 2017, 11:33am

We are constantly looking for more prompts. 10K is about what we have right now indeed, and we need a lot more.

The problem is that we are releasing our data under cc-0, so the sentence source needs to be cc-0 as well. That’s why we are asking community members to help us gather sentences from their own material:

cjbaker · November 27, 2017, 5:16pm

Thanks for your reply. I notice that that thread only discusses prompts written by the contributor. Would public domain texts (from Project Gutenberg for example) also be an acceptable source?

mhenretty · December 4, 2017, 4:59pm

Hi @cjbaker, we are definitely looking for public domain texts sources, but not so much project Gutenberg as literature is actually not very easy to read out loud. Movie scripts, or chat messages are a much better resource.

mikenewman1 · December 11, 2017, 5:28pm

Thanks to everyone involved in this project. This has the potential to be extremely valuable, especially because I am not aware of any other project that is collecting desk-top audio from such a variety of hardware/accents/speakers etc.
However I just started analyzing the data, and I think the lack of variety in the prompts is going to be a real problem. As you mention, there are only about 10k distinct prompts. Some of these occur very frequently (up to 300 times). Just the top 200 most frequent prompts account for about 50k out of the total of 340k utterances.
This is going to cause problems with triphone/pentaphone coverage. You can see this in the following table, comparing this corpus to TED-LIUM2

Corpus             Mozilla     TED-LIUM2
Tokens                3.0M         2.2M
Triphones            14429        27039
Pentaphones         103364       861816

Even though TED-LIUM2 is a little smaller, it has almost twice the number of triphones, and way more pentaphones (which really tells you that the diversity of even the triphones we have is not very good).

I probably haven’t told you anything you don’t already know. I guess I am encouraging everyone to make the collection of more text a top priority.
Thanks again.

cjbaker · December 14, 2017, 2:59pm

That’s interesting to hear about movie scripts and chat messages being easier to read out loud. “Read speech” is often considered to be a different domain from spontaneous speech and conversational speech when doing automatic speech recognition.

I have a proposal for generating texts. I could write a script that generates sets of two (or three?) words from the pronunciation dictionary that have poorly covered triphones/pentaphones, and then users would make up a sentence with those two or three words–sort of like a reverse Mad Libs. If we were to do it in an ad hoc way on the forums here, we would need a way to assign a unique wordlist to each user. In making up these sentences, I wonder if it would be hard to put enough variety into the rest of the sentence (the syntax)? Would any applications get messed up by having such potentially unnatural sentences?

iveskins · December 15, 2017, 2:01pm

Hi Mike Newman. I have been contributing prompts to the project.
I just skimmed a few papers to find out about coarticulation, triphones and wider-than-triphone contexts like Pentaphones.
I would like to know if there is a way to test the diversity of phrases before I submit them to the project.
I assume you have been analysing the Text data? What tools are available to analyse texts to see the possible number of contextual features in them. Also would this be based on some assumed pronunciation models, or speaking style?
presumably a whole range of factors (pace, setting, ect) affect possible co-articulatory interactions of sounds in words as compared to their pronunciation recorded in the dictionary…

mikenewman1 · December 15, 2017, 3:43pm

Hi iveskins (not sure how to pronounce that)

I would not worry too much about the details. If we were doing TTS we would want to try to sweat the triphone or pentaphone coverage, and we might consider selecting individual phrases because they add unseen pentaphones. But for ASR as long as we have a good variety of text we should be OK. The most important rule of collecting training data is that it should look like the data you are testing on (in multiphone content, speaking style, accent, background noise, etc). Let the models figure everything else out.

For computing my statistics above I combined a number of publicly available dictionaries: CMUDICT, plus those that are included with the LibriSpeech and TED-LIUM2 distributions (see http://www.openslr.org/resources.php), and stripped out stress. I mostly ignored issues to do with OOV, multiple prons (pick the first one), cross-word vs. word-internal, etc. I also extracted all the text from the csv files (but I imagine the set of prompts being used is available elsewhere).
A short python script will compute everything very easily. (I could probably provide my script, if you needed it.)

I want to make one point about triphones/pentaphones. Modern ASR systems (except maybe CTC systems) use pentaphones. But even if your system uses triphones, you still want to compute pentaphone coverage, because this gives you a better idea of the variation in the data. Say you have 10,000 training examples of a particular triphone; they might all come from the same word, rather than from different words which happen to share a triphone (and you would definitely prefer this). Pentaphones don’t exactly compute this, but they capture some of it.

I hope this answers your questions.

iveskins · December 16, 2017, 11:48am

Hello, Mike.
Thanks for sharing your expertise and the openslr page. I was glad to find the ‘beep’ alternative to the CMU dictionary. If you had a script to share, it would certainly be interesting to see.
As for the prompts included in the project they are available here. The live prompts seem to change often. The git repository doesn’t match exactly the live server.

– (ajv.skɪnz)

bmilde · January 10, 2018, 7:22pm

I have calculated the statistics on unique sentences here (Common Voice v1 corpus design problems, overlapping train/test/dev sentences). There aren’t that many in the v1 release, around 7000. What’s worse is that the same sentences are used in dev/test so that they overlap nearly 100% with train.

James_Fortune · January 13, 2018, 6:06pm

@bmilde , look at my post, this should greatly help this problem once we add the thousands of lines I donated that are all unique, maybe you could help us filter it to make the voice corpus v2 better in that way? [Help Wanted] Write some nice, short sentences for people to read

bmilde · January 15, 2018, 8:00pm

@James_Fortune Nice! I’ve filtered your sentences automatically using some simple search patterns to exclude some utterances. So that most of the code and other very technical and difficult to pronounce stuff is gone. But it’s a very nice contribution of free to use sentences!

http://speech.tools/cc0-75k-conversations.filtered