[Help Wanted] Write some nice, short sentences for people to read

(Michael Henretty) #1

Due to multiple bugs in our current sentence collection, we are trying to diversify and refresh our sentences. We would love your help! If you would like to contribute to Common Voice sentences with your own writings, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this thread with a link to those sentences.


  • please write everything yourself. don’t copy and paste from somewhere else.
  • you must agree to releasing your sentences to the public domain with a cc-0 license.
  • more than 50 sentences per link, but less than 500 please.
  • be nice, don’t use offensive language. we aren’t collecting that kinda material.
  • i’ll be reading each one, and i may remove some but i will let you know why.

Thanks for your help!

We want your feedback: Improving the sentence collection
(Michael Henretty) #2

(Fred Trotter) #3

Hello. Thank you for an amazingly simple implementation of a wonderful idea. Rather than randomly including writing, you might consider using some already qualified public domain resources. There is a list of such resources here: https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

Of course, while you could consider any resource from that page, I would ask that you specifically consider the inclusion of healthcare terms that include simple medical terminology. Not the kind of things that doctors say about their healthcare (that is too technical and specific) but the types of things that patients might like to read and/or discuss in everyday terms.

An amazing resource for this that is written in lay terms is the Medline plus website. Not everything on Medline is public domain, but they specify what is covered and what is not here:

Note that the Medline encyclopedia is licensed content and is therefore not public domain. But the Health Topics are public domain:
As are the FAQ answers
And the medline plus magazine

Obviously, as a healthcare data journalist I have an ax to grind here, but there are a huge number of English sentences here that are not medically contextual. For instance, the sentence “The people who write the materials are the ones who decide if they are easy to read.” is found on one of the FAQ pages. Moreover, while the terms in Medline are intended to be “layman’s terms”, they include words like “Alzheimer”, which are common enough but will likely have huge pronunciation differences.

I should note that the sections on women’s health topics in Medline are likely to include more sentences including female pronouns.

Given that you are interested in resources that are not medical, I would also suggest the Federal Register, which is also without copyright.
example text:

It should be relatively simple to run a script which removes all sentences that include the gobbledygook internal reference system, and also those with acronyms (NIST, NASA, etc.). Once that is done, this would be a huge corpus of relatively simple English sentences. If you wanted to ensure that the sentences were even more “common language-full”, you might simply exclude everything except the contents of the executive summaries of the articles, which are intended to be relatively jargon-free.
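A filtering script along those lines might look like the minimal sketch below. The patterns are assumptions for illustration, not a description of how the Federal Register actually formats its citations: an acronym is taken to be any run of two or more capital letters, and an internal reference is taken to be a section sign, a "42 CFR"-style citation, or a paragraph marker such as "(a)(1)". The sentence splitter is deliberately naive and would need replacing for real text.

```python
import re

# Assumption: an acronym is any standalone run of 2+ capital letters (NIST, NASA).
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")

# Assumption: internal references look like "§", "42 CFR", "82 FR", or "(a)(1)".
REFERENCE = re.compile(r"§|\b\d+\s+(?:CFR|FR)\b|\(\w\)\(\d+\)")


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keep_plain_sentences(text):
    """Return only the sentences with no acronym and no internal reference."""
    return [
        s
        for s in split_sentences(text)
        if not ACRONYM.search(s) and not REFERENCE.search(s)
    ]


if __name__ == "__main__":
    sample = (
        "The agency is proposing a new rule. "
        "See 42 CFR 413.65 for details. "
        "NASA provided comments on the draft. "
        "This rule helps small businesses save money."
    )
    for sentence in keep_plain_sentences(sample):
        print(sentence)
```

Dropping whole sentences, rather than trying to strip the references out of them, keeps the remaining text grammatical at the cost of a smaller corpus.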

If that is still not enough, you should consider including the text of comments made to various regulations on regulations.gov. Most people are unaware that the comments that they make on regulations themselves become public domain. See here: https://www.regulations.gov/userNotice

This data is available via an API, and here is an example:


(Fabian Letsch) #4

Hey, I made 60 sentences, I hope you can use them.


I have not licensed them but just use them as if you made them yourself. :slight_smile:

Edit: I have added new lines for better readability, I hope that doesn’t hurt.


Would be happy if these 50 sentences will be of any help to you:


Reading over the lists, I’m wondering if we need to vary our sentence structure. There are a lot of SUBJECT>VERB>COMPLEMENT sentences. I would also love to hear more thoughts about including slang and popular terms like ‘selfies’. Also, notably for terms like ‘homosexual’, most style guides recommend substituting ‘gay’ or ‘lesbian’ instead. I would guess that slang would be fine to include if it appears in a standard dictionary. Lastly, maybe you can suggest a character count range that would be good for the sentences. (Or I will try to figure one out from what we have so far. I noticed mine were longish.)

(Kevin Németh) #7

I agree to publish these sentences under the cc-0 license.

Thanks for creating this wonderful project, I hope it turns out to be great!

(Michael Henretty) #8

Alright I’ve added your sentences Anoian, Sonny, and Kusaha! Thank you for your help!

(Michael Henretty) #9

Yeah, we definitely want slang, as well as terms like gay, lesbian, and homosexual. More than just style, we need to make sure we cover all the words we can. In terms of sentence structure, I agree we need more diversity there as well.

For sentence length, I think around 20 syllables or fewer is a reasonable length.
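For anyone who wants to check their own list against that rough limit, here is a minimal sketch. It assumes English text and estimates syllables by counting groups of vowels, which is only a heuristic and will miscount some words. The function names and the silent-"e" rule are illustrative choices, not anything the project uses.

```python
import re

MAX_SYLLABLES = 20  # the rough limit suggested above


def estimate_syllables(word):
    """Estimate syllables by counting vowel groups (heuristic, English only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def sentence_syllables(sentence):
    """Sum the estimated syllables of every whitespace-separated word."""
    return sum(estimate_syllables(w) for w in sentence.split())


def is_short_enough(sentence, limit=MAX_SYLLABLES):
    """True if the estimated syllable count is within the limit."""
    return sentence_syllables(sentence) <= limit


if __name__ == "__main__":
    for line in ["The cat sat on the mat.", "banana " * 10]:
        print(sentence_syllables(line), is_short_enough(line))
```

Because the count is an estimate, it is probably best used to flag sentences for a second look rather than to reject them automatically.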


50 sentences:

(Michael Henretty) #11

Thank you!


I did some sentences. I’m not sure if they’re the kind of thing you need or if you’re still needing contributions, but I hope they’re of some use. I agree to releasing them to the public domain with a cc-0 licence


(Jeff Anderson) #13

Honestly, I find this a little troubling. A true language model would include all words, offensive or otherwise. As someone who is depending on voice recognition heavily to get work at my normal job done, and interact with friends and family, I find it very irksome that some words are not recognized as well. I feel censored. How would you feel if your keyboard refused to produce a word that you typed?

There certainly should be room for having offensive words in this language model set.


Wrote some sentences here and tried to include more slang and female pronouns: https://pastebin.com/94aLhZxT


Here are some short common sentences. They each reside in multiple books on Project Gutenberg from a random collection of works I downloaded from different authors. https://pastebin.com/XRpZgdbw


Would also help for sentiment analysis.

(Tammi L. Coles) #17

Hi, @mhenretty

Because of @fred_trotter 's comment, I created some health-related sentences. https://pastebin.com/6Y797eNp. Let me know if you need more contributions.

(Mark) #18

I’ve created some sentences here - maybe not simple enough? Let me know, I’d be happy to supply more


(Kieran Drew) #19

Hopefully this helps.


I created an extra thread to discuss this topic:

Submitting text to be voiced and collecting submitted audio data?
Prompt design