[Help Wanted] Write some nice, short sentences for people to read

mhenretty · July 23, 2017, 6:11pm

Due to multiple … bugs in our current sentence collection, we are trying to diversify and refresh our sentences. We would love your help! If you would like to contribute to Common Voice sentences with your own writing, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this thread with a link to those sentences.

Criteria:

please write everything yourself. don’t copy and paste from somewhere else.
you must agree to releasing your sentences to the public domain with a cc-0 license.
more than 50 sentences per link, but less than 500 please.
be nice, don’t use offensive language. we aren’t collecting that kinda material.
i’ll be reading each one, and i may remove some but i will let you know why.

Thanks for your help!

fred_trotter · July 24, 2017, 9:47am

Hello. Thank you for an amazingly simple implementation of a wonderful idea. Rather than randomly including writing, you might consider using some already qualified public domain resources. There is a list of such resources here: https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

Of course, while you could consider any resource from that page, I would ask that you specifically consider the inclusion of healthcare terms that include simple medical terminology. Not the kind of things that doctors say about their healthcare (that is too technical and specific) but the types of things that patients might like to read and/or discuss in everyday terms.

An amazing resource for this that is written in lay terms is the MedlinePlus website. Not everything on Medline is public domain, but they specify what is covered and what is not here:
https://medlineplus.gov/copyright.html

Note that the Medline encyclopedia is licensed content and is therefore not public domain. But the Health Topics are public domain:

As are the FAQ answers
https://medlineplus.gov/faq/disease.html
And the medline plus magazine

Obviously, as a healthcare data journalist I have an ax to grind here, but there are a huge number of English sentences here that are not medically contextual. For instance, the sentence “The people who write the materials are the ones who decide if they are easy to read.” is found on one of the FAQ pages. Moreover, while the terms in Medline are intended to be “layman’s terms”, they include words like “Alzheimer”, which are common enough words but will likely have huge pronunciation differences.

I should note that the sections on women’s health topics in Medline are likely to include more sentences including female pronouns.

Given that you are interested in resources that are not medical, I would also suggest the Federal Register, which is also without copyright.
example text:
https://www.gpo.gov/fdsys/pkg/FR-2015-01-02/html/2014-30754.htm

It should be relatively simple to run a script which removes all sentences that include the gobbledygook internal reference system and also acronyms (NIST, NASA, etc.). Once that is done, this would be a huge corpus of sentences that should be composed of relatively simple English sentences. If you wanted to ensure that the sentences were even more “common language-full” you might simply exclude everything except the contents of the executive summaries of the articles, which are intended to be relatively jargon-free.
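As a rough sketch of what such a filtering script might look like (in Python), assuming the text has already been split into sentences: the regular expressions below are my own guesses at what counts as an acronym or an internal reference, and the function names are made up for illustration. They would need tuning against real Federal Register text.

```python
import re

# Assumption: an "acronym" is any standalone run of two or more capital
# letters (NIST, NASA), optionally followed by a plural "s".
ACRONYM = re.compile(r"\b[A-Z]{2,}s?\b")

# Assumption: these patterns only approximate Federal Register style
# internal references; they are illustrative, not exhaustive.
REFERENCE = re.compile(
    r"§"                               # section symbol
    r"|\b\d+\s+(?:CFR|FR|U\.S\.C\.)"   # e.g. "21 CFR", "80 FR"
    r"|\bDoc\.\s*\d"                   # e.g. "FR Doc. 2014-30754"
    r"|\(\w\)\(\w+\)"                  # e.g. "(a)(1)"
)


def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence has no acronym and no reference pattern."""
    if ACRONYM.search(sentence):
        return False
    if REFERENCE.search(sentence):
        return False
    return True


def filter_sentences(sentences):
    """Keep only the non-empty sentences that pass keep_sentence, in order."""
    return [s.strip() for s in sentences if s.strip() and keep_sentence(s)]


if __name__ == "__main__":
    sample = [
        "The people who write the materials are the ones who decide if they are easy to read.",
        "NASA published the rule last week.",
        "See 21 CFR part 11 for details.",
    ]
    for line in filter_sentences(sample):
        print(line)
```

This deliberately errs on the side of throwing sentences away: anything with a run of capitals (including harmless ones like "OK") is dropped, which seems acceptable when the source corpus is this large.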

If that is still not enough, you should consider including the text of comments made to various regulations on regulations.gov. Most people are unaware that the comments that they make on regulations themselves become public domain. See here: https://www.regulations.gov/userNotice

This data is available via an API, and here is an example:
https://www.regulations.gov/document?D=VA-2016-VHA-0011-184061

HTH,
-FT

Anoian · July 25, 2017, 7:00pm

Hey, i made 60 sentences, i hope you can use them.

https://pastebin.com/1hwCTKq6

I have not licensed them but just use them as if you made them yourself.

Edit: I have added new lines for better readability, I hope that doesn’t hurt.

sonny · July 25, 2017, 8:26pm

Would be happy if these 50 sentences will be of any help to you:
https://pastebin.com/HDv7uyeN

A_Wyman · July 25, 2017, 9:08pm

Reading over the lists, I am wondering if we need to vary our sentence structure. There are a lot of SUBJECT>VERB>COMPLEMENT sentences. I would also love to hear more thoughts about the inclusion of slang and popular terms like ‘selfies’. Plus, notably, for terms like ‘homosexual’, most style guides recommend substituting ‘gay’ or ‘lesbian’ instead. I would guess that if slang appears in a standard dictionary it would be fine to include. Lastly, maybe you can suggest a character count range that would be good for the sentences. (Or I will try and figure one out from what we have so far. I noticed mine were longish.)

Kusaha · July 25, 2017, 10:23pm

I agree to publish these sentences under the cc-0 license.
https://pastebin.com/jH7edz3m

Thanks for creating this wonderful project, I hope it turns out to be great!

mhenretty · July 28, 2017, 9:57am

Alright I’ve added your sentences Anoian, Sonny, and Kusaha! Thank you for your help!

mhenretty · July 28, 2017, 10:01am

Yeah, we definitely want slang, as well as terms like gay, lesbian, and homosexual. More than just style, we need to make sure we cover all the words we can. In terms of sentence structure, I agree we need more diversity there as well.

For sentence length, I think around 20 syllables or less is a reasonable length.
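For anyone who wants to check their own list before posting, here is a minimal Python sketch of a length check. It is not an official tool: the syllable counter is a crude vowel-group heuristic that will miscount some English words, and the function names and the hard cutoff of 20 are assumptions made for illustration.

```python
import re

MAX_SYLLABLES = 20  # assumption: treat "around 20 syllables" as a hard cutoff


def estimate_syllables(word: str) -> int:
    """Roughly estimate syllables by counting groups of consecutive vowels."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final "e" after a consonant as silent ("nice"),
    # but keep it for "-le" endings ("simple", "people").
    if (
        count > 1
        and word.endswith("e")
        and not word.endswith("le")
        and word[-2] not in "aeiouy"
    ):
        count -= 1
    return max(count, 1)


def sentence_syllables(sentence: str) -> int:
    """Sum the estimated syllables of each whitespace-separated word."""
    return sum(estimate_syllables(w) for w in sentence.split())


def is_short_enough(sentence: str) -> bool:
    """Return True if the estimate is at or under MAX_SYLLABLES."""
    return sentence_syllables(sentence) <= MAX_SYLLABLES


if __name__ == "__main__":
    for line in ["Write some nice, short sentences for people to read"]:
        print(sentence_syllables(line), is_short_enough(line), line)
```

Since the estimate is approximate, sentences that land close to the limit are worth checking by reading them aloud.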

nonthewiser · August 8, 2017, 12:41pm

50 sentences:
https://pastebin.com/XBtm9yUm

mhenretty · August 10, 2017, 12:53pm

Thank you!

Coffeebug · August 10, 2017, 1:46pm

I did some sentences. I’m not sure if they’re the kind of thing you need or if you’re still needing contributions, but I hope they’re of some use. I agree to releasing them to the public domain with a cc-0 licence

https://pastebin.com/xVSmuZB0

programmerq · August 10, 2017, 8:26pm

Honestly, I find this a little troubling. A true language model would include all words, offensive or otherwise. As someone who is depending on voice recognition heavily to get work at my normal job done, and interact with friends and family, I find it very irksome that some words are not recognized as well. I feel censored. How would you feel if your keyboard refused to produce a word that you typed?

There certainly should be room for having offensive words in this language model set.

bavencope · August 13, 2017, 8:41pm

Wrote some sentences here and tried to include more slang and female pronouns: https://pastebin.com/94aLhZxT

pro.gadget · August 16, 2017, 2:17pm

Here are some short common sentences. They each reside in multiple books on Project Gutenberg from a random collection of works I downloaded from different authors. https://pastebin.com/XRpZgdbw

pro.gadget · August 16, 2017, 2:20pm

Would also help for sentiment analysis.

tlcoles · August 24, 2017, 4:02pm

Hi, @mhenretty

Because of @fred_trotter’s comment, I created some health-related sentences: https://pastebin.com/6Y797eNp Let me know if you need more contributions.

mlennox · September 2, 2017, 9:32pm

I’ve created some sentences here - maybe not simple enough? Let me know, I’d be happy to supply more

https://pastebin.com/3wcQEnB8

Kieran_Drew · September 3, 2017, 4:06pm

Hopefully this helps.
https://pastebin.com/1VvGnUiV

jf99 · October 9, 2017, 5:57pm

I created an extra thread to discuss this topic:
