Sentences from public-domain books

MayeulC · December 22, 2018, 1:53pm

I didn’t see this addressed elsewhere, but would it be OK to extract sentences from public-domain books (gutenberg.org has a nice collection)? There is a huge number of sentences out there, and those with underrepresented words could be automatically selected for inclusion in the corpus.
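As a rough illustration of the automatic selection idea, here is a minimal sketch, not anything Common Voice actually ships. The function names, the simple regex tokenizer (ASCII letters only), and the `max_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer that only handles lowercase ASCII words.
WORD_RE = re.compile(r"[a-z']+")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return WORD_RE.findall(text.lower())

def select_for_rare_words(candidates, corpus_sentences, max_count=0):
    """Keep candidate sentences that contain at least one word seen
    no more than `max_count` times in the existing corpus.

    `max_count` is a hypothetical threshold; 0 means "not seen at all".
    """
    counts = Counter(w for s in corpus_sentences for w in tokenize(s))
    selected = []
    for sentence in candidates:
        words = tokenize(sentence)
        if any(counts[w] <= max_count for w in words):
            selected.append(sentence)
            # Count the new sentence so later candidates are judged
            # against the updated corpus.
            counts.update(words)
    return selected
```

For example, with a corpus of `["the cat sat", "the cat ran", "the dog sat"]`, the candidate "the zebra ran" would be kept because "zebra" is new, while a later "the zebra sat" would be skipped since every word in it has been seen by then.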

nukeador · December 26, 2018, 11:21am

If they are public domain (CC0), yes. Just note that some are very old books whose tone and wording might sound unnatural in the modern spoken language, and we want voice donation to be an engaging experience.

A good approach would probably be a script that randomly extracts a selection of short sentences from books, followed by a quick manual pass to check how easy they are for a modern reader to read.

Cheers.
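A script like the one suggested could look roughly like this. It is only a sketch under a few assumptions: the book is already loaded as plain text, sentences are split naively on `.`, `!` and `?`, and the word limits (`min_words`, `max_words`) are made-up defaults rather than any official rule. It also does not strip the Project Gutenberg license header and footer, which would need to be done first.

```python
import random
import re

def split_sentences(text):
    """Naive splitter: collapse whitespace, then break after ., ! or ?."""
    text = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def sample_short_sentences(text, k=20, min_words=3, max_words=14, seed=None):
    """Return up to `k` randomly chosen sentences whose word count
    falls between `min_words` and `max_words` (assumed limits)."""
    short = [
        s for s in split_sentences(text)
        if min_words <= len(s.split()) <= max_words
    ]
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(short, min(k, len(short)))
```

The output is meant to go to a person for the quick manual pass, so the script only narrows things down. Abbreviations such as "Mr." will trip up the naive splitter, which is another reason to review the sample by hand.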
