I didn’t see this addressed elsewhere, but would it be OK to extract sentences from public-domain books ( has a nice collection)? There is a huge number of sentences out there, and those with underrepresented words could be automatically selected for inclusion in the corpus.

(Rubén Martín) #2

If they are public domain (CC0), yes. But note that some are very old books whose tone and wording might sound unnatural in modern spoken language, and we want voice donation to be an engaging experience.

A good approach would probably be a script that randomly extracts a selection of short sentences from the books, followed by a quick manual pass to check how easy they are for a modern reader.
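Something along these lines might be a starting point. It is only a rough sketch: the function names, the word-count limits, and the naive punctuation-based sentence splitting are all my own assumptions, not anything the project has settled on, and a real splitter would need to handle abbreviations, quotations, and other languages.

```python
import random
import re


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'.

    Assumption: this is good enough for a first pass; it will mis-split
    on abbreviations such as "Mr." and on quoted dialogue.
    """
    # Collapse line breaks and repeated spaces left over from book formatting.
    text = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def sample_short_sentences(text, min_words=3, max_words=14, k=20, seed=None):
    """Return up to k randomly chosen sentences within the word limits.

    min_words and max_words are hypothetical thresholds; tune them to
    whatever sentence length works for recording.
    """
    candidates = [
        s for s in split_sentences(text)
        if min_words <= len(s.split()) <= max_words
    ]
    # Drop duplicates and sort so a given seed always gives the same sample.
    unique = sorted(set(candidates))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))
```

You would read a book's plain text into a string, call `sample_short_sentences(book_text, k=50)`, and then skim the output by hand to judge whether the wording feels natural today.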