Hello,
Some Wikipedia contributors publish their content under the Creative Commons Zero licence (see this page for example and the list of users who display this template on their Wikipedia user pages).
I have started writing a script which automatically retrieves this CC0 content from the French and English Wikipedia. I made sure to retrieve only contributions under the CC0 licence (e.g. I excluded derivative works such as translations, and contributions which are mixed with other contributors’ content).
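To make the filtering rule above concrete, here is a minimal sketch in Python. It is not the actual script: the helper names, the shape of the revision dictionaries, and the list of edit-summary keywords used to spot translations are all assumptions made for illustration. The only real API referenced is the MediaWiki `list=usercontribs` query module, and the sketch only builds its parameters without sending a request.

```python
from typing import Dict, List, Set

# Assumption: translations are often announced in the edit summary.
# This keyword list is illustrative, not exhaustive.
TRANSLATION_HINTS = ("translat", "traduction", "traduit")


def build_usercontribs_params(user: str, limit: int = 50) -> Dict[str, str]:
    """Build query parameters for the MediaWiki API `list=usercontribs` module."""
    return {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": user,
        "uclimit": str(limit),
        "ucprop": "ids|title|comment|flags",
    }


def looks_like_translation(comment: str) -> bool:
    """Return True if the edit summary suggests a derivative (translated) work."""
    lowered = comment.lower()
    return any(hint in lowered for hint in TRANSLATION_HINTS)


def keep_page(revisions: List[Dict[str, str]], cc0_users: Set[str]) -> bool:
    """Keep a page only if every revision is by a CC0 user and none looks
    like a translation. `revisions` is a hypothetical list of dicts with
    "user" and "comment" keys, one per revision of the page."""
    if not revisions:
        return False
    only_cc0_authors = all(rev["user"] in cc0_users for rev in revisions)
    has_translation = any(
        looks_like_translation(rev.get("comment", "")) for rev in revisions
    )
    return only_cc0_authors and not has_translation
```

For example, a page whose history contains a single edit by a CC0 user would be kept, while a page also edited by anyone outside the CC0 list would be dropped entirely. This is stricter than necessary, but it avoids having to untangle mixed authorship.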
Here are some examples of retrieved sentences: in English and in French.
The script still needs a lot of improvement, so I was wondering whether Wikipedia content is relevant enough for Common Voice and whether working on this script is worth it. I think we could retrieve a few hundred to a few thousand sentences this way.
Potential issues: spelling mistakes, the style of the sentences, and the script's execution time (currently, retrieving the content from one user may take between 30 minutes and 2 hours).
Thanks for your comments or suggestions! (ping @mhenretty)