Retrieving Wikipedia content under CC0 licence


(Jean-Baptiste Bertrand) #1


Some Wikipedia contributors publish their content under the Creative Commons Zero licence (see this page for example and the list of users who display this template on their Wikipedia personal pages).

I started writing a script which automatically retrieves this CC0 content from the French and English Wikipedia. I made sure to retrieve only contributions under the CC0 licence (e.g. I excluded derivative works like translations, and contributions which are mixed with other contributors’ content).

Here are some examples of retrieved sentences: in English and in French.

The script still needs a lot of improvement, so I was wondering whether Wikipedia content is relevant enough for Common Voice and whether working on this script is worth it. I think we could retrieve a few hundred to a few thousand sentences this way.

Potential issues: spelling mistakes, the style of sentences, and the script execution time (for the moment, retrieving the content from one user may take between 30 min and 2 hours).

Thanks for your comments or suggestions! (ping @mhenretty)

(Jean-Baptiste Bertrand) #2

I updated the script to make it easier for other people to use. You can see it here. You’ll need this script too. It still needs improvements.

Basic usage:
~python <language code> <output directory>
~python "en" "C:\Users\Jean-Baptiste\Documents\WP_CC0_content"
Then you wait a few hours, depending on the number of contributions to retrieve. The script creates one file per Wikipedia contributor. Hope it will be useful somehow!

(Andrew) #3

This is really cool! I assume this counts as public domain. Can anyone think of a reason why it wouldn’t? If it does, then I don’t mind throwing in a little development time to help out with the script.

(Jean-Baptiste Bertrand) #4

Some help would be great!

I think the licence is OK. My doubts are mainly about the style of sentences, because I heard that we’re looking for sentences with conversational style, and Wikipedia articles are not really written in that style. Is that a problem and if so, how should we address it?

Other than that, I think the main improvement we can make to the script is to clean up sentences, i.e. to stop the script from retrieving “garbage”, like sentences extracted from tables, etc. Out of context, these sentences will look weird to the people who read them. But I think this kind of content is pretty easy to detect and exclude. Here’s an example of the kind of sentence I’m talking about:

Genre = Melodic Death Metal | Length = sixty-seven min twenty-nine sec | Label = Spinefarm Records | Producer = ???
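A minimal sketch of how this kind of table residue could be detected, assuming it usually shows up as pipe separators, "key = value" pairs, or leftover template braces. The function names and the pattern below are hypothetical and are not taken from the actual script:

```python
import re

# Hypothetical heuristic: infobox/table residue tends to contain pipe
# separators, " = " key/value pairs, or leftover template braces.
_TABLE_RESIDUE = re.compile(r"\||\s=\s|\{\{|\}\}")


def looks_like_table_residue(sentence: str) -> bool:
    """Return True if the sentence contains typical table/infobox markers."""
    return bool(_TABLE_RESIDUE.search(sentence))


def keep_clean_sentences(sentences):
    """Drop every sentence that looks like it was extracted from a table."""
    return [s for s in sentences if not looks_like_table_residue(s)]
```

A filter like this would also reject ordinary sentences that happen to contain an equals sign surrounded by spaces, so the pattern would probably need tuning against real output.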

Another question I have is about abbreviations, which appear quite often in Wikipedia articles. For example: distances (e.g. “3000 km”), weights (“500 kg”), temperatures (“25°C”), or coordinates (22°12’5’’).

Should the script convert these abbreviations (km, kg, °C, etc.) to full text? (e.g. “kilometers”, “kilograms”, “degrees”, etc.).

Hope someone from the Mozilla team will be able to answer these questions!

(Michael Henretty) #5

This is awesome work @J-b and will be hugely impactful. I’d love to help you carry this forward. I’ll answer some of your questions below:

Agreed, the license is great, thank you for writing this tool!

And yes, we are definitely looking for more conversational-style text. That said, this Wikipedia text will have some advantages, mainly that it will contain some proper nouns and terminology we have so far missed. Several thousand (maybe even 10K) sentences from a variety of Wikipedia articles can only help the dataset.

More generally, we want our sentence collections in various languages to come from a variety of sources. If we can include Wikipedia, that’s awesome. But we shouldn’t base any language solely on Wikipedia if we can help it.

Yup, it would be best if we can have these fully written out (e.g. Celsius, kilograms, degrees, etc.).
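As a rough illustration of writing units out in full, here is a small sketch for English. The `UNIT_WORDS` mapping and the `expand_units` function are hypothetical names, not part of the existing script, and only the three units mentioned in this thread are covered:

```python
import re

# Hypothetical mapping from abbreviations to spoken English forms.
UNIT_WORDS = {
    "km": "kilometers",
    "kg": "kilograms",
    "°C": "degrees Celsius",
}

# A number, optional whitespace, then a known unit that is not
# immediately followed by another letter (so "3 kmh" is left alone).
_UNIT_PATTERN = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*("
    + "|".join(re.escape(unit) for unit in UNIT_WORDS)
    + r")(?![A-Za-z])"
)


def expand_units(sentence: str) -> str:
    """Replace abbreviated units that follow a number with full words."""
    return _UNIT_PATTERN.sub(
        lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}", sentence
    )
```

This always uses the plural form and leaves the numbers as digits, so singular quantities, numbers written as words, coordinates such as 22°12’5’’, and other languages such as French would all need separate handling.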

Also note, any scripts you write would definitely be useful for our language collection tool. So please continue to share your work with us!