[Legal] [Sentence extraction] Can I use Wikisource (CC0) for sentence collection?

The Bangla Wikisource has a much bigger collection of CC0 sentences [https://dumps.wikimedia.your.org/bnwikisource/20210501/] than Wikipedia [https://dumps.wikimedia.your.org/bnwiki/20210501/]. These texts are much more natural and held to a higher review standard than Wikipedia (as long as only the reviewed text body is used).
So I was wondering if I could get some help extracting the Wikisource data, and some legal advice on whether this is allowed.

I haven’t checked the dump, but the pages mention CC0 if applicable. It would be great if that were a flag in the dump as well, but I would need to check that. For articles that are not explicitly marked as CC0, the standard CC-BY-SA license applies and they can’t be used in full.

@phire would you like to check with Legal if we could do the same “max 3 sentences per article” approach for pages that fall under the CC-BY-SA license?

I am thinking of taking multiple sentences per page from CC0 content instead, because most of the data is CC0 and it’s easier legally.

Are there any examples from bn Wikisource that are under CC0? According to the standard Wikisource license, contributions are by default made under CC-BY-SA and GFDL, neither of which is CC0-compatible.

You’re right, I guess only the pictures are CC0. So do I have to take 3 sentences per paragraph or per body of work?

Please don’t use any of these sentences for now. Let’s wait on the official legal position. And even then we probably would need to integrate this into the Sentence Extractor to guarantee legal compliance.


Hey folks, thanks for the discussion here. I’ve verified with legal and they’re okay with using Wikisource the same way we use Wikipedia, i.e. as long as we extract no more than 3 sentences per article across the board. As @mkohler mentions, I would recommend you modify the existing sentence extraction tool to add Wikisource as a new scrape target. This is the best way for the internal team to have confidence that the legal requirements have been adhered to.
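
For illustration only, here is a minimal Python sketch of the "no more than 3 sentences per article" cap described above. It is not the Sentence Extractor's actual code: the function names, the random selection, and the input shape (a mapping from article title to candidate sentences) are all assumptions made for this example.

```python
import random

# Limit stated in the thread; everything else in this sketch is hypothetical.
MAX_SENTENCES_PER_ARTICLE = 3


def pick_sentences(sentences, limit=MAX_SENTENCES_PER_ARTICLE, seed=None):
    """Return at most `limit` distinct sentences from a single article."""
    rng = random.Random(seed)
    # Drop exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(sentences))
    if len(unique) <= limit:
        return unique
    # Choose `limit` sentences at random (an assumed selection strategy).
    return rng.sample(unique, limit)


def extract(articles, seed=None):
    """`articles` maps an article title to its list of candidate sentences."""
    return {
        title: pick_sentences(candidates, seed=seed)
        for title, candidates in articles.items()
    }
```

Applying the cap inside the tool itself, rather than trimming by hand afterwards, is what lets a reviewer confirm that no article contributes more than three sentences.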

Thanks for the follow-up Jenny!

For clarification: does this also include articles which are explicitly marked as being CC0 in the US? I haven’t looked at the dump yet, and this might be hard to extract in general, but I’m just wondering. In general, 3 per article is definitely the safe option, given the possible technical challenge of correctly identifying CC0 content.

Yeah, I think to be on the safe side let’s stick with 3 sentences per article across the board. From what I can tell, public domain texts are tagged pretty minutely depending on why and where they’re considered PD, and I don’t want to risk accidentally including the full text of something we shouldn’t because there was a bug in tag parsing or something.


Agreed, thanks for the clarification.

I have filed https://github.com/common-voice/cv-sentence-extractor/issues/142 with open questions. Happy to help if somebody wants to look at the dump and WikiExtractor to see how similar it is.

Thanks for the reply. Now I need to clarify something.
Is a whole book considered one article, or is one chapter considered one article? I’m asking especially about novels. And for plays, is every part considered one article, or the full play?

Good question! With Wikisource that might indeed be a bit more complicated. Do https://en.m.wikisource.org/wiki/Adventures_in_Contentment/I and https://en.m.wikisource.org/wiki/Adventures_in_Contentment/II count as different articles? I didn’t check the dump, but I’m fairly sure those would be two different entities in there, the same as two completely different articles on Wikipedia. If that isn’t okay, we’d need additional checks in the Sentence Extractor that consider the URL and only look at the base path of an article, ignoring anything after it, even if it’s a different URL.
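
To make the "base path" idea concrete, here is a rough Python sketch. It assumes that page titles in the dump use the same slash-separated form as the URLs above (for example `Adventures_in_Contentment/I`), which has not been verified against the dump. The helper names are hypothetical and not part of the Sentence Extractor.

```python
from collections import defaultdict


def base_title(title):
    """Map a subpage title such as 'Work/Chapter' to its work, 'Work'."""
    # Assumption: everything after the first "/" identifies a chapter or part.
    return title.split("/", 1)[0]


def group_by_work(pages):
    """Merge subpages so that each work is treated as a single article.

    `pages` is an iterable of (title, sentences) pairs.
    """
    works = defaultdict(list)
    for title, sentences in pages:
        works[base_title(title)].extend(sentences)
    return dict(works)
```

With this grouping, `Adventures_in_Contentment/I` and `Adventures_in_Contentment/II` would both fall under `Adventures_in_Contentment`, so a per-article limit would apply once to the whole book rather than once per chapter.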

@Oymate as a side question, have you tried out the Wikipedia extraction for bn? Does that work nicely with the non-Latin script?

I have integrated Wikisource into the Sentence Extractor. At least for the German rules file, adjustments would be needed, as the current quality of the resulting sentences would not be good enough for an export. However, that might look different for other languages.

It is very bad, to be honest. There’s occasional use of ‘.’ (full stop) to indicate that a person has a degree (এম.এ.), and only in these cases does the default sentence extractor kick in.
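
As a rough illustration of the problem, here is a Python sketch that splits Bangla text on the danda "।" (U+0964), the Bangla full stop, instead of on the Latin ".". This is only an assumed approach for the example, not how the Sentence Extractor is implemented, and `split_bangla` is a hypothetical helper.

```python
import re

# Split after the danda "।" (U+0964), "?" or "!" when followed by whitespace.
# The Latin "." is deliberately NOT a terminator here, because the thread notes
# that it appears inside abbreviations such as "এম.এ.".
SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_bangla(text):
    """Split Bangla text into sentences, keeping the closing punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]
```

For example, `split_bangla("তিনি এম.এ. পাস করেছেন। আমি ভালো আছি।")` yields two sentences and leaves the abbreviation "এম.এ." intact inside the first one.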
