[Legal] [Sentence extraction] Can I use Wikisource (CC0) for sentence collection?

The Bangla Wikisource has a much bigger collection of CC0 sentences [https://dumps.wikimedia.your.org/bnwikisource/20210501/] than Wikipedia [https://dumps.wikimedia.your.org/bnwiki/20210501/]. Its texts are also much more natural and held to a higher review standard than Wikipedia (as long as only the reviewed text body is used).
So I was wondering whether I could get some help extracting the Wikisource data, and some legal advice on whether this is allowed.

I haven't checked the dump, but the pages mention CC0 if applicable. It would be great if that were a flag in the dump as well, but I would need to check that. For articles that are not explicitly marked as CC0, the standard CC-BY-SA applies and they can't be used in full.

@phire would you like to check with Legal if we could do the same "max 3 sentences per article" approach for pages that fall under the CC-BY-SA license?

I am thinking of taking multiple sentences per page from CC0 pages instead, because most of the data is CC0 and it's easier legally.

Are there any examples from bn Wikisource that are under CC0? According to the standard Wikisource license, contributions are made under CC-BY-SA and GFDL by default, neither of which is CC0-compatible.

You're right, I guess only the pictures are CC0. So do I have to take 3 sentences per paragraph or per body of work?

Please don't use any of these sentences for now. Let's wait for the official legal position. And even then we would probably need to integrate this into the Sentence Extractor to guarantee legal compliance.

Hey folks, thanks for the discussion here. I've verified with Legal and they're okay with using Wikisource the same way we use Wikipedia, i.e. as long as we extract no more than 3 sentences per article across the board. As @mkohler mentions, I would recommend modifying the existing sentence extraction tool to add Wikisource as a new scrape target. This is the best way for the internal team to have confidence that the legal requirements have been adhered to.
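
To make the "no more than 3 sentences per article" rule concrete, here is a minimal Python sketch. It is only an illustration and not the Sentence Extractor's actual code; the function name `pick_sentences` and the random selection strategy are assumptions made for this example.

```python
import random

# Limit stated in this thread: at most 3 sentences per article.
MAX_SENTENCES_PER_ARTICLE = 3


def pick_sentences(sentences, limit=MAX_SENTENCES_PER_ARTICLE, seed=None):
    """Return at most `limit` sentences from one article.

    Hypothetical helper: the random choice is an assumption of this sketch,
    any selection strategy works as long as the cap is respected.
    """
    if len(sentences) <= limit:
        return list(sentences)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(list(sentences), limit)
```

Whatever the selection strategy, the cap has to be applied per article, so the important part is deciding what counts as one article.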

Thanks for the follow-up Jenny!

For clarification: does this also include articles which are explicitly marked as being CC0 in the US? I haven't looked at the dump yet and this might be hard to extract in general, but I'm just wondering. In general, 3 per article is definitely the safe option given the possible technical challenge of correctly identifying CC0 content.

Yeah, I think to be on the safe side let's stick with 3 sentences per article across the board. From what I can tell, public domain texts are tagged in fine-grained ways depending on why and where they're considered PD, and I don't want to risk accidentally including the full text of something we shouldn't because there was a bug in tag parsing or something.

Agreed, thanks for the clarification.

I have filed https://github.com/common-voice/cv-sentence-extractor/issues/142 with open questions. Happy to help if somebody wants to look at the dump and WikiExtractor to see how similar it is.

Thanks for the reply. Now I need to clarify something.
Is a whole book considered one article, or is one chapter considered one article? This matters especially for novels. And for plays, is each part considered one article, or the full play?

Good question! With Wikisource that might indeed be a bit more complicated. Do https://en.m.wikisource.org/wiki/Adventures_in_Contentment/I and https://en.m.wikisource.org/wiki/Adventures_in_Contentment/II count as different articles? I didn't check the dump, but I'm fairly sure those would be two different entities in there, the same as two completely different articles on Wikipedia. If that isn't okay, we'd need additional checks in the Sentence Extractor that consider the URL and only look at the base path of an article, ignoring anything after it, even if it's a different URL.
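
As a rough sketch of what such a base-path check could look like (in Python for illustration only, not the Sentence Extractor's actual code): the helper names `base_work` and `group_by_work` are hypothetical, and the sketch assumes that everything after the first `/` in a page title is a subpage of the same work, which would be wrong for works whose title itself contains a slash.

```python
from collections import defaultdict
from urllib.parse import unquote, urlparse


def base_work(url):
    """Return the part of the page title before the first '/'.

    Assumption of this sketch: subpages such as '.../I' and '.../II'
    belong to the same work as their base page.
    """
    path = unquote(urlparse(url).path)
    prefix = "/wiki/"
    title = path[len(prefix):] if path.startswith(prefix) else path.lstrip("/")
    return title.split("/", 1)[0]


def group_by_work(urls):
    """Group page URLs by their base work, so a cap can be applied per work."""
    groups = defaultdict(list)
    for url in urls:
        groups[base_work(url)].append(url)
    return dict(groups)
```

With the two URLs above, both chapters map to `Adventures_in_Contentment`, so they would share a single 3-sentence budget instead of getting 3 sentences each.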

@Oymate as a side question, have you tried out the Wikipedia extraction for bn? Does that work nicely with the non-Latin script?

I have integrated Wikisource into the Sentence Extractor. At least for the German rules file, adjustments would be needed, as the quality of the resulting sentences is currently not good enough for an export. However, that might look different for other languages.

It is very bad, to be honest. There's occasional use of '.' (full stop), for example to say that a person has a degree (এম.এ.), and only in these cases does the default sentence extractor kick in.
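
To illustrate the problem: Bengali prose normally ends sentences with the danda (।, U+0964), while '.' mostly shows up inside abbreviations. A minimal Python sketch of splitting on the danda instead of the full stop could look like this. It is an assumption-laden example and not how the Sentence Extractor works internally; `split_sentences` is a hypothetical name.

```python
import re

# Split on whitespace that follows a danda, question mark or exclamation mark.
# '.' is deliberately NOT treated as a sentence end (assumption of this sketch),
# so abbreviations such as degree titles stay inside their sentence.
SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Return the sentences in `text`, keeping their end punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]
```

For example, `split_sentences("তিনি এম.এ. পাস করেছেন। তিনি ঢাকায় থাকেন।")` yields two sentences, and the abbreviation in the first one is left intact.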
