[Legal] [Sentence extraction] Can I use Wikisource (CC0) for sentence collection?

Oymate · May 8, 2021, 3:37pm

The Bangla Wikisource has a much bigger collection of CC0 sentences [https://dumps.wikimedia.your.org/bnwikisource/20210501/] than Wikipedia [https://dumps.wikimedia.your.org/bnwiki/20210501/]. These texts are much more natural and are held to a higher review standard than Wikipedia (as long as only the reviewed text body is used).
So I was wondering if I could get some help extracting the Wikisource data, and some legal advice on whether this is allowed.

mkohler · May 8, 2021, 3:44pm

I haven’t checked the dump, but the pages mention CC0 if applicable. It would be great if that was a flag in the dump as well, but I would need to check that. For articles that are not explicitly marked as CC0, the standard CC-BY-SA license applies and they can’t be used in full.

@phire would you like to check with Legal if we could do the same “max 3 sentences per article” approach for pages that fall under the CC-BY-SA license?

Oymate · May 9, 2021, 1:22pm

I am thinking of taking multiple sentences per page from CC0 content instead, because most of the data is CC0 and it’s easier legally.

irvin · May 10, 2021, 4:25am

Are there any examples from bn Wikisource that are under CC0? According to the standard Wikisource license, contributions are by default made under CC BY-SA and GFDL, neither of which is CC0 compatible.

Oymate · May 10, 2021, 6:46am

You’re right, I guess only the pictures are CC0. So do I have to take 3 sentences per paragraph or per body of work?

mkohler · May 10, 2021, 8:04pm

Please don’t use any of these sentences for now. Let’s wait on the official legal position. And even then we probably would need to integrate this into the Sentence Extractor to guarantee legal compliance.

phire · May 17, 2021, 11:07pm

Hey folks, thanks for the discussion here. I’ve verified with legal and they’re okay with using Wikisource the same way we use Wikipedia, i.e. as long as we extract no more than 3 sentences per article across the board. As @mkohler mentions, I would recommend you modify the existing sentence extraction tool to add Wikisource as a new scrape target. This is the best way for the internal team to have confidence that the legal requirements have been adhered to.
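
To illustrate the rule, here is a minimal Python sketch of capping the selection at 3 sentences per article. It is not the Sentence Extractor's actual code: `pick_sentences`, `MAX_PER_ARTICLE`, and the shape of the input (a mapping from article title to already-split sentences) are assumptions made for illustration only.

```python
import random
from typing import Dict, List

MAX_PER_ARTICLE = 3  # the limit stated in the post above


def pick_sentences(
    articles: Dict[str, List[str]],
    limit: int = MAX_PER_ARTICLE,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Pick at most `limit` distinct sentences from each article.

    `articles` maps a hypothetical article title to its list of sentences.
    """
    rng = random.Random(seed)  # seeded so a run is reproducible
    picked: Dict[str, List[str]] = {}
    for title, sentences in articles.items():
        # Drop duplicate sentences while keeping their original order.
        unique = list(dict.fromkeys(sentences))
        # Never ask for more sentences than the article contains.
        picked[title] = rng.sample(unique, min(limit, len(unique)))
    return picked
```

Enforcing the cap in one place in the tool, rather than by hand, is what lets someone else verify the limit was respected for every article.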

mkohler · May 18, 2021, 4:28pm

Thanks for the follow-up Jenny!

For clarification: does this also include articles which are explicitly marked as being CC0 in the US? I haven’t looked at the dump yet, and this might be hard to extract in general, but I’m just wondering. In general, 3 per article is definitely the safe option, given the possible technical challenge of correctly identifying CC0 content.

phire · May 18, 2021, 6:34pm

Yeah, I think to be on the safe side let’s stick with 3 sentences per article across the board. Based on what I can tell, public domain texts are tagged pretty minutely depending on why and where they’re considered PD, and I don’t want to risk accidentally including the full text of something we shouldn’t because there was a bug in tag parsing or something.

mkohler · May 18, 2021, 7:50pm

Agreed, thanks for the clarification.

I have filed https://github.com/common-voice/cv-sentence-extractor/issues/142 with open questions. Happy to help if somebody wants to look at the dump and WikiExtractor to see how similar it is.

Oymate · May 20, 2021, 11:19am

Thanks for the reply. Now I need to clarify something.
Is a whole book considered one article, or is each chapter considered one article? This matters especially for novels. And for plays, is each part considered one article, or the full play?

mkohler · May 20, 2021, 5:38pm

Good question! With Wikisource that indeed might be a bit more complicated. Do https://en.m.wikisource.org/wiki/Adventures_in_Contentment/I and https://en.m.wikisource.org/wiki/Adventures_in_Contentment/II count as different articles? I didn’t check the dump, but I’m fairly sure those would be two different entities in there, same as two completely different articles on Wikipedia. If that isn’t okay, we’d need additional checks in the Sentence Extractor that consider the URL, look only at the base path for an article, and ignore anything after it, even if it’s a different URL.
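
A rough Python sketch of that base-path check, for illustration only: `base_work` and `group_by_work` are hypothetical helpers, not part of the Sentence Extractor, and the sketch assumes URLs of the form `/wiki/<Work>/<Subpage>`. It would wrongly merge pages whose own title contains a slash.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import unquote, urlparse


def base_work(url: str) -> str:
    """Return the part of the page title before the first '/'."""
    path = unquote(urlparse(url).path)
    if "/wiki/" not in path:
        raise ValueError(f"not a /wiki/ page URL: {url}")
    title = path.split("/wiki/", 1)[1]
    # Everything after the first slash is treated as a subpage (assumption).
    return title.split("/", 1)[0]


def group_by_work(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group page URLs so that all subpages of one work share a key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[base_work(url)].append(url)
    return dict(groups)
```

With this grouping, the two chapter URLs above fall under the single key `Adventures_in_Contentment`, so a per-article limit could be applied per work instead of per chapter.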

@Oymate as a side question, have you tried out the Wikipedia extraction for bn? Does that work nicely with the non-latin script?

mkohler · June 5, 2021, 10:08pm

I have integrated Wikisource into the Sentence Extractor. At least for the German rules file there would need to be adjustments, as the quality of the resulting sentences is not yet good enough for an export. However, that might look different for other languages.

Oymate · June 10, 2021, 9:04am

It is very bad, to be honest. In Bangla text the ‘.’ (full stop) is only used occasionally, for example to say that a person has a degree (এম.এ.), and only in these cases does the default sentence extractor kick in.
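
To make the problem concrete: Bangla normally ends a sentence with the danda (।), so a splitter that only looks for ‘.’ breaks inside abbreviations and misses the real boundaries. Below is a minimal Python sketch of splitting on the danda instead. It is an assumption-laden illustration, not the Sentence Extractor's segmentation logic, and `split_bangla` is a hypothetical name.

```python
import re
from typing import List

# Split on whitespace that follows a danda (।), '?' or '!'.
# The full stop is deliberately NOT a boundary, so abbreviations
# such as এম.এ. stay inside their sentence.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_bangla(text: str) -> List[str]:
    """Split Bangla text into sentences at danda, '?' or '!'."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "তিনি এম.এ. পাস করেছেন। তারপর তিনি শিক্ষক হন।"
    for sentence in split_bangla(sample):
        print(sentence)
```

A real rules file would still need to handle quotations, numerals, and text that does use ‘.’ as a sentence end.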
