[Legal] [Sentence extraction] Can I use Wikisource(CC0) for sentence collection

mkohler · May 20, 2021, 5:38pm

Good question! With Wikisource that might indeed be a bit more complicated. Do https://en.m.wikisource.org/wiki/Adventures_in_Contentment/I and https://en.m.wikisource.org/wiki/Adventures_in_Contentment/II count as different articles? I didn’t check the dump, but I’m fairly sure those would be two different entities in there, the same as two completely different articles on Wikipedia. If that isn’t okay, we’d need additional checks in the Sentence Extractor that consider the URL and look only at the base path of an article, ignoring anything after it, even if it’s a different URL.
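To make the base-path idea concrete, here is a minimal Python sketch. It is not the Sentence Extractor's actual code; the function names `base_article` and `group_by_work` are made up for illustration, and it assumes that subpages are separated by `/` after a `/wiki/` prefix, as in the two URLs above.

```python
from collections import defaultdict
from urllib.parse import unquote, urlsplit


def base_article(url: str) -> str:
    """Return the top-level page title of a wiki URL, dropping subpage parts.

    Hypothetical helper: assumes the path looks like /wiki/<Title>/<Subpage>.
    """
    path = urlsplit(url).path
    # Keep what follows "/wiki/"; fall back to the whole path if it is absent.
    _, found, title = path.partition("/wiki/")
    if not found:
        title = path.lstrip("/")
    # Everything after the first "/" is treated as a subpage and ignored.
    return unquote(title.split("/", 1)[0])


def group_by_work(urls):
    """Group URLs so that all subpages of one work share a single key."""
    groups = defaultdict(list)
    for url in urls:
        groups[base_article(url)].append(url)
    return dict(groups)


urls = [
    "https://en.m.wikisource.org/wiki/Adventures_in_Contentment/I",
    "https://en.m.wikisource.org/wiki/Adventures_in_Contentment/II",
]
groups = group_by_work(urls)
```

With this grouping, both chapters fall under the single key `Adventures_in_Contentment`, so any per-article limit could be applied per work instead of per page. One caveat: on Wikipedia a slash can be part of a real title (for example "AC/DC"), so a check like this would probably need to be limited to Wikisource.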

@Oymate as a side question, have you tried out the Wikipedia extraction for bn? Does that work nicely with the non-Latin script?
