Following the “3 sentences” practice that CV currently uses with Wikipedia, is it legally acceptable to extract a very small amount of content from each news report and subtitle file?
If yes, what is the suggested amount?
Also, “daily news” (reports of facts) is considered a work without copyright in Thailand. There is an uncertain area between what is a mere report of facts and what has crossed the line into analysis (which is copyrightable), but if we were theoretically able to extract only the reports of facts, could we use those sentences as public domain?
I’m asking this because there was recently a discussion in the Thai community about adding more sources, and there’s a proposal to use scraped content from news sites and other sources like subtitles (from the Open Parallel Corpus project): https://m.facebook.com/groups/527601721545161/permalink/548317636140236/
–
So, to summarize, I think we have two questions:
Is it OK to use the “daily news” scraped from news sites? (Some of us think it can be considered public domain, while the copyright notices on the websites may say otherwise.)
For copyrighted work, like articles and subtitles, is it OK to extract a few % of it? What is the suggested %?
For now, please make sure to only include 100% Public Domain sentences. Given that there is uncertainty around the questions you asked, please do not use those sources for now.
@phire can you have a look here and possibly check with legal on if and how to proceed here? Thanks!
If you want to find public domain news sources, you can probably check whether there is a local rule similar to the copyright-free status of US government works. If so, then government press releases could be a good source. https://www.usa.gov/government-works
phire
(Jenny Zhang (Lead Engineer, Common Voice))
Yeah, given that it’s an international project we need to be really careful about copyright laws that only apply in one country, because there’s no way of knowing if a Thai news article is actually a syndication / translation of an item from a British/American/whatever newswire service that does have copyright and terms of service attached. As a rule, unless the source itself explicitly specifies that the text is public domain / CC0 we’re unlikely to accept it.
Out of curiosity, how does this “3 sentences allowance” arrangement between Common Voice and Wikipedia actually work? The copyright of each Wikipedia contribution belongs to the individual contributor (licensed to the public under GFDL and CC-by-sa) and is never transferred to Wikipedia/Wikimedia (as an organization).
What kind of mechanism allows a copyrighted work to enter the public domain (before its copyright expires)?
I think this arrangement is interesting, and if it’s possible to arrange something similar on the same legal basis, we may try looking around for other crowdsourced projects and contacting their coordinators for discussion. Thank you.
phire
(Jenny Zhang (Lead Engineer, Common Voice))
That decision predates me but I suspect it has something to do with fair use. @mbranson-new any insights?
Hi all, this isn’t something I can remember as it’s been a while and this isn’t my domain of expertise (it was a decision made by the Mozilla Legal team).
I’ll circle this back to them (legal counsel) and loop in @Em.Lewis-Jong as well. EM, perhaps you can follow up here once we’ve gotten it clarified to make sure we’re all on the same page?
Hey! Thanks for your patience, though unfortunately I can’t be a lot of help on this occasion. I’ve spoken with our legal counsel and we don’t share our internal legal analyses. All I can say is that Wikipedia collectively releases its text under a license with the intention that it can be used for other purposes. Thank you for the question and contribution though - as always.