Scraping news sites/subtitles -- license question

bact · April 24, 2021, 6:07am

Following the “3 sentences” practice that CV currently uses with Wikipedia, is it legally acceptable to extract a very small amount of content from each news report and subtitle file?

If yes, what is the suggested amount?

Also, “daily news” (a report about facts) is considered a work without copyright in Thailand. There is an uncertain area between what is a mere report about facts and what has crossed the line into analysis (which is copyrightable), but if we are theoretically able to extract only the reports about facts, can we use those sentences as public domain?

I’m asking this because there was recently a discussion in the Thai community about adding more sources, and there’s a proposal to use scraped content from news sites and other sources like subtitles (from the Open Parallel Corpus project) https://m.facebook.com/groups/527601721545161/permalink/548317636140236/

–

So, to summarize, I think we have two questions:

Is it OK to use “daily news” scraped from news sites? (some of us think it can be consistent as public domain, while the copyright notice on the websites may say otherwise)
For copyrighted work, like articles and subtitles, is it OK to extract a few % of it? What is the suggested %?

Thank you.

mkohler · April 24, 2021, 1:39pm

For now, please make sure to only include 100% Public Domain sentences. Given that there is uncertainty around the questions you asked, please do not use those sources for now.

@phire can you have a look here and possibly check with legal on whether and how to proceed? Thanks!

irvin · April 25, 2021, 5:44am

As far as I know, Wikipedia is a special case: we had an agreement with them on the 3-sentence rule. It doesn’t apply to other sources.

Subtitles are also very risky because they are a type of adaptation and are more likely to fall under the same copyright as the original work.

irvin · April 25, 2021, 5:47am

If you want to find public domain news sources, you can check whether there is any local rule similar to the copyright-free status of US government works. If so, then government press releases can be a good source.
https://www.usa.gov/government-works

bact · April 26, 2021, 5:57am

I see. So it’s actually a specific bilateral agreement between specific parties, and not something we can assume with other sources. Thank you.

bact · April 26, 2021, 6:11am

should be read “considered as public domain”

(not “consistent”, sorry)

phire · April 26, 2021, 3:57pm

Yeah, given that it’s an international project we need to be really careful about copyright laws that only apply in one country, because there’s no way of knowing if a Thai news article is actually a syndication / translation of an item from a British/American/whatever newswire service that does have copyright and terms of service attached. As a rule, unless the source itself explicitly specifies that the text is public domain / CC0 we’re unlikely to accept it.

bact · April 27, 2021, 2:41am

Thanks.

Out of curiosity, how does this “3 sentences allowance” arrangement between Common Voice and Wikipedia actually work? The copyright of each Wikipedia contribution belongs to the Wikipedia contributor (licensed to the public under GFDL and CC-by-sa) and was never transferred to Wikipedia/Wikimedia (as an organization).

What kind of mechanism allows a copyrighted work to enter the public domain (before its copyright expires)?

I think this arrangement is interesting, and if it’s possible to arrange something similar on the same legal basis, we may try to look around for other crowdsourced projects and contact their coordinators for discussion. Thank you.

phire · April 28, 2021, 5:20pm

That decision predates me but I suspect it has something to do with fair use. @mbranson-new any insights?

mishari · May 25, 2021, 7:53am

I would also like to know the details @mbranson-new

mbranson-new · May 27, 2021, 10:02pm

Hi all, this isn’t something I can remember as it’s been a while and this isn’t my domain of expertise (it was a decision made by the Mozilla Legal team).

I’ll circle this back to them (legal counsel) and loop in @Em.Lewis-Jong as well. Em, perhaps you can follow up here once we’ve gotten it clarified to make sure we’re all on the same page?

Em.Lewis-Jong · May 28, 2021, 2:17pm

It’s a really interesting question - I’ll come back as soon as I have an answer for you all!

Em.Lewis-Jong · June 3, 2021, 7:35pm

Hey! Thanks for your patience, though unfortunately I can’t be a lot of help on this occasion. I’ve spoken with our legal counsel and we don’t share our internal legal analyses. All I can say is that Wikipedia collectively releases its text under a license with the intention that it can be used for other purposes. Thank you for the question and contribution though - as always.

hatemyself · September 26, 2021, 8:17pm

I think yes, because this is daily news, isn’t it?
But also note that you must scrape the data from the website properly, pulling out only the content.

bozden · September 26, 2021, 8:48pm

Hi @hatemyself, according to the 7th post here, the answer is no…
