Scraping news sites/subtitles -- license question

Following the “3 sentences” practice that CV currently uses with Wikipedia, is it legally acceptable to extract a very small amount of content from each news report and subtitle file?

If yes, what is the suggested amount?

Also, “daily news” (a report of facts) is considered a work without copyright in Thailand. There is a gray area between a mere report of facts and something that has crossed the line into analysis (which is copyrightable). But if we were theoretically able to extract only the reports of facts, could we use those sentences as public domain?

I’m asking this because there was recently a discussion in the Thai community about adding more sources, and there is a proposal to use scraped content from news sites and other sources like subtitles (from the Open Parallel Corpus project).

So, to summarize, I think we have two questions:

  1. Is it OK to use the “daily news” scraped from news sites? (some of us think it can be consistent as public domain, while the copyright notice on the websites may say otherwise)

  2. For copyrighted work, like articles and subtitles, is it OK to extract a few % of it? What is the suggested %?

Thank you.

For now, please make sure to only include 100% Public Domain sentences. Given that there is uncertainty around the questions you asked, please do not use those sources for now.

@phire can you have a look here and possibly check with legal on whether and how to proceed? Thanks!


As far as I know, Wikipedia is a special case: we had an agreement with them on the 3 sentences rule. It doesn’t apply to other sources.

Subtitles are also very risky because they are a type of adaptation and are more likely to fall under the same copyright as the original work.

If you want to find public domain sources for news, you can check whether there is a local rule similar to the copyright-free status of US government works. If so, then government press releases could be one of the good sources.

I see. So it’s actually a specific bilateral agreement between specific parties, and not something we can assume with other sources. Thank you.

  • should be read “considered as public domain”

(not “consistent”, sorry)

Yeah, given that it’s an international project we need to be really careful about copyright laws that only apply in one country, because there’s no way of knowing if a Thai news article is actually a syndication / translation of an item from a British/American/whatever newswire service that does have copyright and terms of service attached. As a rule, unless the source itself explicitly specifies that the text is public domain / CC0 we’re unlikely to accept it.

Out of curiosity, how does this “3 sentences allowance” arrangement between Common Voice and Wikipedia actually work? The copyright of each Wikipedia contribution belongs to its contributor (licensed to the public under GFDL and CC-by-sa) and is never transferred to Wikipedia/Wikimedia (as an organization).

What kind of mechanism allows a copyrighted work to enter the public domain (before its copyright expires)?

I think this arrangement is interesting, and if it’s possible to arrange something similar on the same legal basis, we may try to look around for other crowdsourced projects and contact their coordinators for discussion. Thank you.

That decision predates me but I suspect it has something to do with fair use. @mbranson-new any insights?