Text Corpus Link Collection

I think texts should be collected and validated by people who are fluent in the language in question.
So I may suggest a corpus in a language other than my native one, but people fluent in that language should check whether it is a valid corpus.

I've already had my share of painful experiences with Japanese collections, and I don't want anyone else to make the same mistake.


The White House

Copyright Policy | The White House:

Pursuant to federal law, government-produced materials appearing on this site are not copyright protected. The United States Government may receive and hold copyrights transferred to it by assignment, bequest, or otherwise.

Seriously? I'm not familiar with the law, but can we add this to the list of appropriate corpora?

2020-11-15: I didn't know this, but it seems that copyright doesn't apply to the text of laws, news reports, etc. in Japan either.

What about your language? This kind of text is not "conversational" by any means, but it would add to the volume of text.

Internet Archive

Most of the resources may be useful, but some of the content is in clear violation of rights. (To take Japanese as an example, there are videos of famous anime and pictures of idols.)
If we can verify each one of these licenses, we may be able to add them to the list, but what do you think?

datos.bne.es

This one is in Spanish, which I can't read.
Maybe it's CC0, but I don't know.
The Biblioteca Nacional de España turns up in some results when I search for CC0.

Is there anyone who can look it up?
Are there any sentences in these resources that could be used?

Mozilla

I'm sure there are people here who are familiar with Mozilla, so I'm going to ask: Are there any CC0 works in Mozilla's public resources?

All I could find was an MDN code sample: Code samples and snippets - About MDN Web Docs - The MDN project | MDN

OSCAR

I mention this because I know some of you may want to add it to the list.

I've downloaded the Japanese file.
Some of the text contained unique, identifiable sentences; a quick Google search shows that they were extracted from personal sites, corporate promotions, reports of charitable activities, porn sites, etc. There were also a lot of proper nouns (names of identifiable individuals, groups, and works).
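
For anyone who wants to repeat this kind of spot check on another language file, here is a minimal sketch in Python. It only picks a random handful of lines to paste into a search engine by hand; it is not how OSCAR was built, and it is not a record of exactly what I did. The function name `sample_lines` and the file name in the usage comment are hypothetical, and I am assuming the download is a UTF-8 text file with one passage per line.

```python
import random
from typing import Iterable, List


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return up to k non-empty lines chosen uniformly at random.

    Uses reservoir sampling, so the whole file never has to fit in memory.
    """
    rng = random.Random(seed)  # fixed seed so a check can be repeated
    reservoir: List[str] = []
    seen = 0  # number of non-empty lines read so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < k:
            reservoir.append(line)
        else:
            # Keep the new line with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = line
    return reservoir


# Usage (the file name is an assumption, not the real download name):
# with open("ja_part.txt", encoding="utf-8") as f:
#     for sentence in sample_lines(f, 20):
#         print(sentence)
```

A sample like this cannot prove a corpus is clean. It can only surface problems quickly, which is why someone fluent in the language still has to read the output.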

I have contacted OSCAR about this and am waiting to hear back. (I asked what process they used to obtain the text, how they checked that it is legitimate, and so on.)
But whatever the reply, I will not add OSCAR to the list.

If you think it's appropriate, useful, and worth using in other languages, please add it to the list and mention "Japanese files have concerns" in the Note field.

In Japanese

リストに追加したい人もいると思うので、触れておきます。

私は日本語ファイルをダウンロードしました。
コーパスの一部には、独特の、特定可能な文章がありました。Googleで検索してみると、それらは個人サイト、企業の宣伝、慈善団体の活動報告、ポルノサイト等から抽出されたことがわかります。固有名詞(特定可能な個人や集団、作品の名前)もたくさんありました。

私はこの件についてOSCARに問い合わせ、返事を待っているところです。
でも、どのような返事であれ、私はOSCARをリストに追加しません。

もし他の言語では適切で、有用であり、使用に値するというのであれば、「日本語ファイルには懸念がある」とNote欄に記載した上で、リストに追加して下さい。