Text Corpus Link Collection

sinumade · November 15, 2020, 4:38am

Collection of links to the text corpus. Add to it freely.
Even if a corpus is not in your language, it is in the public domain, so we can translate it. Translation is not easy, but it can be a good alternative.
Of course, it will also help those who use it for purposes other than Common Voice.

Sentence collection is the starting point for voice recordings and datasets, and it is an important part of Common Voice. Please share any corpus you know of and help everyone.

A language needs 5,000 validated sentences before voice collection can start.
Proper training of the system requires 1,800,000 sentences (for 2,000 hours of voice data).
ref. My language is now collecting voice, what do I need to know?
ref. Mozilla Voice Community Playbook
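As a rough illustration of the two targets above, the following sketch works out what they imply per sentence. Only the figures 5,000, 1,800,000, and 2,000 hours come from this post; the function name and the idea of reporting progress as a fraction are assumptions made for the example.

```python
# Figures quoted in this post.
START_THRESHOLD = 5_000        # validated sentences needed to start voice collection
TRAINING_TARGET = 1_800_000    # sentences for proper training
TRAINING_HOURS = 2_000         # hours of voice data those sentences correspond to

# Average recording length implied by the two training figures.
seconds_per_sentence = TRAINING_HOURS * 3600 / TRAINING_TARGET


def progress(validated_sentences):
    """Return (can_start, fraction_of_training_target) for a language.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    can_start = validated_sentences >= START_THRESHOLD
    fraction = min(validated_sentences / TRAINING_TARGET, 1.0)
    return can_start, fraction


print(seconds_per_sentence)   # average seconds of audio per sentence
print(progress(5_000))
```

In other words, the training target assumes about four seconds of audio per sentence, and the 5,000 sentences needed to start are only a small fraction of it.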
How to find public domain texts:
- How-to on Common Voice Sentence Collector
- Ideas for finding public domain text

Before adding the corpus to the list, please search this wiki. Press Ctrl and F on your keyboard to bring up your browser’s search bar.
Corpora should be grouped by language to make them easier to find. Add each new corpus at the bottom.
If you don’t know if you can add a corpus to the list, please ask.
Languages that are not on Common Voice can be added to the list, but it is recommended that you post a topic requesting a language to be added to the Common Voice category first. See the topic Readme: How to see my language on Common Voice.
Check the license carefully, especially if:
- the author and publisher of the text are different.
- the corpus is large.
- there is more than one author.
If there is a copyright issue, the sentence will be removed from the sentence collection. Voice data may become unusable as well. Everyone’s efforts will be for naught.
Even if you are not interested in exploring, collecting, or validating a corpus, confirming that it really is in the public domain is a great contribution. Please be the “gatekeeper” for everyone’s efforts.

How to fill each field

Leave a field blank if you don’t know the details.

1. Corpus

Link to the corpus.

Link to the content we (will) collect.
- If a list exists, link to the root of the list (a page with full view of the public domain works).
- If only a portion of the content is in the public domain, mention that in the Note field.
Be precise and concise with the name of the corpus.
- If you don’t know what it is, write the title of the page. (Page headings, browser title bar, etc.)
- If there is a specific version of the corpus, state that as well.
- e.g. The Sinumade Book of Adventures (2020 edition)

2. Language

The language must be written as it appears in the Sentence Collector. For example, Chinese is written as Chinese regardless of region.

If there is more than one language, separate them with commas and list them in alphabetical order. Example: English, French, German
If the language is not in the Sentence Collector, append a + sign to the name of the language. Example: English+
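The rules for the Language field can be sketched as a small helper. This is only an illustration: the function name is made up, and the `supported` set stands in for whatever languages the Sentence Collector currently lists, which this post does not enumerate.

```python
def format_language_field(languages, supported):
    """Build the Language field: alphabetical, comma-separated,
    with "+" appended to names not found in `supported`.

    `supported` is an assumed stand-in for the Sentence Collector's
    language list; pass in the real names when you have them.
    """
    # Drop empty entries and duplicates, then sort alphabetically.
    names = sorted({name.strip() for name in languages if name.strip()})
    return ", ".join(
        name if name in supported else name + "+" for name in names
    )


supported = {"English", "French", "German"}   # placeholder list
print(format_language_field(["German", "English", "French"], supported))
print(format_language_field(["Examplish", "English"], supported))
```

Note that `sorted` orders by Unicode code point, which matches alphabetical order for plain Latin names but may need adjusting for names with accents.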

3. State

If possible, mark the following:

CC0: The corpus text indicates permission, or it links to a document that indicates permission.
PD: Public Domain. This is mainly for works whose copyright has expired. If the copyright holder has waived the rights to the work, mark it CC0 instead.

4. Permission

A link to a document that indicates the permission for the corpus.

Related documents other than the permission should be written in the Note field.

5. Note

Anything to consider when collecting. For example, limitations on collection (e.g., only part of the corpus can be collected) or a need for editing.

Appropriate corpus

A corpus that has been confirmed to be in the public domain.

As a general rule, collect only content created by the author. Do not collect content that is quoted, reproduced from other sources, or otherwise mentioned. The distinction must be made by a real person.
- In most cases, blockquote and q elements are used in HTML to indicate a citation. (In some cases, these elements are not used.)
If you have contacted the rights holder and received permission, mark Contacted in the Note field. If possible, include the date of permission. For example,
- Contacted 2020-10-25
Whenever possible, link to the document that indicates permission.
Before collecting, it is recommended that:
- you pull out a few lines at random and search for them on Google (or whichever search engine returns the most hits for that language), to check whether the corpus infringes someone else’s rights.
- we save a copy of the work in the Wayback Machine, as the author may delete it.
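The blockquote/q hint and the random spot check above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library; the class and function names are invented for the example, and it assumes the page actually marks quotations with blockquote or q elements, which is not always the case. A person still has to make the final decision.

```python
import random
from html.parser import HTMLParser


class QuoteSkipper(HTMLParser):
    """Collect text that is outside <blockquote> and <q> elements."""

    QUOTE_TAGS = {"blockquote", "q"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many quote elements we are currently inside
        self.parts = []   # text fragments found outside quotations

    def handle_starttag(self, tag, attrs):
        if tag in self.QUOTE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.QUOTE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def text_outside_quotes(html):
    """Return the page text with blockquote/q content dropped."""
    parser = QuoteSkipper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def sample_lines(text, k=3, seed=None):
    """Pick up to k non-empty lines at random to paste into a search engine."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return random.Random(seed).sample(lines, min(k, len(lines)))


page = "<p>My own words.</p>\n<blockquote>Someone else's words.</blockquote>\n<p>More of mine.</p>"
print(sample_lines(text_outside_quotes(page), k=2))
```

The sampled lines are meant to be searched by hand; the code does not judge whether a hit means the text was copied.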

Corpus	Language	State	Permission	Note
Wikipedia				There are limitations. See: Sentence Extractor - Current Status and Workflow Summary
GitHub - irvin/cc0-sentences	Chinese	CC0
mlog	English	CC0	Everything by me – Happy GNU Year & Public Domain Day – mlog
zen habits	English	CC0	Uncopyright : zen habits	Leo’s ebooks are also in the public domain.
mnmlist	English	CC0	» uncopyright mnmlist
星空文庫	Japanese	CC0		Only Public Domain category can be collected
deztec.jp	Japanese	CC0	Info／趣味のWebデザイン	Contacted 2020-10-28
死ぬまで憶えておいて	Japanese	CC0	sinumade.net 槪要	Written in uncommon orthography (historical kana orthography and Kyūjitai) and needs to be edited
青空文庫	Japanese	PD		Contacted 2020-11-14 / Only out of copyright works can be collected / See: Post #3

Candidate corpus (DO NOT USE this corpus)

A corpus that has not been confirmed to be in the public domain.

The rights holder must be contacted to obtain permission before this corpus can be used.
- If permission is granted, move the corpus to the Appropriate corpus section and mark Contacted in the Note field.
- If permission is not granted, move the corpus to the Invalid corpus section and mark non-permitted in the Note field.

Corpus	Language	State	Permission	Note

Invalid corpus (DO NOT USE this corpus)

A corpus that must not be used.
For example, a corpus that was used but found to be inappropriate.

If possible, state in the Note field why it is invalid.
- If you couldn’t get permission from the rights holder, mark non-permitted in the Note field.
- If there is a problem with the corpus, mark Problem in the Note field and describe it specifically and concisely. (Details should be available for reference in a separate document.)

Corpus	Language	State	Permission	Note
Tanaka Corpus (Public Domain version)	English, Japanese			Problem: copyright issues. See: Post #11 on Sentence collector copyright issues / It was in the Japanese language collection until October 2020.

Supplement

Offline resources can also be added to the list. If so, please mark Offline in the Note field and include how to access it. For example,
- Offline: Stored in the sinumade library. Reader’s card required to view.
The wiki can be edited when the Discourse trust level is Member or higher. Otherwise, please share the information in a reply. Someone else will add it to the wiki.
- Please include the information for each field in your reply. A list is fine. For example,
  1. Corpus:
  2. Language:
  3. State:
  4. Permission:
  5. Note:
- If the information is only a name or a link, those adding to the wiki should check the corpus and fill in the fields.
- ref. What is a Wiki Post? - howto / faq - Discourse Meta
How to write a table:
- Markdown (current): Create a table using markdown on your Discourse forum - howto / tips & tricks - Discourse Meta
- HTML: <table>: The Table element - HTML: HyperText Markup Language | MDN
  - In summary, <tr> is a row, <th> is a heading cell, and <td> is a data cell. They can omit the end tag.
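To make the Markdown option concrete, here is a small sketch that turns the five fields into a Markdown table. The column names come from this wiki; the function names are hypothetical, and escaping a literal pipe as `\|` is an assumption about how the forum's Markdown renderer treats table cells, so check the preview before saving.

```python
FIELDS = ["Corpus", "Language", "State", "Permission", "Note"]


def escape_cell(value):
    """Keep a cell on one line and escape pipes so it cannot split the row."""
    return str(value).replace("|", "\\|").replace("\n", " ").strip()


def to_markdown_table(entries):
    """Render a list of dicts (keyed by FIELDS) as a Markdown table.

    Missing fields are left blank, as this wiki asks for unknown details.
    """
    header = "| " + " | ".join(FIELDS) + " |"
    divider = "|" + "|".join(["---"] * len(FIELDS)) + "|"
    rows = [
        "| " + " | ".join(escape_cell(e.get(f, "")) for f in FIELDS) + " |"
        for e in entries
    ]
    return "\n".join([header, divider, *rows])


print(to_markdown_table([
    {"Corpus": "mlog", "Language": "English", "State": "CC0"},
]))
```

Because Markdown cells cannot contain line breaks, the sketch flattens newlines to spaces; that is the same limitation listed under the Markdown cons below.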
If you have concerns about the use of a corpus, do not hesitate to share the information with us.
The original poster is not fluent in English, so please correct the wiki if you know a better wording.
Feedback and questions are welcome.
- Discussions about the corpus are always welcome. However, perhaps this should be a separate topic.

Matters for consideration

For example, how about marking it as a WIP (work in progress) when someone is collecting?
- Pros: Avoid duplication of work.
- Cons: We must mark the wiki at the beginning and end of the collection.
- Cons: The possibility of volunteers abandoning the work in the middle of the process.
Do we need the Collected corpus section? Or Collected marks? (To avoid duplication of work.)
As originally posted, the table is written in Markdown, but would you prefer HTML?
- HTML:
  - Pros: Accessibility considerations - only the caption element, th element, and the scope attribute - are better than nothing.
    - HTML table advanced features and accessibility - Learn web development | MDN
  - Pros: Can make line breaks in cells. Can include lists, quotes, etc.
  - Cons: HTML isn’t too difficult, but it’s hard to notice when we’ve made a typo in a tag. Discourse doesn’t validate tags.
- Markdown:
  - Pros: Everyone can write without knowledge. (Just select it in the editor.)
  - Pros: Bold and links are easy to do.
  - Cons: Can’t make line breaks in cells. Can’t include lists, quotes, etc.
How to indicate the language of the table
- Is it better to use the language code?
- If possible, the original poster would like the wiki to be simple and understandable for everyone.

Note

The goal is to tell each other about corpora so that collection in each language can be more efficient and active. There is a concern that sharing information may lead to confusion in the work, and I would like to ask for your opinions on this issue.
Invalid corpora in particular should be shared so that volunteers’ efforts are not wasted.
Another aim is to identify inappropriate corpora.

@sinumade wants to publish this wiki and list in the public domain. I asked in Can I waive my copyright? whether posts can be in the public domain and which license applies to them, but I did not get an official answer from Mozilla (as of 2020-10-30). But if I could, I would. I want anyone who edits this wiki to do so with the intention of waiving their copyright.

sinumade · November 15, 2020, 4:16am

I think the collection and validation of texts should be done by people who are fluent in that language.
So I may suggest a corpus in a language other than my native one, but people fluent in that language should check whether it is a valid corpus.

I've already had my share of painful experiences with Japanese collections, and I don't want anyone else to make the same mistake.

The White House

Copyright Policy | The White House:

Pursuant to federal law, government-produced materials appearing on this site are not copyright protected. The United States Government may receive and hold copyrights transferred to it by assignment, bequest, or otherwise.

Seriously? I'm not familiar with the law, but can we add this to the list of appropriate corpus?

2020-11-15: I didn't know this, but it seems that copyright doesn't apply to legal texts, news reports, etc. in Japan either.

What about your language? Such text is not "conversational" by any means, but it would increase the number of sentences.

Internet Archive

Internet Archive Search: licenseurl:http*publicdomain*
- ref. Search - A Basic Guide – Internet Archive Help Center: See: Can I search by Creative Commons license?

Most of the resources may be useful. But there is also content that is in clear violation of rights. (For example, to narrow it down to Japanese, there are videos of famous anime and pictures of idols.)
If we can verify each one of these licenses, we may be able to add them to the list, but what do you think?
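For anyone who wants to go through these results systematically, the query above can be turned into a request URL. The sketch below only builds the URL and makes no network call; it assumes the Internet Archive's `advancedsearch.php` endpoint with its `q`, `fl[]`, `rows`, and `output` parameters, and the function name and default values are made up for the example. Each item's license still has to be verified by a person.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The query string quoted in this post.
PUBLIC_DOMAIN_QUERY = "licenseurl:http*publicdomain*"


def build_search_url(query=PUBLIC_DOMAIN_QUERY, rows=50):
    """Build an Internet Archive search URL that returns JSON.

    Assumption: the endpoint and parameter names follow the Internet
    Archive's advanced search form; check its help pages before relying on it.
    """
    params = [
        ("q", query),
        ("fl[]", "identifier"),   # fields to return for each item
        ("fl[]", "licenseurl"),
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url()
print(url)
# Decode the query string again to confirm the search terms survived encoding.
print(parse_qs(urlsplit(url).query)["q"])
```

Requesting the `licenseurl` field alongside each identifier makes it easier to check every claimed license one by one, which is the verification step asked about above.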

datos.bne.es

Datos enlazados en la BNE. Biblioteca Nacional de España
- datos.bne.es

This is in Spanish, which I can't read.
Maybe it's CC0, but I don't know.
The Biblioteca Nacional de España shows up in some search results for CC0.

Is there anyone who can look it up?
Are there any sentences in these resources that could be used?

Mozilla

site:mozilla.org CC0 - Google

I'm sure there are people here who are familiar with Mozilla, so I'm going to ask: Are there any CC0 works in Mozilla's public resources?

All I could find was an MDN code sample: Code samples and snippets - About MDN Web Docs - The MDN project | MDN

OSCAR

I mention this because I know some of you may want to add it to the list.

I've downloaded the Japanese file.
Some of the text had unique, identifiable sentences; a quick Google search shows that they were extracted from personal sites, corporate promotions, reports of charitable activities, porn sites, etc. There were also a lot of proper nouns (names of identifiable individuals, groups and works).

I have contacted OSCAR about this and am waiting to hear back. (What process did they use to get the text, to check if it's legitimate, etc.)
But whatever the reply, I will not add OSCAR to the list.

If you think it's appropriate, useful, and worthy of use in other languages, please add it to the list, with the Note field mentioning "Japanese files have concerns".

In Japanese

リストに追加したい人もいると思うので、触れておきます。

私は日本語ファイルをダウンロードしました。
コーパスの一部には、独特の、特定可能な文章がありました。Googleで検索してみると、それらは個人サイト、企業の宣伝、慈善団体の活動報告、ポルノサイト等から抽出されたことがわかります。固有名詞（特定可能な個人や集団、作品の名前）もたくさんありました。

私はこの件についてOSCARに問い合わせ、返事を待っているところです。
でも、どのような返事であれ、私はOSCARをリストに追加しません。

もし他の言語では適切で、有用であり、使用に値するというのであれば、「日本語ファイルには懸念がある」とNote欄に記載した上で、リストに追加して下さい。

sinumade · November 15, 2020, 4:35am

Write rather than Translate

I said, "Translation is not easy, but it can be a good alternative."
However, as @nukeador has already mentioned in Post #2 on Problems finding public domain sentences, it would be more efficient and the quality of the text would be more reliable if you create your own text rather than translate it.
Translation requires an understanding of the foreign language and the ability to edit the words properly.
For example, you can run the text through machine translation and then rework the output yourself into something completely different. As long as it is treated only as raw "material", I think anyone can make a foreign-language corpus useful.

You may not be comfortable with the idea of writing it yourself.
But, as I mentioned in Ideas for finding public domain text, it can be done by tweet, chat or email.
If you don't have that either, then it can be a description of an everyday action you're doing, a landscape. Like, "My neighbor's dog is annoying," or "I posted this on the forum but no one is liking it". Your soliloquy will help everyone.
I like those sentences, they're easy and many people will enjoy reading them.

Everyone has the secret to creating a corpus.

青空文庫 (Aozora Bunko)

公開中作家リスト：全て - 青空文庫 (List of authors of available works)
青空文庫編青空文庫収録ファイルの取り扱い規準 (Handling Standards for Aozora Bunko Files)

In Japanese

青空文庫には、著作権切れの作品を電子化したコンテンツがあります。
青空文庫から返信を頂きましたが、「取り扱い規準」のメタ情報やクレジットの希望は、あくまで「希望・期待」である、とのことでした。
著作権の発生する規定が「思想又は感情を創作的に表現したもの」で、著作物に「創作性」が必要なことを考慮すると、青空文庫内の著作権切れ作品は、パブリックドメインのままである（＝青空文庫は著作者ではない）と私は判断しました。

収集の際には、作品がパブリックドメインであるか確認することを忘れないで下さい（青空文庫には、著作権が存続する作品もあります。註記もありますが、収集する人が必ず自分で確認して下さい）。

著作権の解釈に関して誤りがあれば、ご指摘下さい。

註：旧字（正字）・旧仮名（歴史的仮名遣、正仮名遣）は、今日日常的には用いられていないので、リストから「新字新仮名」版を探すか、収集する人が新字・新仮名（現代仮名遣）に編集して下さい。

参考 (Japanese Copyright Law)：著作権法 - e-Gov法令検索

In English

Aozora Bunko contains digital versions of out-of-copyright works.
I received a reply from Aozora Bunko, stating that the metadata and credit wishes in the "取り扱い規準 (Handling Standards)" are just "hopes and expectations".
Considering the fact that the provision that creates copyright is "思想又は感情を創作的に表現したもの (creative expression of thought or feeling)" and that works need to be "創作性 (creative)", I decided that the out-of-copyright works in Aozora Bunko remain in the public domain (i.e. Aozora Bunko is not the author).

When collecting, don't forget to check whether the work is in the public domain (some works in Aozora Bunko are still under copyright. There are notes, but collectors should always check for themselves).

Please point out any errors in the interpretation of copyright.

Note: "旧字旧仮名" (Kyūjitai and Historical kana orthography) is not commonly used in Japan today, so please look for the "新字新仮名" version in the list, or have the collector convert the text to Shinjitai and Modern kana usage.