A collection of links to text corpora. Feel free to add to it.
Even if there is no corpus in your language, the corpora listed here are in the public domain, so they can be translated. Translation is not easy, but it can be a good alternative.
Of course, it will also help those who use it for purposes other than Common Voice.
Sentence collection is the starting point for voice recordings and datasets, and an important part of Common Voice. Please share any corpus you know of and help everyone.
- A language needs 5,000 validated sentences before voice collection can start.
- Proper training of the system requires 1,800,000 sentences (for 2,000 hours of voice data).
- ref. My language is now collecting voice, what do I need to know?
- ref. Mozilla Voice Community Playbook
- How to find public domain texts:
- Before adding the corpus to the list, please search this wiki. Press Ctrl and F on your keyboard to bring up your browser’s search bar.
- Corpora should be grouped by language to make them easier to find. Individual corpora should be added at the bottom.
- If you don’t know if you can add a corpus to the list, please ask.
- Languages that are not on Common Voice can be added to the list, but it is recommended that you post a topic requesting a language to be added to the Common Voice category first. See the topic Readme: How to see my language on Common Voice.
- Check the license carefully. Especially if:
- the author and publisher of the text are different.
- the corpus is large.
- there is more than one author.
- If there is a copyright issue, the sentence will be removed from the sentence collection. Voice data may become unusable as well. Everyone’s efforts will be for naught.
Even if you are not interested in exploring, collecting, or validating a corpus, confirming that it really is in the public domain is a great contribution. Please be the “gatekeeper” for everyone’s efforts.
How to fill each field
- Leave a field blank if you don’t know the details.
1. Corpus
Link to the corpus.
- Link to the content we (will) collect.
- If a list exists, link to the root of the list (a page with full view of the public domain works).
- If only a portion of the content is in the public domain, mention that in the Note field.
- Be precise and concise with the name of the corpus.
- If you don’t know what it is, write the title of the page. (Page headings, browser title bar, etc.)
- If there is a specific version of the corpus, state that as well.
- e.g. The Sinumade Book of Adventures (2020 edition)
2. Language
The language must be written as it appears in the Sentence Collector. For example, Chinese is written as Chinese regardless of region.
- If there is more than one language, separate them with commas and list them in alphabetical order. Example: English, French, German
- If the language is not in the Sentence Collector, append a + sign to its name. Example: English+
3. State
If possible, mark the following:
- CC0: The corpus text itself indicates permission, or it links to a document that indicates permission.
- PD: Public Domain. This mainly means a work whose copyright has expired. If the copyright holder has waived the rights to the work, mark it CC0 instead.
4. Permission
A link to a document that states the permission for the corpus.
- Related documents other than the permission should be listed in the Note field.
5. Note
Anything to consider when collecting. For example, limitations on collection (e.g., only part of the corpus can be collected) or the need to edit the text first.
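The field rules above can be sketched in Python as a small helper that turns one entry into a table row. This is only an illustration, not an official tool: `CorpusEntry`, `format_languages`, `to_table_row`, and the `known_languages` set (standing in for the languages offered by the Sentence Collector) are hypothetical names introduced here, and "Examplish" is a made-up language.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    """One row of the corpus list. Unknown fields stay blank."""
    corpus: str                                   # name of the corpus (required)
    languages: list[str] = field(default_factory=list)
    state: str = ""                               # "CC0", "PD", or blank
    permission: str = ""                          # link to the permission document
    note: str = ""                                # limitations, "Contacted ...", etc.


def format_languages(languages, known_languages):
    """Sort names alphabetically and join them with commas.

    Names missing from `known_languages` (an assumed stand-in for the
    Sentence Collector's language list) get a trailing "+".
    sorted() uses plain string order, which is alphabetical for
    capitalized English names.
    """
    labelled = [
        name if name in known_languages else name + "+"
        for name in sorted(languages)
    ]
    return ", ".join(labelled)


def to_table_row(entry, known_languages):
    """Render the entry as a Markdown table row in the style used on this page."""
    cells = [
        entry.corpus,
        format_languages(entry.languages, known_languages),
        entry.state,
        entry.permission,
        entry.note,
    ]
    # Escape literal pipes so they cannot split a cell.
    cells = [cell.replace("|", "\\|") for cell in cells]
    # Blank cells are written as a bare "|", matching the existing tables.
    return " ".join(f"{cell} |" if cell else "|" for cell in cells)


known = {"English", "French", "German"}
entry = CorpusEntry("mlog", ["English"], "CC0")
print(to_table_row(entry, known))   # mlog | English | CC0 | | |
print(format_languages(["German", "Examplish", "English"], known))
```

Blank fields are kept as empty cells rather than filled with guesses, which follows the "leave blank" rule above.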
Appropriate corpus
A corpus that has been confirmed to be in the public domain.
- As a general rule, only content created by the author may be collected. Content that is quoted, reproduced from other sources, or otherwise referenced is excluded. The distinction must be made by a real person.
- In most cases, `blockquote` and `q` elements are used in HTML to indicate a citation. (In some cases, these elements are not used.)
- If you have contacted the rights holder and received permission, mark Contacted in the Note field. If possible, include the date of permission. For example,
- Contacted 2020-10-25
- Whenever possible, link to the document that indicates permission.
- Before collecting, it is recommended that:
- you pick a few lines at random and search for them on Google (or whichever search engine returns the most hits for that language), to check whether the corpus was copied illegally from another source.
- you save a copy of the work in the Wayback Machine, as the author may delete it.
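As a rough aid for the checks above, the following Python sketch separates text inside `blockquote` and `q` elements from the rest of a page and then picks a few lines at random for a manual web search. It uses only the standard library; the names `QuoteAwareExtractor`, `split_quotes`, and `sample_for_search` are hypothetical. It only flags candidates: pages do not always mark citations with these elements, so a real person still has to make the final call.

```python
import random
from html.parser import HTMLParser


class QuoteAwareExtractor(HTMLParser):
    """Collects page text, keeping quoted text apart from the author's own."""

    QUOTE_TAGS = {"blockquote", "q"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many quote elements we are currently inside
        self.own_text = []      # text outside blockquote/q
        self.quoted_text = []   # text inside blockquote/q

    def handle_starttag(self, tag, attrs):
        if tag in self.QUOTE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.QUOTE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.quoted_text if self.depth else self.own_text
            target.append(text)


def split_quotes(html):
    """Return (own_text, quoted_text) as two lists of text fragments."""
    parser = QuoteAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.own_text, parser.quoted_text


def sample_for_search(lines, count=3, seed=None):
    """Pick up to `count` lines at random to paste into a search engine."""
    rng = random.Random(seed)
    return rng.sample(lines, min(count, len(lines)))


page = (
    "<p>My own sentence.</p>"
    "<blockquote><p>Words from another author.</p></blockquote>"
    "<p>He said <q>hello</q> to me.</p>"
)
own, quoted = split_quotes(page)
print("Likely the author's own text:", own)
print("Marked as quoted, review by hand:", quoted)
print("Search these:", sample_for_search(own, count=2))
```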
Corpus | Language | State | Permission | Note |
---|---|---|---|---|
Wikipedia | | | | There are limitations. See: Sentence Extractor - Current Status and Workflow Summary |
GitHub - irvin/cc0-sentences | Chinese | CC0 | | |
mlog | English | CC0 | Everything by me – Happy GNU Year & Public Domain Day – mlog | |
zen habits | English | CC0 | Uncopyright : zen habits | Leo’s ebooks are also in the public domain. |
mnmlist | English | CC0 | » uncopyright mnmlist | |
星空文庫 | Japanese | CC0 | | Only the Public Domain category can be collected |
deztec.jp | Japanese | CC0 | Info/趣味のWebデザイン | Contacted 2020-10-28 |
死ぬまで憶えておいて | Japanese | CC0 | sinumade.net 槪要 | Written in an uncommon style (historical kana orthography and Kyūjitai) and needs to be edited |
青空文庫 | Japanese | PD | | Contacted 2020-11-14 / Only out-of-copyright works can be collected / See: Post #3 |
Candidate corpus (DO NOT USE this corpus)
A corpus that has not been confirmed to be in the public domain.
- It is necessary to contact the rights holder to obtain permission to use this corpus.
- If permission is granted, move the corpus to the Appropriate corpus section and mark Contacted in the Note field.
- If permission is refused, move the corpus to the Invalid corpus section and mark non-permitted in the Note field.
Corpus | Language | State | Permission | Note |
---|---|---|---|---|
Invalid corpus (DO NOT USE this corpus)
A corpus that must not be used.
For example, a corpus that was used but found to be inappropriate.
- If possible, state in the Note field why it is invalid.
- If you couldn’t get permission from the rights holder, mark non-permitted in the Note field.
- If there is a problem with the corpus, mark Problem in the Note field and be specific and concise. (Details should be available for reference on a separate document.)
Corpus | Language | State | Permission | Note |
---|---|---|---|---|
Tanaka Corpus (Public Domain version) | English, Japanese | | | Problem: copyright issues: Post #11 on Sentence collector copyright issues / It was in the Japanese language collection until October 2020. |
Supplement
- Offline resources can also be added to the list. If so, please mark Offline in the Note field and include how to access it. For example,
- Offline : Stored in the sinumade library. Reader’s card required to view.
- You can edit the wiki if your Discourse trust level is Member or higher. Otherwise, please share the information in a reply, and someone else will add it to the wiki.
- Please include the information for each field in your reply. A list is fine. For example,
- Corpus :
- Language :
- State :
- Permission :
- Note :
- If the information is only a name or a link, those adding to the wiki should check the corpus and fill in the fields.
- ref. What is a Wiki Post? - howto / faq - Discourse Meta
- How to write a table:
- If you have concerns about the use of a corpus, do not hesitate to share the information with us.
- The original poster is not very good at English, so please improve the wording of the wiki where you can.
- Feedback and questions are welcome.
- Discussions about the corpus are always welcome. However, perhaps this should be a separate topic.
Matters for consideration
- For example, how about marking it as a WIP (work in progress) when someone is collecting?
- Pros: Avoid duplication of work.
- Cons: We must mark the wiki at the beginning and end of the collection.
- Cons: Volunteers may abandon the work partway through.
- Do we need a Collected corpus section, or Collected marks? (To avoid duplication of work.)
- The table is written in Markdown, as it was when first posted. Would you prefer HTML?
- HTML:
- Pros: Accessibility. Even just the `caption` element, the `th` element, and the `scope` attribute are better than nothing.
- Pros: Can make line breaks in cells. Can include lists, quotes, etc.
- Cons: HTML isn’t too difficult, but it’s hard to notice a typo in a tag. Discourse doesn’t validate tags.
- Markdown:
- Pros: Anyone can write it without prior knowledge. (Just select it in the editor.)
- Pros: Bold and links are easy to do.
- Cons: Can’t make line breaks in cells. Can’t include lists, quotes, etc.
- How to indicate the language of the table
- Is it better to use the language code?
- If possible, the original poster would like the wiki to be simple and understandable for everyone.
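To make the HTML option more concrete, here is a minimal Python sketch that renders the same five columns as an HTML table using the `caption` element, `th` elements, and the `scope` attribute mentioned above. The function name `render_html_table` is hypothetical, and whether Discourse keeps every element and attribute after sanitizing a post has not been verified here.

```python
from html import escape

HEADERS = ["Corpus", "Language", "State", "Permission", "Note"]


def render_html_table(caption, rows):
    """Render rows (lists of five strings) as an HTML table.

    Column headers get scope="col"; the first cell of each row names the
    corpus, so it is emitted as a row header with scope="row".
    """
    lines = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <tr>"]
    lines += [f'    <th scope="col">{escape(h)}</th>' for h in HEADERS]
    lines.append("  </tr>")
    for row in rows:
        lines.append("  <tr>")
        lines.append(f'    <th scope="row">{escape(row[0])}</th>')
        # escape() keeps stray "<" or "&" in a cell from breaking the markup.
        lines += [f"    <td>{escape(cell)}</td>" for cell in row[1:]]
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(render_html_table(
    "Appropriate corpus",
    [["mlog", "English", "CC0", "", ""]],
))
```

Generating the markup from data is one way to reduce the tag-typo risk listed under Cons, at the cost of an extra step for editors.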
Note
- The goal is to share the corpora we each know about, so that collection in every language becomes more efficient and active. There is a concern that sharing information may cause confusion in the work, so I would like to hear your opinions on this.
- Invalid corpora in particular should be shared so that volunteers’ efforts are not wasted.
- Another aim is to find inappropriate corpora.
- @sinumade wants to publish this wiki and list in the public domain. I asked in Can I waive my copyright? whether posts can be placed in the public domain and which license applies to them, but I did not get an official answer from Mozilla (as of 2020-10-30). If I can, I will. I would like anyone who edits this wiki to do so with the intention of waiving their copyright.