We need a text corpus link

sinumade · September 25, 2020, 7:19pm

Why don't we have a collection of links to text corpora? Why can't we find out who has used which corpus?
It's too inconvenient.
Or is there already such a collection, and I'm just missing it?

There are well-known sources like Wikipedia and OSCAR, and there are also individual creators who may be known only to individual volunteers. There are probably also many communities that have adopted Creative Commons licenses.
In Japan, 星空文庫 (Hoshizora Bunko) is a good example. As of September 25, 2020, there are 682 CC0 works posted there. But the site is well known, so maybe someone else has already added it to the Sentence Collector.
There's no way to know that. "Has anyone used this site?" I can't search the sentences that have been added, nor can I find out their source. (Yes, how to do it is explained in the "Sentence collector copyright issues" topic, but I didn't know it was possible until I read that topic. There should be an easier way to look it up.)

We need corpus links from two directions:

Volunteers who discover a corpus add a link to it.
Sentences sent to the Sentence Collector are linked to their source.

Then we make a list of these links and make it searchable.
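
To make the idea a bit more concrete, here is a minimal sketch of what such a searchable list could look like. Everything in it is an assumption for illustration: the `CorpusLink` and `CorpusRegistry` names, the fields, and the `example.org` URL are hypothetical, and none of this is an existing Sentence Collector feature. Only the "CC0" license for Hoshizora Bunko comes from the post above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CorpusLink:
    """One entry in a hypothetical corpus link list."""
    name: str
    url: str
    license: str
    language: str
    submitted: bool = False  # True once sentences from it were sent to the Sentence Collector
    notes: str = ""


class CorpusRegistry:
    """In-memory list of corpus links, keyed by a normalized URL."""

    def __init__(self) -> None:
        self._entries: Dict[str, CorpusLink] = {}

    @staticmethod
    def _key(url: str) -> str:
        # Treat URLs that differ only in case or a trailing slash as the same corpus.
        return url.strip().rstrip("/").lower()

    def add(self, link: CorpusLink) -> bool:
        """Direction 1: a volunteer adds a corpus. Returns False if it is already listed."""
        key = self._key(link.url)
        if key in self._entries:
            return False
        self._entries[key] = link
        return True

    def mark_submitted(self, url: str) -> bool:
        """Direction 2: record that this source was sent to the Sentence Collector."""
        entry = self._entries.get(self._key(url))
        if entry is None:
            return False
        entry.submitted = True
        return True

    def search(self, query: str) -> List[CorpusLink]:
        """Case-insensitive substring search over name, URL, and notes."""
        q = query.lower()
        return [
            e for e in self._entries.values()
            if q in e.name.lower() or q in e.url.lower() or q in e.notes.lower()
        ]


if __name__ == "__main__":
    registry = CorpusRegistry()
    # Placeholder URL, not the real site address.
    registry.add(CorpusLink("Hoshizora Bunko", "https://example.org/hoshizora-bunko", "CC0", "ja"))
    for hit in registry.search("hoshizora"):
        status = "already submitted" if hit.submitted else "not yet submitted"
        print(hit.name, hit.license, status)
```

With something like this, "Has anyone used this site?" becomes a single `search` call, and `add` returning `False` tells a volunteer that the corpus is already on the list.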

irvin · October 7, 2020, 7:29pm

I see the same issue, and the same potential, in a text corpus, so as a workaround for Traditional Chinese we created a separate repo on GitHub and manually collect the text intended for Common Voice there.

In practice, people donate text to the corpus at local or online events, and then I submit those sentences to the Sentence Collector.
