We need a collection of text corpus links

Japanese version: 我らテキストコーパスのリンク集を作るべし ("We should build a collection of text corpus links")

Why don't we have a collection of links to text corpora? Why can't we tell who has used which corpus?
It's too inconvenient.
Or does such a collection already exist, and I'm just missing it?

There are well-known sources like Wikipedia and OSCAR, and there are also individual creators whom only a few volunteers may know about. Many communities have adopted Creative Commons licenses as well.
In Japan, Hoshizora Bunko (星空文庫) is a good example. As of September 25, 2020, 682 CC0 works are posted there. But the site is well known, so someone else may already have added it to the Sentence Collector.
There's no way to find out. I can't search the added sentences to answer "Has anyone used this site?", nor can I see where a sentence came from. (Yes, how to do this is explained in the "Sentence collector copyright issues" topic, but I didn't know it was possible until I read that topic. There should be an easier way to look it up.)

We need corpus links from two directions:

  • Volunteers who discover a corpus add a link to it.
  • Sentences sent to the Sentence Collector are linked to their source.

We then compile these links into a list and make it searchable.
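
To make the idea concrete, here is a rough sketch of what such a searchable list could look like. It is only an illustration, not an existing Common Voice or Sentence Collector feature: the `CorpusLink` and `CorpusRegistry` names, their fields, and the example.org URL are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CorpusLink:
    """One entry in the hypothetical link collection."""
    name: str
    url: str
    license: str
    language: str
    # Set to True once sentences from this source were sent to the Sentence Collector.
    used_in_collector: bool = False


class CorpusRegistry:
    """Hypothetical in-memory list of corpus links, keyed by URL."""

    def __init__(self) -> None:
        self._by_url: Dict[str, CorpusLink] = {}

    def add(self, link: CorpusLink) -> bool:
        """Register a newly discovered corpus; return False if the URL is already listed."""
        if link.url in self._by_url:
            return False
        self._by_url[link.url] = link
        return True

    def mark_used(self, url: str) -> None:
        """Record that sentences from this source were submitted."""
        self._by_url[url].used_in_collector = True

    def search(self, query: str) -> List[CorpusLink]:
        """Case-insensitive substring search over names and URLs."""
        q = query.casefold()
        return [
            link for link in self._by_url.values()
            if q in link.name.casefold() or q in link.url.casefold()
        ]


registry = CorpusRegistry()
# Placeholder URL, not the real address of the site.
registry.add(CorpusLink("Hoshizora Bunko", "https://example.org/hoshizora-bunko", "CC0", "ja"))
registry.mark_used("https://example.org/hoshizora-bunko")
for link in registry.search("hoshizora"):
    print(link.name, link.license, link.used_in_collector)
```

With something like this, "Has anyone used this site?" becomes a single search, and adding the same source twice is caught when the link is registered.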

I see the same issue, and the same potential, with text corpora. As a workaround for Traditional Chinese, we created a separate repo on GitHub and manually collect the text intended for Common Voice there.

In practice, people donate text to the corpus at local or online events, and then I submit those sentences to the Sentence Collector.