Problems finding public domain sentences

Geor · February 22, 2019, 9:52pm

Thank you for the answer. Yeah, it will be interesting to see the summary.

tinok · February 23, 2019, 7:32pm

Maybe a useful approach for others: We started using translations from the English CommonVoice corpus as a source for adding sentences in Arabic.

This requires native speakers to confirm the accuracy of the translations (because we wouldn’t want blindly translated phrases to go to the sentence collector). But it’s a starting point that may also work for other languages that struggle to find enough CC0 phrases.

tinok · April 18, 2019, 2:16pm

@nukeador Is there any new guidance on using sentences from Tatoeba? The sentences we are considering for Arabic are licensed under CC-BY. But as I wrote above, I think the license applies to the entire database, not each individual sentence (all of which have been used in many places previously, including other copyrighted material).

We have analyzed the Tatoeba database and found 31,806 Arabic sentences. They are all good quality. Would randomly selecting 5,000 or any other number for inclusion in the sentence collector be a violation of the CC license?

It would be great to have definitive guidance since I’m sure many other people are finding and collecting sentences from other sources but are unsure about the legal questions (or simply go ahead regardless).

Maybe it would be possible to reach out to Tatoeba so that including random subsets (rather than the entire database) gets an explicit exemption?

nukeador · April 24, 2019, 1:11pm

Later this week I’ll be posting an update on sentence collection from the team that I hope will help in this matter.

ktaa · May 28, 2019, 2:38pm

I uploaded around 14,000 Modern Standard Arabic sentences to the sentence collector that need verification.

davidak · June 8, 2019, 3:30pm

Any update on the usage of CC-BY?

nukeador · June 10, 2019, 1:40pm

For now we should stick with public domain. The update I posted was about the “fair-use” of some large sources of text and the work we started doing with Wikipedia.
