Thank you for the answer. Yeah, it will be interesting to see the summary.
Maybe a useful approach for others: We started using translations from the English Common Voice corpus as a source for adding sentences in Arabic.
This requires native speakers to confirm the accuracy of the translations (because we wouldn’t want blindly translated phrases to go to the sentence collector). But it’s a starting point that may also work for other languages that struggle to find enough CC0 phrases.
@nukeador Is there any new guidance on using sentences from Tatoeba? The sentences we are considering for Arabic are licensed under CC-BY. But as I wrote above, I think the license applies to the entire database, not each individual sentence (all of which have been used in many places previously, including other copyrighted material).
We have analyzed the Tatoeba database and found 31,806 Arabic sentences. They are all good quality. Would randomly selecting 5,000 or any other number for inclusion in the sentence collector be a violation of the CC license?
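For reference, the random selection itself could look roughly like the sketch below. It assumes the sentences are already loaded into a Python list; the function name `sample_sentences` and the fixed seed are just illustrative choices, not anything from Tatoeba or the sentence collector. It only shows how the draw could be made reproducible and does not touch the licensing question.

```python
import random


def sample_sentences(sentences, k, seed=0):
    """Return k distinct sentences chosen uniformly at random.

    Hypothetical helper: `sentences` is any iterable of strings.
    Blank lines and exact duplicates are dropped before sampling.
    """
    # Sort the de-duplicated sentences so the result depends only on
    # the seed, not on the input order or on set iteration order.
    unique = sorted({s.strip() for s in sentences if s.strip()})
    if k > len(unique):
        raise ValueError(f"asked for {k} sentences, only {len(unique)} available")
    rng = random.Random(seed)  # local generator; global random state untouched
    return rng.sample(unique, k)


if __name__ == "__main__":
    # Placeholder data standing in for the 31,806 Arabic sentences.
    corpus = [f"sentence {i}" for i in range(31806)]
    subset = sample_sentences(corpus, 5000)
    print(len(subset))
```

Keeping the seed fixed means anyone can regenerate the same 5,000 sentences later, which helps if the subset ever needs to be audited or withdrawn.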
It would be great to have definitive guidance since I’m sure many other people are finding and collecting sentences from other sources but are unsure about the legal questions (or simply go ahead regardless).
Would it be possible to reach out to Tatoeba and ask for an explicit exemption for including random subsets (rather than the entire database)?
Later this week I’ll be posting an update on sentence collection from the team that I hope will help in this matter.
I uploaded around 14,000 Modern Standard Arabic sentences to the sentence collector that need verification.
Any update on the usage of CC-BY?
For now we should stick with public domain. The update I posted was about the “fair-use” of some large sources of text and the work we started doing with Wikipedia.