Sentence collector copyright issues

Thank you @mkohler, I reposted allowable sentences from the first link.

There are a lot of sentences from Tatoeba in the Norwegian Bokmål collection. As noted for Swedish, Tatoeba is mostly CC-BY, and only two Bokmål sentences in the entire collection are marked CC0. Bit of a shame, since it looks like that might be most of the sentences.

Incidentally, https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-47/ appears to be a high-quality source of thousands of CC0 sentences. I’m happy to import them if appropriate.

Unfortunately quite a few sentences, but it is what it is. I’ve taken care of those.

“In total, the material consists of approximately 700,000 translation pairs/sentence pairs.” sounds very promising indeed. License seems fine as well. I think importing those would be great. You might want to have a look at the bulk import process though, as going through Sentence Collector with that many sentences is not really efficient. See “Bulk submission” at https://common-voice.github.io/community-playbook/sub_pages/text.html.

I did a quick scan through the Polish sentences and found some unwanted sources from OpenSubtitles:

ā€œsourceā€: ā€œOpen Subtitles: https://www.opensubtitles.org/pl/subtitles/7665545/z-nation-at-all-cost-plā€,
ā€œsourceā€: ā€œhttps://www.opensubtitles.org/pl/subtitles/7716578/the-100-sanctum-plā€,
ā€œsourceā€: ā€œhttps://www.opensubtitles.org/pl/subtitles/7719533/guava-island-plā€,
ā€œsourceā€: ā€œhttps://www.opensubtitles.org/pl/subtitles/7723395/the-orville-sanctuary-plā€,
ā€œsourceā€: ā€œhttps://www.opensubtitles.org/pl/subtitles/7724670/brooklyn-nine-nine-he-said-she-said-plā€,
ā€œsourceā€: ā€œopen subtitlesā€,
ā€œsourceā€: ā€œopensubtitles and project gutenbergā€,
ā€œsourceā€: ā€œopensubtitles.orgā€,

Thanks. I’ve taken care of this.

In Toki Pona (tok), there are some sentences credited to http://tokisoweli.blogspot.com/, which doesn’t mention any specific rights other than “mi pana e sitelen ali mi tawa jan ale.” (“I give all my writings to everyone.”).

Thanks for reporting.

@heyhillary is this enough for us or should I remove those sentences?

Based on the CC0 waiver process, this wouldn’t be enough.

By any chance, @Sobsz, are you in contact with the author, so they could formally dedicate their works under CC0?

The author hasn’t posted publicly on the internet since 2019, so contact is unlikely. We’re doing well in terms of sentence count, though, so it’s not a big loss.

This has now been taken care of by deleting these sentences.

Please consider also removing the other sentences found by searching for the root path “https://www.studylight.org/bible/kor”, since I can see many more.
Thank you.

Thanks. I’m sceptical whether those really are copyrighted. Before I delete those, I would like to know more. @heyhillary can you have a look at this please? Thanks!

Oh sorry, I checked again and it looks fine; these appear to come from a translation whose copyright has expired.

I misunderstood because these sentences look like the most recent Bible translation (which is still covered by copyright), since they read just as strangely. These translations do not use everyday expressions, which makes them very difficult to read and understand.

  • Korean Bible Society - copyright notice (Korean) - this translation, “성경전서 개역한글판”, is listed as expired as of 2011-12-31.
  • The edition most used by Presbyterians in South Korea is the revised translation of it, “성경전서 개역개정판”, 4th edition, which is still covered by copyright (1st edition: from 1998-08-31, for 70 years).

The two translations are almost the same, at least to me (side-by-side view - press “읽기”, i.e. “Read”), since they share the same trait: unnatural, archaic expressions. Sentences like these are never used in everyday speech or writing, not even in books.

I just wanted to bump this. Although the old Sentence Collector is archived and some of its functionality has been incorporated into the main CV under the write & review pages, copyright/license-related problems are still relevant and still come up, for example in the Matrix channel.

I think the information given at the top is outdated now, but most of the Q&A after that is still good. Maybe it would be better to start a new thread using current info, with a link back to this one.

Forums are better than chats to keep the info organized, I think…