Sentence collector copyright issues

If it is “public domain”, then it counts as CC0, so that is ok. But in any case if the sentences are poorly segmented, I think probably we’d want to delete them anyway and just re-import them.

As for poetry, I think this is not the ideal text… we want stuff that is dialoguey as far as possible. Think software with voice interaction. As much fun as it might be to talk in Turkish or Ottoman poetry with my GPS assistant, perhaps we should start with more day-to-day texts :smiley:

Having a Turkish community subforum sounds like an excellent idea. We could do it with a Matrix chat, or perhaps ask for a subforum on Discourse?

Oops, sorry! :slight_smile: Belay that suggestion!

If you can delete the existing ones, I’ll be happy to re-add them correctly…

Matrix chat scrolls away; Discourse is best. As many native speakers do not know English well enough and the information is very scattered, I would like to collect guidelines in Turkish there.

Sorry for hijacking the thread…

So, to get it right: should I delete all sentences coming from the sources mentioned in your original post in “Sentence collector copyright issues”?

Anything else?

Sorry for the late reply. I checked the sources I could access and suggest deleting the ones listed below. These are probably all of them. All are incomplete sentences or verses of poetry…

(the links got malformed in Discourse, though)

https://tr.wikisource.org/wiki/En_Alttakiler
https://tr.wikisource.org/wiki/Takatım_Tak_Oldu_Bican_Olmuşum
https://tr.wikisource.org/wiki/Çocuklara
https://tr.wikisource.org/wiki/Bir_Roman_Kahramanı
https://tr.wikisource.org/wiki/Sayfa:Halk_Edebiyatı_Antolojisi.pdf/238
https://tr.wikisource.org/wiki/Sayfa:Halk_Edebiyatı_Antolojisi.pdf/237
https://tr.wikisource.org/wiki/Sayfa:Halk_Edebiyatı_Antolojisi.pdf/236
https://tr.wikisource.org/wiki/Arzulayıp_Çıktım_Gurbet_Eline
https://tr.wikisource.org/wiki/Festival
https://tr.wikisource.org/wiki/Sayfa:C.H.P.15.Y%C4%B1l_Kitab%C4%B1(1938).pdf/588
https://tr.wikisource.org/wiki/Mesnevi
(Konuk)/1.Defter/1451-1500
https://tr.wikisource.org/wiki/Türkiye_Cumhuriyeti_Nafıa_Vekâleti_Devlet_Demiryolları_Samsun-Sivas_Demiryolu_Amasya
%C4%B0stasyonu%27nun_%C4%B0%C5%9Fletmeye_K%C3%BC%C5%9Fad%C4%B1/Foto%C4%9Fraflar
https://tr.wikisource.org/wiki/Pireli_Şiir
https://tr.wikisource.org/wiki/Yeni_Ahit/Luka/21
https://tr.wikisource.org/wiki/Nutuk/20.b%C3%B6l%C3%BCm/Vesika_162
https://tr.wikisource.org/wiki/Yazın_Evel_Baharında
https://tr.wikisource.org/wiki/Denizde_Akşam
https://tr.wikisource.org/wiki/Sayfa:Kutadgu_Bilig_Tıpkıbasım
(Fergana_N%C3%BCshas%C4%B1).pdf/5
https://tr.wikisource.org/wiki/Nutuk/20._bölüm/Vesika_184
https://tr.wikisource.org/wiki/Nutuk/20.b%C3%B6l%C3%BCm/Vesika_155
https://tr.wikisource.org/wiki/Sayfa:Bektaşi
%C5%9Eairleri_ve_Nefesleri_19_As%C4%B1ra_Kadar_Cilt_1-2.pdf/20
https://tr.wikisource.org/wiki/Karanlık
https://tr.wikisource.org/wiki/Kızılcık
https://tr.wikisource.org/wiki/Nutuk/20.b%C3%B6l%C3%BCm/Vesika_256
https://tr.wikisource.org/wiki/İstanbul'da_Semai_Kahveleri_ve_Meydan
%C5%9Eairleri
https://tr.wikisource.org/wiki/İnsanlar
https://tr.wikisource.org/wiki/C.H.P._Dördüncü_Büyük_Kurultayında_Genel_Başkan_Kamâl_Atatürk'ün_Söylevi
https://tr.wikisource.org/wiki/Halk_Edebiyatı_Antolojisi/Karacaoğlan
https://tr.wikisource.org/wiki/Halk_Edebiyatı_Antolojisi/Öksüz_Dede
https://tr.wikisource.org/wiki/Nutuk/7.b%C3%B6l%C3%BCm/Anzavur_isyanlar%C4%B1,D%C3%BCzce%C4%B0syan%C4%B1
https://tr.wikisource.org/wiki/Bektaşi
%C5%9Eairleri_ve_Nefesleri/Kaygusuz_Abdal

Thank you @mkohler

Thanks for the list @bozden. This has now been taken care of!

Thank you @mkohler, I reposted allowable sentences from the first link.

There are a lot of sentences from Tatoeba in the Norwegian Bokmål collection. As noted for Swedish, the Tatoeba collection is mostly CC-BY, and only two Bokmål sentences in the entire collection are marked CC0. A bit of a shame, since it looks like these might be most of the sentences.

Incidentally, https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-47/ appears to be a high-quality source of thousands of CC0 sentences. I’m happy to import them if appropriate.

Unfortunately quite a few sentences, but it is what it is. I’ve taken care of those.

“In total, the material consists of approximately 700,000 translation pairs/sentence pairs.” sounds very promising indeed. License seems fine as well. I think importing those would be great. You might want to have a look at the bulk import process though, as going through Sentence Collector with that many sentences is not really efficient. See “Bulk submission” at https://common-voice.github.io/community-playbook/sub_pages/text.html.

I did a quick scan through the Polish sentences and found some unwanted sources from OpenSubtitles:

"source": "Open Subtitles: https://www.opensubtitles.org/pl/subtitles/7665545/z-nation-at-all-cost-pl",
"source": "https://www.opensubtitles.org/pl/subtitles/7716578/the-100-sanctum-pl",
"source": "https://www.opensubtitles.org/pl/subtitles/7719533/guava-island-pl",
"source": "https://www.opensubtitles.org/pl/subtitles/7723395/the-orville-sanctuary-pl",
"source": "https://www.opensubtitles.org/pl/subtitles/7724670/brooklyn-nine-nine-he-said-she-said-pl",
"source": "open subtitles",
"source": "opensubtitles and project gutenberg",
"source": "opensubtitles.org",
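
A scan like this can be sketched in a few lines of Python. This is only an illustration, assuming the sentences are available as a list of records with a "source" field like the ones above. The function name `find_unwanted`, the placeholder sentence texts, and the "own work" source are made up for the example and are not part of Sentence Collector.

```python
import re

# Hypothetical pattern: matches "opensubtitles", "open subtitles",
# "Open Subtitles" and similar spellings, ignoring case.
UNWANTED = re.compile(r"open\s*subtitles", re.IGNORECASE)


def find_unwanted(records):
    """Return the records whose "source" field mentions OpenSubtitles."""
    return [r for r in records if UNWANTED.search(r.get("source", ""))]


# Sample data: three sources copied from the list above, plus one
# invented harmless source. The sentence texts are placeholders.
records = [
    {"sentence": "placeholder 1",
     "source": "https://www.opensubtitles.org/pl/subtitles/7719533/guava-island-pl"},
    {"sentence": "placeholder 2", "source": "open subtitles"},
    {"sentence": "placeholder 3", "source": "opensubtitles and project gutenberg"},
    {"sentence": "placeholder 4", "source": "own work"},
]

flagged = find_unwanted(records)
for r in flagged:
    print(r["source"])
```

Matching on the free-text source rather than on an exact URL is deliberate here, because the sources above were entered inconsistently (full links, bare domain, plain words).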

Thanks. I’ve taken care of this.

In Toki Pona (tok), there are some sentences credited to http://tokisoweli.blogspot.com/, which doesn’t mention any specific rights other than “mi pana e sitelen ali mi tawa jan ale.” (“I give all my writings to everyone.”).

Thanks for reporting.

@heyhillary is this enough for us or should I remove those sentences?

Based on the CC0 waiver process, this wouldn’t be enough.

@Sobsz, are you by any chance in contact with the author, so they could formally dedicate their works under CC0?

The author hasn’t posted publicly on the internet since 2019, so contact is unlikely. We’re doing well in terms of sentence count, though, so it’s not a big loss.

This has now been taken care of by deleting these sentences.

Please consider removing the other sentences as well by searching for the root path “https://www.studylight.org/bible/kor”, since I can see many more.
Thank you.
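
To make “searching by root path” concrete, here is a minimal sketch, assuming each sentence carries its source as a URL string. The helper name `under_root` and the example paths in the usage lines are hypothetical; only the root URL comes from this post.

```python
from urllib.parse import urlparse

ROOT = "https://www.studylight.org/bible/kor"


def under_root(source, root=ROOT):
    """True if `source` is on the same host and its path lies under the root path."""
    s, r = urlparse(source.strip()), urlparse(root)
    if s.netloc.lower() != r.netloc.lower():
        return False
    root_path = r.path.rstrip("/")
    # Require a path-segment boundary so ".../bible/korx" does not match.
    return s.path == root_path or s.path.startswith(root_path + "/")


# Hypothetical example sources (the paths after the root are invented).
print(under_root("https://www.studylight.org/bible/kor/some/page"))  # True
print(under_root("https://www.studylight.org/bible/korx/page"))      # False
print(under_root("https://example.org/bible/kor/page"))              # False
```

Comparing the parsed host and path instead of doing a plain string prefix check ignores the scheme (http vs. https) and avoids matching sibling paths that merely start with the same characters.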

Thanks. I’m sceptical whether those really are copyrighted. Before I delete those, I would like to know more. @heyhillary can you have a look at this please? Thanks!

Oh sorry, I checked again and it looks fine; these appear to be from a copyright-expired version of the translation.

I misunderstood because these sentences look like the most recent version of the Bible translation (which is still covered by copyright), since they read just as oddly. These translations do not use everyday expressions, which makes them very difficult to read and understand.

  • Korean Bible Society copyright notice (in Korean): this translation, “성경전서 개역한글판”, is listed as expired on 2011-12-31.
  • The edition most used by Presbyterians in South Korea is a revised translation of it, “성경전서 개역개정판”, 4th edition, which is still covered by copyright (1st edition: 1998-08-31, protected for 70 years from then).

The two translations are almost the same, at least to me (side-by-side view; press “읽기”), since they share the same property: unnatural, archaic expressions. Sentences like these are never used in everyday speech or writing, not even in books.