Sentence collector copyright issues

moonhouse · February 1, 2021, 5:35pm

@mkohler For the Swedish dataset, there are now a lot of sentences (more than 1000) in the queue with https://tatoeba.org/ given as source and CC-BY 2.0 FR as the stated license.

velmyshanovnyi · February 2, 2021, 1:51am

I would also raise the question of removing this HERESY from the base set.

mrkalling · February 3, 2021, 11:18pm

Hi, I added some sentences from Tatoeba to the Swedish Sentence Collector, but I have since learned that the license is incompatible with Mozilla Common Voice. So they need to be removed.

mkohler · February 7, 2021, 11:53am

Both Ukrainian and Swedish have been taken care of and the migration should be deployed in a few minutes.

moonhouse · March 23, 2021, 6:41pm

Just realized that there are quite a lot of subtitles from the movie The Big Lebowski in the Swedish dataset that likely come from OpenSubtitles and have been there at least since August 2019.

Examples:
Jag är The Dude.
Jag kan inte skicka en räkning till kinesen som pissade på mattan.

(Some of them are easily identified by the use of “The Dude”, which, apart from the copyright issue, obviously isn’t ideal for capturing pronunciation in Swedish, since most Swedish-speaking people would still use the English pronunciation for that. Also, there are some offensive sentences in there.)

mkohler · March 28, 2021, 1:08pm

The sentences have been removed from the Sentence Collector. Thanks for reporting!

hfhchan · July 2, 2021, 8:34am

Cantonese (yue):
There are at least 200 sentences from https://hkbus.fandom.com/wiki/九巴286C線, but that site is licensed under CC BY-SA 3.0.

Also the sentences submitted are not Cantonese, but Standard Written Chinese, which is a standardized form of Mandarin.

mkohler · July 6, 2021, 11:27pm

Thanks for reporting, these sentences have been removed.

phopo2 · August 7, 2021, 2:27am

Chinese - China (zh-cn):
Many sentences are from https://zh.wikisource.org/wiki/国家监委调查组负责人答记者问, which is from a Chinese government press conference transcript. It was removed from Wikisource for copyright violation.

Many sentences are also from 《毛泽东选集》, which is the Selected Works of Mao Zedong. These are not in the public domain.

mkohler · August 7, 2021, 12:18pm

Thanks for reporting, this will be taken care of with the next deployment of the Sentence Collector.

bozden · September 11, 2021, 10:21am

@mkohler @ftyers

There are about 590 sentences waiting in Turkish. I scanned the first 10-20 or so; they are from:
https://tr.wikisource.org/wiki/En_Alttakiler

It is marked as “public domain in Turkey” at the top, but as CC BY-SA at the bottom…

Also, many of them are incomplete sentences: just fragments split at every punctuation mark, including commas. So many of them are grammatically incorrect anyway…

PS: I did not review all of them… I accepted a couple, then checked the CC0 status.

Edit: I scanned the other sources in the set; they are mostly poetry with similar copyright status, such as:
https://tr.wikisource.org/wiki/Takatım_Tak_Oldu_Bican_Olmuşum
https://tr.wikisource.org/wiki/Çocuklara
https://tr.wikisource.org/wiki/Bir_Roman_Kahramanı
https://tr.wikisource.org/wiki/Sayfa:Halk_Edebiyatı_Antolojisi.pdf/238

ftyers · September 11, 2021, 2:16pm

Hey there! The CC-BY-SA at the bottom is the copyright of the MediaWiki interface. But in any case it looks like a recently published, copyrighted book, so not public domain / CC-0. If you think the sentences are worthwhile, get in contact with the authors; if not, then I’d say we should just delete them. Could you file an issue on the Sentence Collector?

I wonder if there is a way to find out who uploaded them to help them work more effectively?

mkohler · September 11, 2021, 2:33pm

No, this is exactly what this thread here is for. I’ll take care of this.

bozden · September 11, 2021, 2:39pm

The first one is a report from a Turkish NGO’s EU project. In the report it is declared as “public domain”. On the other hand, the exact wording (Google Translate, below) implies that it is not CC0; in fact it is a bit eclectic.

This report is in the public domain. It is possible to quote from the report by showing the source. All or part of the report may be printed, reproduced, photocopied, copied to electronic media or distributed widely without permission.

Anyway, most of the sentences are not complete…

The others are in fact public domain (folkloric or past the end of the protection period). But these are poems! I don’t know of any rules about these, but they are not natural sentences, of course. I had already examined Orhan Veli Kanık’s work; some verses can be appropriate, but one must pre-select them…

If we could build a Turkish community, we could get to know each other in a sub-forum, but right now I don’t know who posted these.

I would say delete them.

Thank you btw…

ftyers · September 11, 2021, 2:45pm

If it is “public domain”, then it counts as CC0, so that is ok. But in any case if the sentences are poorly segmented, I think probably we’d want to delete them anyway and just re-import them.

As for poetry, I think this is not the ideal text… we want stuff that is dialoguey as far as possible. Think software with voice interaction. As much fun as it might be to talk in Turkish or Ottoman poetry with my GPS assistant, perhaps we should start with more day-to-day texts.

Having a Turkish community subforum sounds like an excellent idea. We could do it with a Matrix chat, or perhaps ask for a subforum on Discourse?

ftyers · September 11, 2021, 2:46pm

Oops, sorry! Belay that suggestion!

bozden · September 11, 2021, 2:57pm

If you can delete the existing ones, I’ll be happy to re-add them correctly…

Matrix flows away; Discourse is best. As many native speakers do not know English well enough and information is very scattered, I would like to collect guidelines in Turkish there.

Sorry for hacking the thread…

mkohler · September 11, 2021, 3:06pm

So, to get it right, I should delete all sentences coming from the sources mentioned in your original post here (Sentence collector copyright issues)?

Anything else?

bozden · September 11, 2021, 5:35pm

Sorry for the late reply. I checked sources I could access and suggest deletion of the list below. Probably these are all of them. All are incomplete sentences or verses of poetry…

(the links got malformed in Discourse, though)

Thank you @mkohler …

mkohler · September 12, 2021, 10:57am

Thanks for the list @bozden. This has now been taken care of!
