Sentence collector copyright issues

@sinumade thanks again for flagging this. After review by our Mozilla Legal counterparts, it’s been determined that this corpus is not fit for CC0 contribution and all usage should be removed from Common Voice.

@mkohler will work to remove this from the sentence collector, and any sentences that were merged to the primary platform for voice contribution will also be removed from the dataset. cc @phirework who’ll need to take that action.

Japanese language collection, again.

Are the reviewers Rrock9312 and Rrock2139 the same person?


I checked common-voice/sentence-collector.json. It looks like not all of the current Japanese source text is in the public domain. That's a shame.

  • If the current 5,528 sentences are deleted, will voice recording no longer be possible (i.e., because fewer than 5,000 sentences remain)?
  • If the full text is not available, will the recordings also be removed from the dataset, meaning the dataset cannot be provided? (So we can't use the voices that were already recorded?)
  • Presumably the current Sentence Collector contains the above source and the text I've added, but are there any other texts? I'd like to check whether there are any... I don't trust the old collection. (Of course, I want a third party to verify the source I used too.)

  • We may need to verify the source provided by navta. This person participates in other languages as well (at least in the English collection).
  • Seriously, we should ask for volunteers to verify all language sources (I really want the Mozilla staff to verify it, but they'll need time to do so). There are too many resources to lose, including the voice, the dataset itself, etc.
    • Go to the source and check the license. Randomly extract text from the source and search the web for it (to check that the source is not an unauthorized reproduction).
    • However, it's useless unless it's verified by a trusted person. Just pressing the confirm button is not proof.
  • commonvoice.mozilla.org should mention this danger. Contributors are currently reading text that is not in the public domain, and users of the dataset are using data they are not allowed to use.
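
The spot check described in the list above could be sketched roughly like this. It is only a minimal sketch: the function name and the idea of passing in a plain list of sentences are my assumptions, not something the Sentence Collector provides. It just picks a few sentences at random so that a person can paste them into a web search and compare against the claimed source.

```python
import random


def sample_for_spot_check(sentences, k=5, seed=None):
    """Return up to k distinct sentences chosen at random for a manual web search.

    Hypothetical helper: `sentences` is assumed to be a plain list of strings
    that all claim the same source.
    """
    rng = random.Random(seed)
    # Deduplicate first so the same sentence is never checked twice;
    # sorting makes the result reproducible for a given seed.
    unique = sorted(set(sentences))
    return rng.sample(unique, min(k, len(unique)))


if __name__ == "__main__":
    pool = ["sentence one", "sentence two", "sentence three", "sentence one"]
    for sentence in sample_for_spot_check(pool, k=2):
        print(sentence)
```

The sample only tells you where to look. Someone still has to open the source, read its license, and judge whether the page is an unauthorized copy.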

This is insane, and a betrayal of the people.

Ping @mkohler for the mentioned copyright violations.
Both Rrock users look suspicious, and there is also some asdf user? Could you check whether any of them have marked any sentences invalid? If they have none, I’d say those are most definitely invalid reviews and should be deleted (or the Japanese submissions contain no mistakes at all, but I have a hard time believing that).
If fewer than 5k sentences remain, I’m afraid Japanese collection will have to stop once again (but either way, 5k is too few to expect much useful collecting to take place).
If something has already been recorded, what to do about it is a question for more qualified people than me, but it could unfortunately end badly in a lot of ways.
Having a second party check the sources would be great, but at least reviews now show the sentence sources, which is still better than nothing.
Mozilla staff can’t viably verify all of the sources for all of the languages, if only because they don’t speak them. The best you can do is defer to the local communities.
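
For the reviewer check, something along these lines might work, assuming the review votes can be exported as (reviewer, approved) pairs. That export format, the function name and the `min_votes` threshold are all hypothetical. I don't know how the Sentence Collector actually stores votes.

```python
from collections import Counter


def reviewers_without_rejections(votes, min_votes=50):
    """List reviewers who cast at least min_votes votes and never rejected anything.

    `votes` is assumed to be an iterable of (reviewer_name, approved) pairs,
    where approved is True for "valid" and False for "invalid".
    """
    total = Counter()
    rejected = Counter()
    for reviewer, approved in votes:
        total[reviewer] += 1
        if not approved:
            rejected[reviewer] += 1
    # A reviewer with many votes and zero rejections deserves a manual look.
    return sorted(
        name for name, count in total.items()
        if count >= min_votes and rejected[name] == 0
    )
```

A name on this list is not proof of bad faith, only a hint about which reviews to re-check by hand first.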

@mkohler Hindi dataset: https://tatoeba.org/eng is given as the source. I checked five random sentences with this source, and all of them have an incompatible license (CC-BY 2.0 FR).

Both the Japanese and Hindi issues have been resolved.

Sorry, I can’t access the ‘kinto’ URL, so I can’t provide this in the requested format.
The Ukrainian sentences contain a lot of text from a single book, and they are of very poor quality.

Source: cetation of Володимир Білінський
The book is
“Білінський Володимир Броніславович. КРАЇНА МОКСЕЛЬ, або МОСКОВІЯ”
It is a very “holy-war” and politicized one.

Sentence:
“Ростовсько-суздальська земля до приходу Рюриковичів зі своїми ватагами давно була заселена фінськими племенами”
Link

Sentence:
“Початковий київський літопис доволі точно позначає місця проживання цих племен: він знає”
Link

Sentence:
“Здавалося б, за церковними канонами, «царство Боже й царство кесареве мусять залишатися навіки розділеними»”
Link

Sentence:
“Карамзін був палким великоросом на службі в імперії й Государя”
Link

Link to the Google Books copy (not sure if it’s a legal copy)
Google Book

And yet another one. It contains a lot of old/rare words. I don’t even know the meaning of some of them.

Book “Меч Арея” by Іван Білик. Link to online copy here

Sentence:
“І не питаєш, чого ради смо прийшли?”
Link

Sentence:
“Пощо тобі князь? — жваво всміхнувся отрок у сорочці.”
Link
Link to source

Sentence:
“Богдан пропустив його слова повз вуха, й усі троє посідали на коней.”
Link

Also, there are a lot of pending sentences from this book, all marked as
“Source: cetation of Іван Білик”. It would be nice to clear them automatically.

These two books take all the pleasure out of the project. Unfortunately, 80% of the sentences now come from them.
Could you resolve it somehow, please?
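
Clearing by source could be sketched like this, assuming each pending sentence is available as a record with `sentence` and `source` fields. That record shape and the function name are hypothetical, not the Sentence Collector's real schema. The match is on the exact source string, ignoring case and surrounding whitespace.

```python
def split_by_source(records, blocked_sources):
    """Separate records whose source string matches one of blocked_sources.

    `records` is assumed to be a list of dicts with "sentence" and "source" keys.
    Returns (kept, removed) so the removed sentences can be reviewed before deletion.
    """
    blocked = {source.strip().casefold() for source in blocked_sources}
    kept, removed = [], []
    for record in records:
        source = record.get("source", "").strip().casefold()
        if source in blocked:
            removed.append(record)
        else:
            kept.append(record)
    return kept, removed


if __name__ == "__main__":
    pending = [
        {"sentence": "example A", "source": "Source: cetation of Іван Білик"},
        {"sentence": "example B", "source": "my own writing"},
    ]
    kept, removed = split_by_source(pending, ["Source: cetation of Іван Білик"])
    print(len(kept), "kept,", len(removed), "to remove")
```

Returning the removed records instead of dropping them silently lets someone skim the list before anything is actually deleted.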

@mkohler For the Swedish dataset, there are now a lot of sentences (more than 1000) in the queue with https://tatoeba.org/ given as source and CC-BY 2.0 FR as the stated license.

I would also raise the question of removing this HERESY from the base set.

Hi, I added some sentences from Tatoeba to the Swedish Sentence Collector, but I have since learned that the license is incompatible with Mozilla Common Voice, so they need to be removed.

Both Ukrainian and Swedish have been taken care of, and the migration should be deployed in a few minutes.

Just realized that there are quite a lot of subtitles from the movie The Big Lebowski in the Swedish dataset. They likely come from Opensubtitles and have been there at least since August 2019.

Examples:
Jag är The Dude.
Jag kan inte skicka en räkning till kinesen som pissade på mattan.

(Some of them are easily identified by the use of “The Dude”, which, apart from the copyright issue, obviously isn’t ideal for capturing Swedish pronunciation, since most Swedish speakers would still use the English pronunciation for it. There are also some offensive sentences in there.)

The sentences have been removed from the Sentence Collector. Thanks for reporting!

Cantonese (yue):
There are at least 200 sentences from https://hkbus.fandom.com/wiki/九巴286C線, but that site is licensed under CC BY-SA 3.0.

Also, the sentences submitted are not Cantonese, but Standard Written Chinese, which is a standardized form of Mandarin.

Thanks for reporting, these sentences have been removed.

Chinese - China (zh-cn):
Many sentences are from https://zh.wikisource.org/wiki/国家监委调查组负责人答记者问, which is a transcript of a Chinese government press conference. It was removed from Wikisource for copyright violation.

Many sentences are also from 《毛泽东选集》, which is the Selected Works of Mao Zedong. These are not in the public domain.

Thanks for reporting, this will be taken care of with the next deployment of the Sentence Collector.

@mkohler @ftyers

There are ~590 sentences waiting in Turkish. I scanned the first 10-20 or so; they are from:
https://tr.wikisource.org/wiki/En_Alttakiler

It is indicated as “public domain in Turkey” at the top, but at the bottom it says CC BY-SA…

Also, many of them are incomplete sentences: just fragments split at any punctuation mark, including commas. So many of them are grammatically incorrect anyway…

PS: I did not review all of them… I accepted a couple, then checked the CC0 status.

Edit: I scanned the other sources in the set; they are mostly poetry with a similar copyright status, such as:
https://tr.wikisource.org/wiki/Takatım_Tak_Oldu_Bican_Olmuşum
https://tr.wikisource.org/wiki/Çocuklara
https://tr.wikisource.org/wiki/Bir_Roman_Kahramanı
https://tr.wikisource.org/wiki/Sayfa:Halk_Edebiyatı_Antolojisi.pdf/238

Hey there! The CC-BY-SA at the bottom is the copyright of the MediaWiki interface. But in any case it looks like a recently published, copyrighted book, so not public domain / CC-0. If you think the sentences are worthwhile, get in contact with the authors; if not, then I’d say we should just delete them. Could you file an issue on the Sentence Collector?

I wonder if there is a way to find out who uploaded them to help them work more effectively?

No, this is exactly what this thread here is for. I’ll take care of this.
