Sentence collector copyright issues

Adrijaned · January 26, 2020, 6:12pm

This thread is for reporting copyright issues with sentences submitted to the sentence collector tool. If you find sentences that come from a source licensed under anything other than CC0 (public domain works), please report them in a reply here.

When reporting, please supply the “source” text of the specific sentences. It can be found when reviewing sentences, right below each individual sentence in a gray font.
Optionally, you can also provide a link to the actual source the sentences were copied from.

If you have reason to suspect that sentences without copyright problems are also present under the same “source” text, let us know and we will try to come up with something. Otherwise, that is all you need to do to have the issue resolved.

Note: 2020-11-03 post updated to reflect changes to sentence collector. Original version is below.

When reporting, please supply at least:

Name of the person who submitted the sentences.
What was submitted as “source” for the sentences.

Optionally, you can also submit a link to the actual text the sentences were copied from.

How do I find the required information?

Currently, the easiest approach is to visit the following URL:
https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_<languageCode>/records, replacing <languageCode> with the two-letter code of the language the sentences were submitted to. For example, https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_en/records or https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_cs/records.

At that address, you should be presented with the JSON data of the sentences in the collection tool. Search it for one of the sentences you suspect was submitted in violation of our copyright requirements. What you are interested in are the author (username) and source fields of that sentence.

For example, in Firefox, when you are in the JSON view (selected using the bar at the top; it should also be the default after the page loads), expand the data array by clicking the little triangle in front of it. Then type as much of the sentence as you can remember into the filter field just below the tabs at the top of the page. Once you have typed a long enough part of the sentence, you should see just one number below “data”. Remember that number, then clear the filter box. Scroll down until you find the number you remembered, click the little triangle next to it, and copy here what you find on the lines following the words username and source.
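If you would rather not click through the JSON viewer, the same lookup can be sketched in a few lines of Python. This is only a rough sketch under assumptions: it assumes each record in the data array carries its text in a field called sentence (the field name is a guess; only username and source are named above), and the function names are made up for illustration.

```python
import json
import urllib.request

# URL pattern taken from the post above.
BASE_URL = "https://kinto.mozvoice.org/v1/buckets/App/collections"


def records_url(language_code):
    """Build the records URL for one language code, e.g. "en" or "cs"."""
    return f"{BASE_URL}/Sentences_Meta_{language_code}/records"


def fetch_records(language_code):
    """Download the collection and return the list stored under "data"."""
    with urllib.request.urlopen(records_url(language_code)) as response:
        return json.load(response)["data"]


def find_by_fragment(records, fragment):
    """Return (username, source) for each record whose text contains fragment.

    ASSUMPTION: the sentence text lives in a "sentence" field.
    """
    return [
        (record.get("username"), record.get("source"))
        for record in records
        if fragment in record.get("sentence", "")
    ]
```

Calling find_by_fragment(fetch_records("cs"), "part of the sentence you remember") would then give the username and source values to paste into your report, provided the endpoint is reachable for you.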

If for any reason you are unable to do all that, you can also just copy and paste a few of the sentences you suspect break our copyright policy here, and we will manage.

jakub.wrobel7 · February 7, 2020, 9:25pm

Sources used in the Polish collection that do not fall into the CC0 category:

“source”: “From the book.”, “username”: “narid” – seems to actually be the book “Biały Kieł/White Fang”, not CC0 AFAIK. The first Polish translation was done by Anna Trzeciakowska https://pl.wikipedia.org/wiki/Anna_Trzeciakowska and was published in 1926. Ms Anna seems to still be in good health.
“source”: “https://www.gutenberg.org/files/6000/6000-h/6000-h.htm Project Gutenberg version of “Ironia Pozorow” by Maciej hr. Lubienski”, “username”: “hellbunnie” - the license here is somewhat about freedom but probably not CC0-equivalent; the author is still alive, as mentioned by @Adrijaned
“source”: “https://pl.wikipedia.org/wiki/Zachęta_Narodowa_Galeria_Sztuki”, “username”: “michalstepien” - sentences originating from Wikipedia

mkohler · February 8, 2020, 6:15pm

This is taken care of.

G12r · March 17, 2020, 9:35pm

These Georgian sentences are not in the public domain:

“username”: “rigormortis”, “source”: “https://ka.wikibooks.org*”.
“username”: “Geor”, “source”: “Own work” – movie scripts, without a CC0 license.
“username”: “rigormortis”, “source”: “https://ka.wikiquote.org*”.

Also, please remove the approved sentences with “invalid” flags. Most of them have typos.

mkohler · March 17, 2020, 11:10pm

Can you elaborate a bit more here? I’m a bit hesitant to just remove anything that ever got one invalid vote.

G12r · March 17, 2020, 11:54pm

Then just remove those marked as invalid by Razmik, he found many mistakes.

Thanks!

mkohler · March 18, 2020, 7:56pm

Thanks

This is taken care of.

Sobsz · May 26, 2020, 5:05pm

The Polish review tab is currently filled with segments from The Lord of the Rings, which is very much not public domain. They didn’t even bother slicing it into sentences… The username is narid and the source is “From the book.” (again!).

mkohler · May 27, 2020, 7:47am

Thanks for reporting this. These have been removed.

sinumade · September 25, 2020, 4:47pm

The Japanese language collector has the following problems:

“username”: “navta”, “source”: “http://www.edrdg.org/wiki/index.php/Tanaka_Corpus”
1. “sentence”: “あきらめたら、そこで試合終了ですよ。”
  - From SLAM DUNK.
  - Ref: あきらめたら、そこで試合終了ですよ。 - Google Search
2. “sentence”: “我が生涯に一片の悔いなし。”
  - From 北斗の拳.
  - Ref: 我が生涯に一片の悔いなし。 - Google Search
3. “sentence”: “僕は新世界の神となる。”
  - From DEATH NOTE.
  - Ref: 僕は新世界の神となる。 - Google Search
4. “sentence”: “あんたらの名前なんか興味ないね。どうせこの仕事が終わるとお別れだ。”
  - From ファイナルファンタジーVII.
  - Ref: あんたたちの名前なんか興味ないね。 - Google Search
  - There are a few changes.

Perhaps this is a problem with the corpus.

I went to the source page and checked the "Public Domain version", and it contains the above text. These sources are famous cartoons and games, and they are obviously not in the public domain. The "Public Domain version" file has a [Manga] flag, but some of these sentences are not flagged. Honestly, I can't determine how much of the offending text is in the mix.

mkohler · September 26, 2020, 1:12pm

@sinumade thanks for reporting. I’m not a lawyer, so I can’t really answer that. @mbranson @jscowcroft any advice here?

pdsfjd · September 29, 2020, 10:34pm

The ru collection has some scripts parsed from opensubtitles.org. It’s clearly marked in the source, so it should be easy to parse.

mbranson · September 29, 2020, 10:54pm

Thanks for flagging, we’ll take a look on this end and get back to you here. cc @

mbranson · October 7, 2020, 4:57pm

@sinumade thanks again for flagging this. After review by our Mozilla Legal counterparts, it’s been determined that this corpus is not fit for CC0 contribution and all usage should be removed from Common Voice.

@mkohler will work to remove this from the sentence collector, and any sentences that were merged to the primary platform for voice contribution will also be removed from the dataset. cc @phirework who’ll need to take that action.

sinumade · November 6, 2020, 6:32pm

Japanese language collection, again.

“username”: “navta”
- “source”: “http://d.hatena.ne.jp/satoru_net/20151030/1446184756”
  - This source is an unauthorized reproduction of ATR 503 sentences.
  - Corpus Name: ATR 503 sentences (Japanese name: ATR音素バランス503文)
  - Original is paid for: ATRデジタル音声データベース｜ATR音声言語データベース｜ATR-Promotions
- “source”: “https://github.com/voice-statistics/voice-statistics.github.com/blob/master/assets/doc/balance_sentences.txt”
  - Creator: 日本声優統計学会
  - Corpus Name: 音素バランス文
  - License: CC BY-SA 4.0; the creator states that it is distributed under the CC-BY-SA license.
- “source”: “https://github.com/matbahasa/TALPCo/blob/master/data_jpn.txt”
  - Current: https://github.com/matbahasa/TALPCo/blob/master/jpn/data_jpn.txt
  - Corpus Name: TUFS Asian Language Parallel Corpus
  - License: CC BY 4.0; the corpus states that TALPCo is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Are the reviewers Rrock9312 and Rrock2139 the same person?

I checked common-voice/sentence-collector.json. Probably the current Japanese source text isn't all in the public domain. That's a shame.

If the current 5,528 sentences are deleted, will it no longer be possible to record voices? (i.e., because fewer than 5,000 sentences remain)
If the full text is not available, will it also be removed from the dataset, meaning the dataset cannot be provided? (So we can't use the already recorded voices?)
Perhaps the current Sentence Collector has the above sources and the text I've added, but are there any other texts? I'd like to check if there are any... I don't trust the old collection. (Of course, I want a third party to verify the source I used too.)

We may need to verify the source provided by navta. This person participates in other languages as well (at least in the English collection).
Seriously, we should ask for volunteers to verify all language sources (I really want the Mozilla staff to verify it, but they'll need time to do so). There are too many resources to lose, including the voice, the dataset itself, etc.
- Go to the source and check the license. Randomly extract text from the source and search the web for it (to check that the source is not an unauthorized reproduction).
- However, it's useless unless it's verified by a trusted person. Just pressing the confirm button is not proof.
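
The random spot check described in the list above could be sketched roughly like this in Python. It is only an illustration under assumptions: the function names are hypothetical, the sample size of 5 is an arbitrary default, and the search step itself is left to a person, who pastes each quoted phrase into a search engine.

```python
import random


def pick_spot_check_sample(sentences, sample_size=5, seed=None):
    """Pick up to sample_size distinct sentences at random for manual checking."""
    rng = random.Random(seed)
    # Deduplicate and sort first so a given seed always yields the same sample.
    unique = sorted(set(sentences))
    return rng.sample(unique, min(sample_size, len(unique)))


def exact_phrase_query(sentence):
    """Wrap a sentence in double quotes for an exact-phrase web search."""
    return f'"{sentence}"'
```

A trusted reviewer would still have to read the search results and the license of the source; the sketch only chooses what to check.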
commonvoice.mozilla.org should mention this danger. They are now reading text that is not in the public domain. Users of the dataset are using data that they can't use.

This is insane, and a betrayal to the people.

Adrijaned · November 7, 2020, 2:28pm

Ping @mkohler for the mentioned copyright violations
Both Rrock users look suspicious, and there is also some asdf user? Could you check whether any of them have ever marked a sentence as invalid? If they have none, I’d say those are most definitely invalid reviews and should be deleted (or the Japanese submissions are without any mistakes, but I have a hard time believing that).
If fewer than 5k sentences remain, I’m afraid the Japanese collection will have to stop once again (but either way, 5k is awfully few to expect any great amount of useful collecting to take place).
If something has already been recorded, what to do then is a question for people more qualified than me, but it could unfortunately end in a lot of bad ways.
A second party checking the sources would be great, but reviews now at least show the sentence sources, which is still better than nothing.
Mozilla staff can’t viably verify all of the sources for all of the languages, if only because they don’t speak them. The best you can do is defer to the local communities.

Adrijaned · November 10, 2020, 10:16pm

@mkohler Hindi dataset, https://tatoeba.org/eng as source: five random sentences with this source were checked, and all had an incompatible licence (CC-BY 2.0 FR).

Adrijaned · November 11, 2020, 8:54pm

Both the Japanese and Hindi issues have been resolved.

Slavik · December 14, 2020, 11:03am

Sorry, I can’t access the ‘kinto’ URL, so I can’t provide this in the requested format.
The Ukrainian sentences contain a lot of text from a single book, and they are of very bad quality.

Source: cetation of Володимир Білінський
The book is
“Білінський Володимир Броніславович. КРАЇНА МОКСЕЛЬ, або МОСКОВІЯ”
A very “holy-war” and politicized one.

Sentence:
“Ростовсько-суздальська земля до приходу Рюриковичів зі своїми ватагами давно була заселена фінськими племенами”
Link

Sentence:
“Початковий київський літопис доволі точно позначає місця проживання цих племен: він знає”
Link

Sentence:
“Здавалося б, за церковними канонами, «царство Боже й царство кесареве мусять залишатися навіки розділеними»”
Link

Sentence:
“Карамзін був палким великоросом на службі в імперії й Государя”
Link

Link to Google Books (not sure if it’s a legal copy)
Google Book

Slavik · December 14, 2020, 11:12am

And yet another one. It contains a lot of old/rare words. I don’t even know the meaning of some of them.

Book “Меч Арея” by Іван Білик. Link to online copy here

Sentence:
“І не питаєш, чого ради смо прийшли?”
Link

Sentence:
“Пощо тобі князь? — жваво всміхнувся отрок у сорочці.”
Link
Link to source

Sentence:
“Богдан пропустив його слова повз вуха, й усі троє посідали на коней.”
Link

Also, there are a lot of pending sentences from this book. All marked as
“Source: cetation of Іван Білик”. It would be nice to clear them automatically.

These two books kill all the pleasure of the project. Unfortunately, 80% of the sentences now come from them.
Could you resolve it somehow, please?
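
For what it’s worth, clearing by source marker could look roughly like the Python sketch below. It is only a sketch under assumptions: the records are assumed to be a list of dictionaries with a source field, as in the reports earlier in this thread, and the function names are hypothetical, not part of the sentence collector.

```python
from collections import Counter


def source_shares(records):
    """Return the fraction of records contributed by each distinct source."""
    counts = Counter(record.get("source", "") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: count / total for source, count in counts.items()}


def select_by_source(records, marker):
    """Return the records whose source text contains the given marker."""
    return [record for record in records if marker in record.get("source", "")]
```

Running select_by_source over the pending sentences with the marker text quoted above would list the candidates for removal, and source_shares gives a quick check of how much of the collection comes from each source.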
