Sentence collector copyright issues

This thread is for reporting copyright issues with sentences submitted to the sentence collector tool. If you find sentences that came from a source licensed under anything other than CC0 (public domain works), please report them in a reply here.

When reporting, please supply at least:

  • Name of the person who submitted the sentences.
  • What was submitted as “source” for the sentences.

Optionally, you can also submit a link to the actual text the sentences were copied from.

How do I find the required information?

Currently, the easiest approach is to visit the following URL, replacing <languageCode> with the two-letter code of the language the sentences were submitted to:
https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_<languageCode>/records

For example, https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_en/records or https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_cs/records. At that address, you should be presented with JSON data for the sentences in the collection tool. Search it for one of the sentences you suspect was submitted against our copyright requirements; what you need are the author (username) and source fields of that sentence.

For example, in Firefox, when you are in the JSON view (selected using the tabs at the top; it should also be the default after loading the page), expand the data array by clicking on the little triangle in front of it. Then type a long enough part of the sentence into the filter field just below the tabs at the top of the page. Once you have typed enough of the sentence, you should see just one number below “data”. Remember that number, then delete everything in the filter box again. Scroll down until you find the number you remembered, click on the little triangle next to it, and copy here what you find on the lines following the words username and source.
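If you would rather not click through the JSON viewer, the same lookup can be sketched as a small Python script. This is only a rough sketch, not an official tool: it assumes the endpoint returns an object with a “data” array whose records carry username and source fields, as described above. The function names are made up for illustration, and because the name of the field holding the sentence text is not stated here, the search simply checks every other text field of each record.

```python
import json
import urllib.request

# URL pattern from the post above; the language code is filled in per collection.
BASE_URL = (
    "https://kinto.mozvoice.org/v1/buckets/App/collections/"
    "Sentences_Meta_{lang}/records"
)


def fetch_records(language_code):
    """Download one language's records and return the list under "data".

    Assumption: the response is a JSON object with a "data" array.
    """
    url = BASE_URL.format(lang=language_code)
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload.get("data", [])


def find_submissions(records, fragment):
    """Return (username, source) pairs for records containing the fragment.

    The fragment is matched case-insensitively against every string value
    of a record except "username" and "source", so the name of the field
    that holds the sentence text does not need to be known.
    """
    needle = fragment.casefold()
    matches = []
    for record in records:
        texts = [
            value
            for key, value in record.items()
            if key not in ("username", "source") and isinstance(value, str)
        ]
        if any(needle in text.casefold() for text in texts):
            matches.append((record.get("username"), record.get("source")))
    return matches
```

To use it, call `find_submissions(fetch_records("cs"), "part of the suspicious sentence")` and paste the resulting username and source values into your reply.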

If you are for any reason unable to do all that, you can also just copy & paste a few of the sentences you suspect break our copyright policy in here and we will also manage :slight_smile:

Sources used in the Polish collection which do not fall into the CC0 category:

This is taken care of.

These Georgian sentences are not under the public domain:

  • “username”: “rigormortis”, “source”: “https://ka.wikibooks.org*”.
  • “username”: “Geor”, “source”: “Own work” – movie scripts, without a CC0 license.
  • “username”: “rigormortis”, “source”: “https://ka.wikiquote.org*”.

Also, please remove the approved sentences with “invalid” flags. Most of them have typos.

Can you elaborate a bit more here? I’m a bit hesitant to just remove anything that ever got one invalid vote.

Then just remove those marked as invalid by Razmik; he found many mistakes.

Thanks!

Thanks

This is taken care of.

The Polish review tab is currently filled with segments from The Lord of the Rings, which is very much not public domain. The submitter didn’t even bother slicing it into sentences… Username is narid, source is from the book (again!).

Thanks for reporting this. These have been removed.

The Japanese language collector has the following problems:

Perhaps this is a problem with the corpus.

I went to the source page and checked the "Public Domain version" and it contains the above text. These sources are famous cartoons and games, and they are obviously not in the public domain. The "Public Domain version" file has a [Manga] flag, but some of the sentences are not. Honestly, I can't determine how much of the offending text is in the mix.

@sinumade thanks for reporting. I’m not a lawyer, so I can’t really answer that. @mbranson @jscowcroft any advice here?

The ru collection has some scripts parsed from opensubtitles.org. It’s clearly marked in the source field, so they should be easy to find.
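Something along these lines could pull them out. It is just a sketch, assuming the records are the parsed JSON objects with a source field as described at the top of the thread; the function name is made up for illustration.

```python
def records_from_source(records, marker):
    """Return the records whose "source" value contains the marker text.

    The comparison is case-insensitive; records with a missing or empty
    source are skipped.
    """
    needle = marker.casefold()
    return [
        record
        for record in records
        if needle in str(record.get("source") or "").casefold()
    ]
```

Calling `records_from_source(records, "opensubtitles.org")` on the ru records would list every sentence whose source mentions that site.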

Thanks for flagging, we’ll take a look on this end and get back to you here. cc @

@sinumade thanks again for flagging this. After review by our Mozilla Legal counterparts, it’s been determined that this corpus is not fit for CC0 contribution and all usage should be removed from Common Voice.

@mkohler will work to remove this from the sentence collector, and any sentences that were merged to the primary platform for voice contribution will also be removed from the dataset. cc @phire who’ll need to take that action.