Would it be possible to set up a central place to report copyright issues with sentences uploaded to the sentence collector tool? These issues are quite common, but so far each one has had to be solved individually.

@mkohler the latest import I found problematic is in the Czech dataset, from user “comodoro”, source “aktualne.cz daily news” (https://zpravy.aktualne.cz/zahranici/tepla-morska-vlna-zabila-milion-ptaku/r~f9534962391911ea9b40ac1f6b220ee8/)

Feel free to open an issue in the sentence collector repo, but I don’t think we can do anything about that right now. For example, I don’t think that simply reporting per sentence would work out, so it would possibly still be on a per-submission basis. But then it’s not a given that all sentences from a submission are copyrighted. Of course this can be brainstormed, but I would not do anything here in the near future, given that most other issues need a backend as well.

Will have a look tomorrow.

… actually, let’s brainstorm here and then file an issue once there is a possible way that would work out nicely. :slight_smile:

As for the central place, I was thinking of something like a single permanent GitHub issue or Discourse thread, linked from the sidebar of the sentence collector website, to which reports could be added as comments, to keep them in one central-ish place.

Oh, I see. Yes, that sounds reasonable. Wanna start a new thread and I’ll link it?

Do we want a GitHub issue or a Discourse thread though? Discourse should have a lower entry barrier as far as I know. Or maybe use both places and link them to each other?

Perhaps there could be a blacklist of domains we definitely know are not CC0 and alert the user if they put that domain as the source?

That would be a LOT of domains, plus users tend to write whatever they like in the source field (I usually write something like “PD_old_70, , taken from wikisources”, or “Own work”).

Yeah, I meant a list based on sources that were already submitted and rejected.
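
A minimal sketch of how such a check could look, assuming the source field stays free text. The domain list, the function names and the placeholder domain `news.example.org` are all hypothetical and not part of the sentence collector; the idea is only to scan the text for anything that looks like a host name and compare it with domains from submissions that were already rejected.

```python
from urllib.parse import urlparse

# Hypothetical list: in practice it would be built from sources that were
# already submitted and rejected. "news.example.org" is only a placeholder.
REJECTED_DOMAINS = {"news.example.org"}

# Punctuation that users often leave around a URL in free text.
_STRIP_CHARS = "()[]<>“”\"',.;:"


def candidate_hosts(source_text):
    """Yield every host-like token found in a free-text source field."""
    for token in source_text.split():
        token = token.strip(_STRIP_CHARS)
        if not token:
            continue
        # urlparse only fills in the host if there is a scheme or a leading "//".
        candidate = token if "://" in token else "//" + token
        try:
            host = urlparse(candidate).hostname
        except ValueError:
            continue  # malformed token, e.g. unbalanced brackets
        if host and "." in host:
            yield host.lower()


def is_rejected_source(source_text):
    """Return True if any host in the text is a rejected domain or a subdomain of one."""
    return any(
        host == domain or host.endswith("." + domain)
        for host in candidate_hosts(source_text)
        for domain in REJECTED_DOMAINS
    )
```

With this, `is_rejected_source("https://daily.news.example.org/some-article")` would trigger the alert, while “Own work” or “PD_old_70, , taken from wikisources” would pass through silently, which is exactly the limitation of a free-text field.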

The problem is that the source field is just a text box and there’s no standardized way for people to represent a source. But it could work if there were a dropdown or radio buttons next to it, with options such as:

Source URL, I Wrote This, Other.
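
Roughly what I have in mind, as a sketch only: the three options become a source type stored next to the text, and the text is validated depending on the type. The class and function names here are made up for illustration and do not exist in the sentence collector.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class SourceType(Enum):
    URL = "Source URL"
    OWN_WORK = "I Wrote This"
    OTHER = "Other"


@dataclass
class Source:
    kind: SourceType
    detail: str = ""  # the URL, or a free-text description for "Other"


def validate_source(source):
    """Return a list of problems; an empty list means the source looks acceptable."""
    problems = []
    detail = source.detail.strip()
    if source.kind is SourceType.URL:
        try:
            parsed = urlparse(detail)
            valid = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        except ValueError:
            valid = False
        if not valid:
            problems.append("Please enter a full http(s) URL.")
    elif source.kind is SourceType.OTHER and not detail:
        problems.append("Please describe where the sentences come from.")
    # OWN_WORK needs no further detail.
    return problems
```

Once the type is “Source URL”, the text is guaranteed to be a real URL, so a domain check against rejected sources becomes straightforward.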

I’d vote for Discourse.

This is taken care of.

The new thread (or another one) could also be a good place to document permissions to use non-public-domain work for the project to avoid future misunderstandings, wouldn’t it?

I got one last week for the Esperanto web magazine liberafolio.org from one of its founders, Kalle Kniivilä. The magazine has existed since 2003, and I am importing the sentences in chunks, each with the articles from one year. Each chunk contains around 800 sentences, so I expect that this can add around 10 000 new high-quality sentences to the dataset. The site itself is licensed under CC BY 4.0, but it is okay for them if we publish unconnected sentences under CC0. I got the permission by mail and archived the mail. Whenever I import sentences from this source, I add a note that I got permission.

Should we have a different place to collect such permissions or do you agree that this fits well into the “copyright issues” thread? Or is it enough to just write it into the copyright field in the sentence-collector?

