I notice the United Nations corpus mentioned in this post is public domain, as stated on the first page of the website. I am not a copyright expert. May I know whether this corpus is clear of license issues?
There are millions of sentences in the corpus. My community would be extremely happy to have it imported into the sentence repository.
The EU Parliament corpus has a similar statement on its front page: "Please cite the paper, if you use this corpus in your work". As that corpus is fine for CV, I hope the UN one is also fine.
The corpus is obviously not public domain. The result of cleaning and sorting the sentences from PD documents into a corpus can absolutely be licensed. You can do whatever you want with PD materials, including licensing the work you make out of them.
We should look for the PD documents that the corpus was originally scraped from.
Coming back after consultation with our legal team.
No, unfortunately we can't use this corpus. Anything with an attribution requirement is not consistent with CC0 (this is why CC-BY can't go into Common Voice).
As for whether this corpus should be usable because it is similar to the EU corpus:
There's an important difference. The EU corpus comes with a "please" give attribution request - this one comes with a "must" give attribution requirement.
I brought up the idea of using this corpus, and I still believe we can use it. Here is why I think that is true.
UN documents are public domain; we can all agree on that.
With respect to copyright and database rights, the corpus as a whole is protected by both. There is copyright in marking it up as a parallel corpus, which is work protected by copyright, but we don't use it as a parallel corpus. The mere job of OCRing UN documents can't create additional copyright, as it doesn't meet the threshold of originality (e.g. it's a mechanically produced work with some proofreading, but proofreading alone can't create additional copyright). As for the sui generis database right that applies in the EU, our use of the corpus is fine under it. The database right in this corpus can be invoked only if we reproduce its parallel properties, as a mere collection of UN documents doesn't create additional database rights: it is just a reproduction of the UN database (e.g. database rights require independent database sources).
Even if all of this is not enough for us to conclude that our usage of the corpus complies with PD requirements (we just use PD UN documents, not a derivative work of them, since such a work doesn't meet the required threshold of originality), the terms on the corpus website seem pretty permissive. There is not even a requirement to provide a copy of it or to provide attribution to the UN (which we do with the source field, and which is not required anyway as it's PD):
The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):
The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.
So there is absolutely no reason why we can't use it.