Is the United Nations Parallel Corpus clear of license issues?

I noticed that the United Nations corpus mentioned in this post is in the public domain, as stated on the first page of its website. I am not a copyright expert. May I know whether this corpus is clear of license issues?

There are millions of sentences in the corpus. My community would be extremely happy to have it imported into the sentence repository.


I can check with our legal team in our next meeting, but on their terms of use I read:

When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information

From past inquiries, my understanding is that an attribution requirement is not compatible with the CC0 public domain dedication.

I’ll let you know, cheers.

Thank you very much.

The EU Parliament corpus has a similar statement on its front page: “Please cite the paper, if you use this corpus in your work”. Since that corpus is fine for CV, I hope the UN one is fine too.

The corpus is obviously not public domain. The result of cleaning and sorting sentences from PD documents into a corpus can absolutely be licensed. You can do whatever you want with PD materials, including licensing the work you derive from them.

We should look for the PD documents that the corpus was originally scraped from.

Coming back after consultation with our legal team.

No, unfortunately we can’t use this corpus. Anything with an attribution requirement is not consistent with CC0 (this is why CC-BY can’t go into Common Voice).

As for whether this corpus should be usable because it is similar to the EU corpus:

There’s an important difference. The EU corpus comes with a “please” give attribution request - this one comes with a “must” give attribution requirement.

Thanks, @nukeador. Time to continue collecting sentences :cry:

I brought up the idea of using this corpus, and I still believe we can use it. Here is why I think so.

UN documents are public domain; we can all agree on that.

With respect to copyright and database rights, the corpus as a whole is protected by both. There is copyright in the parallel alignment, since that is work protected by copyright, but we don’t use it as a parallel corpus. The mere job of OCRing UN documents can’t create additional copyright, as it doesn’t meet the threshold of originality (it is mechanically produced work with some proofreading, and proofreading alone can’t create additional copyright). As for the sui generis database right that applies in the EU, our use of the corpus is fine under it as well. The database right on this corpus could be invoked only if we reproduced its parallel properties; a mere collection of UN documents doesn’t create an additional database right, as it is just a reproduction of the UN’s own database (for a database right to arise, there need to be independent database sources).

Even setting all of this aside, which should already be enough to conclude that our use of the corpus complies with PD requirements (we use only the PD UN documents, not a derivative work of them, since such a work lacks the required threshold of originality), the terms on the corpus website seem pretty permissive. There is not even a requirement to provide a copy of it or to provide attribution to the UN (which we do anyway through the source field, and which is not required since it’s PD):

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

  • The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
  • Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user’s sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user’s sole and exclusive remedy is to discontinue using the United Nations Corpus.
  • When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
  • Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

So there is absolutely no reason why we can’t use it.

@pdsfjd our legal team reviewed the terms on the UN site, and as I commented previously:

Unless this is changed at the source of this corpus (the UN site), we won’t be able to use it; we prefer not to incur any potential legal risk.

Thanks for your understanding.