Using OSCAR corpus sentences

I wonder if we can import sentences from the OSCAR project into Common Voice. The content is crawled from the internet, but it’s packaged and released under CC0.

I experimented with this corpus a while ago. It is hard to filter sentences or add line breaks after full stops because the files are all huge. But it is possible to use it and you can get many good sentences out of it.

I am not sure whether using it could lead to copyright problems. One thought I had was to use only sentences shorter than four words or so. It is very hard to claim copyright on such short sentences, and the dataset often lacks short and easy sentences right now.
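
For anyone who wants to try the same thing, here is a minimal sketch of how the splitting and filtering could be done without loading a huge file into memory. It is not the script I used back then. The function names, the `max_words=4` limit and the naive "split after . ! ?" rule are all assumptions for illustration, and the rule will wrongly split after abbreviations.

```python
import re
from typing import Iterable, Iterator, Optional

# Naive sentence boundary (assumption): whitespace that follows ., ! or ?
# This also splits after abbreviations such as "e.g.", so review the output.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(lines: Iterable[str],
                   max_words: Optional[int] = None) -> Iterator[str]:
    """Yield sentences one by one from an iterable of text lines.

    Lines are processed lazily, so an open file handle can be passed in
    and the file is never read into memory as a whole.
    """
    for line in lines:
        for sentence in _SENTENCE_END.split(line.strip()):
            sentence = sentence.strip()
            if not sentence:
                continue
            # Optional length filter: drop sentences with too many words.
            if max_words is not None and len(sentence.split()) > max_words:
                continue
            yield sentence


def write_sentences(src_path: str, dst_path: str,
                    max_words: Optional[int] = 4) -> int:
    """Write one sentence per line to dst_path and return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for sentence in iter_sentences(src, max_words=max_words):
            dst.write(sentence + "\n")
            kept += 1
    return kept
```

Because the input is read line by line, memory use stays small as long as the individual lines in the corpus file are of reasonable length. The output would still need manual review before anything goes near the sentence collector.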

Thanks for the pointer - didn’t know of OSCAR until now. Haven’t checked out the HUMONGOUS (!) files either… yet.

But regarding the initial CC0 question: I would be rather careful! The dataset itself might be public domain, but its individual pieces of content might not be! If there is no clear and transparent documentation of the crawling process, I would rather assume that it is illegal in most jurisdictions (at least in Europe).

Crawling a website without permission is not allowed, afaik, if you use the contents for your own works and not for display/browsing only. But I’m not a lawyer or legal expert and copyright and IP law is one big minefield - in most countries…

Unfortunately we can’t use OSCAR, as you pointed out:

When we reviewed it with our legal team, we were advised not to use it because of this legal risk.

Ok, I understand. What a pity :frowning:

Could you ask the legal team whether short sentences, say four words or fewer, could be a possibility? Almost every copyright law in the world has exceptions for small passages.

Right now we mainly have long and complicated sentences, so mixing short sentences could benefit the dataset. Short sentences can be reviewed quickly in the sentence collector if you don’t want to risk automated imports.

Let me check that again with our legal team, thanks!

OK, so I heard back from our legal team, and they advise against it even for short sentences. It still introduces some risks we don’t feel comfortable with, so it’s better to avoid it.

Cheers.
