Using the Europarl dataset: sentences from speeches in the European Parliament

EDIT: Since this thread got bigger than expected, here is a summary of the project, which I will update from time to time. You can still find the old first post below.

This thread is about using a speech corpus from the EU Parliament. The dataset contains transcribed speeches from 1996 to 2011 and is available in 21 European languages. You can read about the details here: http://www.statmt.org/europarl/

To import the dataset you will have to filter the sentences and then run a review process to ensure the quality of the data. Here is an overview of the languages that have already done this and how they did it:

| Language | Merged | Number of sentences | Details |
|----------|--------|---------------------|---------|
| German   | Yes    | ±370,000            | Pull Request German, my collection of shell commands to filter sentences and the spreadsheets for the QA process |
| Danish   | No     | 5,136               | Danish pull request and the scripts for the Danish sentence extraction |
| Dutch    | Yes    | ±259,000            | Dutch pull request and the Dutch QA spreadsheet |
| Czech    | Yes    | 98,908              | Czech pull request and the QA spreadsheet |
| Polish   | Yes    | 205,024             | Polish pull request |

If you want to filter a new language you should have a look at the CV Sentence Extractor and see if it already supports your language.

For the QA process, @nukeador sent these rules to me:

Manual review

After the automatic process most major issues should be solved, but there are always a few that can’t be fully automated, which is why we need to ensure the following on the filtered output from the previous process:

  1. The sentences must be spelled correctly.
  2. The sentences must be grammatically correct.
  3. The sentences must be speakable (also avoiding uncommon non-native words).

Then 2-3 native speakers (ideally some linguists) should review a random sample of sentences. The exact number depends on getting a valid sample size – you can use this tool with a 99% confidence level and a 2% margin of error. So, for 1,000,000 sentences, you would need a sample of about 4,000.
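For reference, the sample size follows from Cochran’s formula with a finite population correction; here is a minimal Python sketch (the linked calculator may round slightly differently):

```python
import math

def sample_size(population: int, z: float = 2.576, margin: float = 0.02, p: float = 0.5) -> int:
    """Cochran's sample size formula with finite population correction.

    z = 2.576 corresponds to a 99% confidence level, margin is the
    2% margin of error, and p = 0.5 is the most conservative guess
    for the error proportion.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n)

print(sample_size(1_000_000))  # 4131 – in the same ballpark as the ~4,000 above
```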

Each reviewer should score each sentence on the dimensions above (points 1, 2 and 3), which will give you an indication of how many are passing the manual review.

You can use this spreadsheet template for a 100-sentence evaluation.

If the average error ratio from these reviewers is below 5-7%, you can consider it good to go.

Please report the following outputs to the Common Voice team:

  • How many valid sentences did you get at the end?
  • What is the error ratio?
  • Who reviewed the sentences, with a link to the review spreadsheet(s).

Original post:


Hey everyone,

today @FremyCompany imported 60k sentences into the Dutch Sentence Collector, using transcribed sentences from speeches in the EU Parliament. The dataset contains speeches from 1996 to 2011 and is available in many European languages. You can read about the details here: http://www.statmt.org/europarl/

You can also read some more details about the Dutch dataset and how it was selected on Slack. The most important part is:

I only selected sentences in that dataset which:

  • were between 5 and 8 words long
  • were not longer than 55 characters
  • started with an uppercase and ended with a dot
  • did not contain parentheses, semicolons or colons
  • did not have any capitalized word except the one starting the sentence

The last restriction is rather strict but it makes the sentences way less topically biased and avoids having to deal with proper names for the most part, which I guess will make the review process smoother.
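For illustration, here is a minimal Python sketch of these filters, assuming a plain-text file with one sentence per line (the file name is hypothetical, and the actual Dutch selection was not necessarily implemented this way):

```python
import re

def keep(sentence: str) -> bool:
    """Apply the filter rules listed above to one candidate sentence."""
    words = sentence.split()
    if not 5 <= len(words) <= 8:                 # between 5 and 8 words long
        return False
    if len(sentence) > 55:                       # not longer than 55 characters
        return False
    if not (sentence[0].isupper() and sentence.endswith(".")):
        return False                             # starts uppercase, ends with a dot
    if re.search(r"[();:]", sentence):           # no parentheses, semicolons or colons
        return False
    if any(w[0].isupper() for w in words[1:]):   # no capitalized word except the first
        return False
    return True

with open("europarl-nl.txt", encoding="utf-8") as f:
    kept = [s for s in (line.strip() for line in f) if s and keep(s)]
```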

Given the desired sentence length (about 5 seconds of speech), I would expect little variation across languages, so you should expect to find between 50k and 100k sentences for each of the 21 languages represented in the set.
The goal seems to be 1M sentences for each language, so this might get you to around 10% of the requirement for any of those languages.
There are however 2M sentences for Dutch, but most of these sentences are way too long. Finding a way to cut them into smaller chunks, or training a language model on this dataset and generating new short sentences from it, might help get all 21 languages across the line.

I experimented with the German sentences, and after some filtering I now have a file containing more than a hundred thousand sentences ready to be used. The number may decline a little when I filter out more unsuitable sentences. But before I put all these sentences into the Sentence Collector, I have some questions:

  • Should we really check these sentences with the standard process? At least two people have to read every sentence, which would occupy the Sentence Collector for quite some time.
  • How many sentences can be put into the sentence collector at once? Do I have to cut them into chunks?
  • Do you guys in general think that using this data is a good idea in this way?

I believe this could greatly increase the diversity of the sentences in the big languages. Right now they all have the same dry style from Wikipedia; with this data we could add more natural sentences from speeches in many languages. It is also a great chance to increase the numbers for some of the smaller European languages that don’t have a big Wikipedia.


In theory the Sentence Collector should be able to handle all of them at once, but you’d need to make sure that your browser tab doesn’t get suspended. I’d suggest doing batches of 10k to get regular feedback.
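For instance, splitting the filtered file into such batches could look like this minimal sketch (the file names are hypothetical):

```python
def batches(items, size=10_000):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

sentences = open("filtered.txt", encoding="utf-8").read().splitlines()
for n, chunk in enumerate(batches(sentences)):
    with open(f"batch-{n:02d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(chunk) + "\n")
```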

Thanks @mkohler, I will keep this in mind and keep working on the file, but I won’t put it in the SC for now.

Some more thoughts on the issue:

  • If someone is planning to use the English data, he or she should keep in mind that Euro English is considered a thing of its own that sometimes sounds slightly off to many native English speakers.
  • It might be better to check the sentences manually, since some MEPs have pretty extreme positions that we might not want to see as a sentence in the system.
  • In the German dataset I can’t simply delete all sentences with capitalized words, because this would delete all sentences containing nouns (which are always capitalized in German). This means that the sentences will contain a lot of names from all European regions (which might not be a bad thing).

EDIT: If someone wants to get an impression, this is what I got so far: https://github.com/stefangrotz/common-voice-files-de-eo/blob/master/Deutsch/europarl-de-v7/europarl-de-wip.txt

I would also suggest shuffling the sentences (`... | shuf`) before uploading, because the Sentence Collector doesn’t seem to do it for the review. This would initially bias the reviews towards sentences starting with “A”; it would be better for the order to be random.

Thanks @FremyCompany for that, do you have feedback on the French part of the dataset? I’m quite interested. I kick-started Common Voice French with French parliament data, and some people were a bit disoriented by the content.

@stergro I’m not sure I understand, do you mean some members of the European Parliament have a problem with this data being public?

Sorry, I didn’t look at the French part of the dataset, only the Dutch part. But the rules I enforced for the sentences (between 5 and 8 words, no capitalized words, etc…) yielded mostly very neutral sentences. That’s why I used those rules in the first place: I can’t promise they will never let a topic-driven sentence through, but I assumed they would filter out most sentences that express strong opinions.

Here is a random sample of a few sentences and their translations:

Ter afsluiting heb ik twee punten.
- To conclude, I have two points.

Dat voor wat betreft de gelijke kansen.
- So much for equal opportunities.

We hopen op dit punt vorderingen te maken.
- We hope to make some progress on that point.

Er is bijgevolg geen enkele reden tot ongerustheid.
- There is therefore no reason at all for concern.

U bent de tegenovergestelde weg ingeslagen.
- You went in the opposite direction.

Ik wil nog één punt maken.
- I will state one more thing.

Ik wil de commissaris de volgende vragen stellen.
- I will ask the commissioner the following questions.

Ik blijf wel mijn twijfels houden.
- I still have my doubts.

Ik heb daar niet aan getwijfeld.
- I never doubted that.

In sommige opzichten is dit duidelijk van belang.
- In some respects, this is clearly important.

Dit als suggestie voor de commissaris.
- That, as a suggestion for the commissioner.

Naar mijn mening is het absoluut vreselijk.
- In my opinion, this is absolutely terrible.

We mogen daar niet langer lichtvaardig mee omgaan.
- We must no longer take this lightly.

Er wacht de fungerend voorzitter een enorme uitdaging.
- A daunting challenge is awaiting the acting president.

Wij hebben voorts het tijdsplan herzien.
- We have furthermore revised the timetable.

Het is een moeilijke en gecompliceerde zaak.
- This is a difficult and complex task.

De inhoudelijke redenen liggen trouwens voor de hand.
- The substantive reasons are rather obvious.

Dit hebben we bereikt in onze bilaterale betrekkingen.
- We achieved this in our bilateral relations.

That’s fine, I was curious because I wanted to make use of the dataset myself for French Common Voice, but never had the time 🙂

If you end up having 100K sentences or more, please let us know. I feel this source can be treated like Wikipedia, since we can assume the source is already reviewed.

No, I was talking about the things they say in their speeches. I am not sure how much of a problem this really is in this dataset, but I know that a few MEPs used words like “scum” for certain groups in their speeches.

Good point, I will do that. I will also filter out sentences with letters that are not part of the German alphabet; this removes many words that are hard to pronounce.

IMO the sentences are equal in quality to the sentences from Wikipedia, but they can’t be used without preprocessing. For example, in the German dataset many sentences start with a few letters indicating the original language (like EN: blabla). One should delete things like this first, but after that it looks fine.
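A minimal Python sketch of that kind of preprocessing, combining the language-tag cleanup with the German-alphabet filter mentioned above (the exact tag format in the dump may differ, so treat the regexes as assumptions):

```python
import re

# Leading language tags such as "EN: " – the exact format in the dump may vary.
LANG_TAG = re.compile(r"^\s*\(?[A-Z]{2}\)?[:.]?\s+")

# Whitelist of characters for German sentences (letters, umlauts, ß, digits,
# spaces and basic punctuation).
GERMAN_CHARS = re.compile(r"^[A-Za-zÄÖÜäöüß0-9 ,.\-'’?!]+$")

def preprocess(sentence: str) -> str | None:
    """Strip a language tag and drop sentences with foreign characters."""
    sentence = LANG_TAG.sub("", sentence).strip()
    return sentence if GERMAN_CHARS.match(sentence) else None
```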

Right. I know that in the past the consensus was to remove that kind of potentially offensive language. Now we have tooling in place to report sentences, and I think we also have ways for people to say “I don’t want to see offensive language”.

Maybe @nukeador can complete my answer?

What’s the estimated percentage of problematic sentences?

Again, if we have a lot of sentences, we can run a QA process on them to understand these percentages, as we did for the Wikipedia extraction.


I’d be interested to take a look at the English side of things. But there doesn’t appear to be a standalone English version. Is the English content the same in all of the packages, so I could just choose one at random and extract the English translations? Or would I need to extract from all languages?

I am not sure. I compared the English file from the Dutch and the German collections; the beginnings of these files look identical, but they don’t have the same size – the Dutch one is bigger (307 MB vs 297 MB).

Edit: it looks like the biggest file is the fr-en collection, but the English file there is just as big as in the en-nl collection.

After searching through the file for some typical topics, I think the percentage of problematic sentences is not very high. There are a lot of sentences with strong opinions about all kinds of political topics, but almost all of them use acceptable language. I am for the QA process instead of the Sentence Collector.

Which might still not be what we want to show on Common Voice, though. Even if the language is acceptable, the context within a sentence might be heavily opinionated, and I personally think Mozilla should refrain from displaying potentially divisive political statements. Of course some will be submitted through the Sentence Collector anyway. Do we know of any way to filter out some of the potentially more far-left/far-right politicians from those datasets? (This is my opinion and I’m totally fine if y’all decide differently.)

An example (and that could also be about a far left topic, just what came to mind here):

“All foreigners are …” is bad language; “All foreigners should be deported” is not per se bad language, but it still might create a weird dissonance for people on Common Voice. I’m sure some assume that the sentences are vetted “by Mozilla” and would therefore associate Mozilla with these sentences.

Just my 2 cents 🙂

There are sentences live on the site right now along the lines of “He said [controversial statement]” or “He believed [controversial opinion]”.

Are these OK because they reference what a person said, rather than stating it directly as if it were a fact?

It’s a thin line, I fully agree there 🙂

The new swiping mode of the Sentence Collector makes the review process much quicker, and it would filter out the worst sentences. I would be willing to review maybe 10,000 sentences in German (I already reviewed that many for the Esperanto sentence collection). We would need at least another 19 people doing the same to import the complete dataset for one language – likely more, since sentences need more than two votes when people disagree.

That being said, I recommend that everyone download the dataset and search for words, topics and phrases that come to mind that could be problematic. As far as I can see, there are very few really problematic sentences.

In the Europarl dataset most controversial opinions are part of a longer sentence, like “Mister President, I have to say that …”, and this puts the opinion in a context that makes it easier to read for someone who doesn’t like it. But there will be some people who complain about some sentences, since they are all highly political. I could live with that.

Happy to hear that 🙂

I didn’t review it, so if most of them are in this format or similar, I’m totally fine with a full import, relying on the reporting function.

Are there any notable reactions to controversial sentences that exist in the dataset right now? Did you get any angry emails yet?

Most sentences are only recorded by one person, so the impact of a bad sentence is likely not very high. One could also delete some topics with a blacklist as we go, based on the things we find over time.
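A minimal sketch of such a blacklist filter (“scum” is the one term mentioned earlier in this thread; further terms would be added as reviewers find them):

```python
# Evolving blacklist maintained as problematic terms are found over time.
BLACKLIST = {"scum"}

def is_acceptable(sentence: str) -> bool:
    """Reject sentences containing a blacklisted word."""
    words = {w.strip(".,;:!?\"'").lower() for w in sentence.split()}
    return not words & BLACKLIST

print(is_acceptable("They called them scum."))  # False
```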

Here is a sample file with 300 random English sentences from fr-en; the only thing I changed before creating it was deleting sentences longer than 14 words: