Using the Europarl Dataset with sentences from speeches from the European Parliament

What’s the estimated percentage of problematic sentences?

Again, if we have a lot of sentences, we can run a QA process on them to understand this percentage, as we did for the Wikipedia extraction.


I’d be interested to take a look at the English side of things. But there doesn’t appear to be a standalone English version. Is the English content the same in all of the packages, so I could just choose one at random and extract the English translations? Or would I need to extract from all languages?

I am not sure. I compared the English file from the Dutch and the German collections, and the beginnings of these files look identical, but they don't have the same size; the Dutch one is much bigger (297 MB vs 307 MB).

Edit: it looks like the biggest file is the fr-en collection, but the English file there is just as big as in the en-nl collection.

After searching through the file for some typical topics, I think the percentage of problematic sentences is not very high. There are a lot of sentences with strong opinions about all kinds of political topics, but almost all of them use acceptable language. I am for the QA process instead of the sentence collector.

That might still not be what we want to show on Common Voice, though. Even if the language is acceptable, the context within a sentence might be heavily opinionated, and I personally think Mozilla should refrain from displaying potentially divisive political content. Of course some will be submitted through the Sentence Collector anyway. Do we know of any way to filter out sentences from the more far-left/far-right politicians in those datasets? (This is my opinion and I'm totally fine if y'all decide differently.)

An example (and that could also be about a far left topic, just what came to mind here):

“All foreigners are …” is bad language; “All foreigners should be deported” is not bad language per se, but it might still create a weird dissonance for people on Common Voice. I'm sure some assume that the sentences are vetted “by Mozilla” and would therefore associate Mozilla with these sentences.

Just my 2 cents :slight_smile:

There are sentences live on the site right now along the lines of “He said [controversial statement]” or “He believed [controversial opinion]”.

Are these ok because they are referencing what a person said and not saying it directly as if it was a fact?

It’s a thin line, I fully agree there :slight_smile:

The new swiping mode of the Sentence Collector makes the review process much quicker, and it would filter out the worst sentences. I would be willing to review maybe 10,000 sentences in German. (I already reviewed that many for the Esperanto sentence collection.) We would need at least another 19 people doing the same to import the complete dataset for one language, and likely more, since sentences need more than two votes when people disagree.

That being said, I recommend that everyone download the dataset and search for any words, topics and phrases that come to mind that could be problematic. As far as I can see, there are very few really problematic sentences.

In the Europarl dataset, most controversial opinions are part of a longer sentence like “Mister President, I have to say that …”, and this puts the opinion in a context that makes it easier to read for someone who doesn't agree with it. Some people will still complain about certain sentences, since they are all highly political, but I could live with that.

Happy to hear that :slight_smile:

I didn't review it, so if most of them are in this format or similar, I'm totally fine with a full import and relying on the reporting function.

Are there any notable reactions to the controversial sentences that exist in the dataset right now? Did you get any angry emails yet?

Most sentences are only recorded by one person, so the impact of a bad sentence is likely not very high. One could also delete certain topics with a blacklist as we go, based on what we find over time.
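The blacklist idea above could be sketched roughly like this. This is a minimal illustration, not an actual Common Voice tool; the `BLACKLIST` terms and the example sentences are made-up placeholders, and a real list would come from community review:

```python
# Hypothetical topic blacklist; the real terms would be collected over time
# from sentences the community reports as problematic.
BLACKLIST = {"deported", "bombs"}

def passes_blacklist(sentence, blacklist=BLACKLIST):
    """Return True if the sentence contains none of the blacklisted words."""
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    return words.isdisjoint(blacklist)

sentences = [
    "Mister President, I have to say that the budget is late.",
    "All foreigners should be deported.",
]
kept = [s for s in sentences if passes_blacklist(s)]
```

A simple word-level match like this misses inflected forms and multi-word phrases, so it would only be a first pass before human reporting.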

Here is a sample file with 300 random English sentences from fr-en; the only thing I changed before creating it was deleting sentences longer than 14 words:

Interesting, will take a look at the Dutch sentences.
Concerning validating, swiping is nice, but you need a touch screen. On a desktop/laptop I’d like to be able to have more sentences like 10 or 20 on 1 page, have a “Check all” option and validate, instead of clicking all one by one which is tedious.


I started a thread in the German section of this forum to discuss what the German community wants:

Concerning the sentence collector:

True, I would love to have this too.
For now, you can use the Selenium IDE browser plugin to automate smaller work steps; it lets you record and replay clicks on a website. For example, when all good sentences are successfully reviewed and only the bad sentences with one downvote are left, you can use it to automate the second downvote. But be careful with it!

I know; I have used iMacros to automate tasks in Firefox and it works very well. In this case I would first need to read (only) 5 sentences before I could hit a macro button to validate, if the sentences are correct. I read faster than I can hit 5, 10 or 20 buttons, so it would be nice to have more sentences on one page and click one button to validate.


I would ask everyone to avoid any kind of automation tool for the Sentence Collector. The whole point of the tool is to enforce human review of each sentence to ensure quality; otherwise we will end up with a bad corpus for voice collection, delaying the whole process.

If you have a big public-domain corpus (> 500K sentences) coming from a trusted source, please reach out to me independently and we can figure out a different QA process than the sentence collector. But note that we currently don't have the team bandwidth to run a process for smaller corpora that ensures the high quality we are looking for.

Thanks!

Also, as I commented over Slack, we probably want to remove the 60K Dutch sentences from the collector and see if we can follow a QA process, for all languages, that doesn't involve individual review of sentences from large and trusted text sources.

I created a pull request for the German corpus:

Anyone who wants to help with the review process is welcome to help :slight_smile:

This corpus has around 379K sentences after cleanup. Am I too quick here? Would you prefer another process?

Yes, let’s have a separate process here, this is too complex to just have it on a PR.

I’ll reach out directly to explore which options we have for this corpus.


Sounds good. I personally reviewed a thousand sentences out of the 60K I added for Dutch (so more than 1.5% of them) and didn't find a single sentence that would be bad to have in the corpus. The two worst sentences I found were one where a space was missing (“overPakistan” instead of “over Pakistan”) and another that said, without context, “nuclear plants are bombs waiting to explode”, but honestly I don't think anybody would be truly upset if either sentence ended up in the dataset. So I validated all the sentences myself but was planning on letting a second review happen; if there's a different process, I'm fine with that as well. I can also provide longer sentences from that dataset, similarly to what has been done for German, if the German sentences are deemed ok (they should all be translations of each other in the end).

@FremyCompany I’m working with @stergro for the German ones, to avoid overloading the sentence collector.

I think we should take a unified approach for this corpus and apply it to all languages. How many sentences were available for Dutch?

The same amount as in German, I'd say. There are only 60K in the sentence collector because I applied a strict set of rules, while the German sentences were selected more generously from the dataset. But if the German sentences are deemed ok, I can apply the same filtering rules and get approximately the same number of sentences, since those sentences are translations of each other.