Using the Europarl Dataset with sentences from speeches from the European Parliament

Fjoerfoks · December 15, 2019, 1:23pm

Interesting, will take a look at the Dutch sentences.
Concerning validating: swiping is nice, but you need a touch screen. On a desktop or laptop I’d like to see more sentences on one page, say 10 or 20, with a “Check all” option to validate them at once, instead of clicking each one individually, which is tedious.

stergro · December 15, 2019, 5:57pm

I started a thread in the German section of this forum about this issue to discuss what the German community wants:

Concerning the sentence collector:

True, I would love to have this too.
For now, you can use the Selenium IDE browser plugin to automate smaller work steps. It lets you record and replay clicks on a website. For example, once all good sentences have been reviewed and only the bad sentences with one downvote are left, you can use it to automate the second downvote. But be careful with it!

Fjoerfoks · December 16, 2019, 7:48am

I know, I have used iMacro to automate tasks in Firefox and it works very well. In this case I first need to read (only) 5 sentences before I can hit a macrobutton to validate, if the sentences are correct. I read faster than hitting 5, 10 or 20 buttons, so it would be nice to have more sentences on 1 page and click 1 button to validate.

nukeador · December 16, 2019, 2:02pm

I would ask to avoid any kind of automation tool for the sentence collector. The whole point of the tool is to enforce human review of each sentence to ensure quality or we will end up with a bad corpus for voice collection, delaying the whole process.

If you have a big public domain corpus (> 500K sentences) coming from a trusted source, please reach out to me directly and we can figure out a QA process other than the sentence collector. But note that we currently don’t have the team bandwidth to run a process for smaller corpora that ensures the high quality we are looking for.

Thanks!

nukeador · December 16, 2019, 3:10pm

Also, as I commented over Slack, we probably want to remove the 60K Dutch sentences from the collector and see if, for large and trusted sources of text, we can follow a QA process for all languages that doesn’t involve reviewing every sentence individually.

stergro · December 17, 2019, 7:12pm

I created a pull request for the German corpus:

Anyone who wants to help with the review process is welcome.

This corpus has around 379K sentences after cleanup. Am I too quick here? Would you prefer another process?

nukeador · December 18, 2019, 12:09pm

Yes, let’s have a separate process here, this is too complex to just have it on a PR.

I’ll reach out directly to explore which options we have for this corpus.

FremyCompany · December 20, 2019, 10:07am

Sounds good. I personally reviewed a thousand sentences out of the 60K I added for Dutch (so more than 1.5% of them), and didn’t find a single sentence that would be bad to have in the corpus. The two worst sentences I found were one where a space was missing (“overPakistan” instead of “over Pakistan”) and another that said, without context, “nuclear plants are bombs waiting to explode”, but honestly I don’t think anybody would be truly upset if either sentence ended up in the dataset. So I validated all the sentences myself and was planning on letting a second review happen, but if there’s a different process I’m fine with that as well. I can also provide longer sentences from that dataset, similar to what has been done for German, if the German sentences are deemed ok (they should all be translations of each other in the end).
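As a rough illustration of what a clean sample of 1,000 sentences says about the whole set, here is a minimal sketch that computes an upper bound on the share of bad sentences when none were found. It assumes the 1,000 sentences were drawn at random and treats the corpus as much larger than the sample; neither is stated above, and the function name is made up for this example.

```python
def zero_defect_upper_bound(n_reviewed: int, confidence: float = 0.95) -> float:
    """Upper bound on the bad-sentence rate when a random sample of
    n_reviewed sentences contained no bad sentence at all.

    Solves (1 - p) ** n_reviewed = 1 - confidence for p.
    Assumption: simple random sampling from a large corpus.
    """
    if n_reviewed <= 0:
        raise ValueError("n_reviewed must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_reviewed)


# 1,000 sentences reviewed, none considered bad
bound = zero_defect_upper_bound(1000)
print(f"Upper bound on bad-sentence rate: {bound:.2%}")
```

With 1,000 clean sentences the bound is close to the familiar "rule of three" estimate of 3/n, so about 0.3%. If the reviewed sentences were not picked at random, the bound does not apply.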

nukeador · December 20, 2019, 12:18pm

@FremyCompany I’m working with @stergro for the German ones, to avoid overloading the sentence collector.

I think we should take a unified approach for this corpus and apply it to all languages. How many sentences were available for Dutch?

FremyCompany · December 23, 2019, 9:47am

The same amount as in German, I’d say. There are only 60K in the sentence collector because I focused on a strict set of rules, while German sentences were selected more generously from the dataset, but if the sentences from German are deemed ok, I can apply the same filtering rules as them and get approximately the same amount of sentences, since those sentences are translations of each other.

nukeador · December 26, 2019, 12:32pm

For German we are talking about almost 500K sentences. I would prefer it if we could do something similar for Dutch outside the sentence collector.

My advice for Dutch would be:

Make sure, or help make sure, that at least the Wikipedia process is finished, to quickly get a lot of diverse sentences. Talk with @Fjoerfoks, who is leading this effort.
Wait until we see with @stergro how to handle the Europarl dataset so we can run a similar QA process with other languages.

Cheers.

stergro · December 30, 2019, 9:56pm

Hey @nukeador, what are the next steps now? We talked about a possible process with the Excel review sheets, and I think it sounds like quite some work but doable. Reviewing a few thousand sentences for statistically sound results is still better than reviewing 500K sentences. Is it okay if I just prepare such a sheet for the German corpus and we see how it goes? Maybe you can explain the process again to the group so that everyone knows what we are talking about.
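For readers who want a picture of the idea, here is a minimal sketch of drawing a random review sample from a corpus and writing it out as a sheet that opens in Excel. The actual process was discussed privately and is not described in this thread, so the function names, column names, and the toy corpus below are all assumptions made for illustration.

```python
import csv
import io
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Pick sample_size distinct sentences uniformly at random.

    A fixed seed makes the draw reproducible, so reviewers can
    check that the sample was not hand-picked.
    """
    if sample_size > len(sentences):
        raise ValueError("sample_size is larger than the corpus")
    rng = random.Random(seed)
    return rng.sample(sentences, sample_size)


def write_review_sheet(sample, out):
    """Write the sample as CSV with empty columns for the reviewers.

    The column names are hypothetical, not the ones used in the real sheet.
    """
    writer = csv.writer(out)
    writer.writerow(["id", "sentence", "verdict", "comment"])
    for index, sentence in enumerate(sample, start=1):
        writer.writerow([index, sentence, "", ""])


# Toy corpus standing in for the cleaned sentence list
corpus = [f"Beispielsatz Nummer {i}." for i in range(1, 101)]
sample = draw_review_sample(corpus, sample_size=10, seed=42)

buffer = io.StringIO()  # replace with open("review.csv", "w", newline="") for a real file
write_review_sheet(sample, buffer)
print(buffer.getvalue())
```

Drawing the sample at random is what allows the error rates found in the sheet to be extrapolated to the rest of the corpus.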

nukeador · January 2, 2020, 12:33pm

Yes, let’s kick off the process we talked about in private and see how it goes. Once that’s done we can share back with everyone and see how to do it with other languages.

Thanks!

stergro · January 8, 2020, 11:23am

Alright, I will start reviewing the sentences. Anyone who wants to help can find the link to the sheet here:

stergro · January 23, 2020, 7:57am

Thanks to the great help of @benekuehn and other helpers from the German forum, the 4000 sentences are now reviewed. 94.25% are fine and 2.10% have spelling errors, most of them caused by the German spelling reform of 1996. Another 3.05% are hard to pronounce, mainly names and political words.
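To give a sense of how precise an estimate from 4000 reviewed sentences is, here is a minimal sketch that puts a 95% Wilson score interval around the share of fine sentences. The count of 3770 is simply 94.25% of 4000; the sketch assumes the 4000 sentences were a random sample of the corpus, which is not spelled out above, and the function name is chosen for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 94.25% of the 4000 reviewed sentences were fine -> 3770 sentences
low, high = wilson_interval(3770, 4000)
print(f"Share of fine sentences: {low:.1%} to {high:.1%}")
```

Under that assumption, the share of fine sentences in the full corpus would lie roughly between 93.5% and 94.9%.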

FremyCompany · January 31, 2020, 11:04pm

Hey all,
What is the status on this effort at this point?

stergro · February 1, 2020, 10:32am

For the German import the review is done and this is the pull request waiting to be merged or refused:

nukeador · February 3, 2020, 1:07pm

We need to get green light from our team in charge of dataset quality as well as legal review to be fully sure we can use this content under CC0.

stergro · March 6, 2020, 8:54am

The pull request is merged now:

How did you solve the attribution, @nukeador? Are you now generally ready to import more languages?

Now that the wiki-scraper is capable of filtering all kinds of sentence collections, it should be easy to import more languages from this corpus.

nukeador · March 6, 2020, 12:07pm

Yes, I’m working with @phirework to have this merged and attributed

EDIT: This is now merged.

@stergro, can we get an extraction for other languages so we can ping communities to do the QA?