Using the Europarl dataset with sentences from speeches in the European Parliament

Alright, I will start reviewing the sentences. Anyone who wants to help can find the link to the sheet here:

Thanks to the great help of @benekuehn and other helpers from the German forum, the 4000 sentences are now reviewed. 94.25% are fine, 2.10% have spelling errors, most of them caused by the German spelling reform of 1996. Another 3.05% are hard to pronounce, mainly names and political terms.

Hey all,
What is the status on this effort at this point?

For the German import the review is done, and this is the pull request waiting to be merged or rejected:

We need to get a green light from our team in charge of dataset quality, as well as a legal review, to be fully sure we can use this content under CC0.

The pull request is merged now:

How did you solve the attribution, @nukeador? Are you now generally ready to import more languages?

Now that the wiki-scraper is capable of filtering all kinds of sentence collections, it should be easy to import more languages from this corpus.

Yes, I’m working with @phirework to have this merged and attributed :slight_smile:

EDIT: This is now merged.

@stergro can we get an extraction for other languages so we can ping communities to do the QA?

@stergro - we added a note in the README for this specific source: https://github.com/mozilla/voice-web/blob/master/README.md#licensing-and-content-source

For other sources we’ll need to get legal to do a case-by-case review of their licenses to see how we want to handle it, but for now feel free to keep pulling things from Europarl.


If I find the time I will prepare a PR for English and Spanish.

Looks like a good solution to me :slight_smile:


Planning one more pass on the English wiki text and then I’ll start on Europarl for English.


Great, I won’t start with any of this before next week, so feel free to be quicker :slight_smile:

Given that things seem settled on this, I went ahead and added a subset of the Dutch sentences, following the guidelines used for German and adding a couple of extra restrictions myself.

The pull request is there: https://github.com/mozilla/voice-web/pull/2643

Merging this is especially relevant for Dutch, because Common Voice already has multiple recordings of every existing sentence. More diversity would really be nice, and this is a good way to get there while the Wikipedia import gets sorted out.


Have been playing with the Danish part today. Let's see what I can do tomorrow.


:tada: Congratulations to the Dutch community for getting the EuroParl corpus cleaned up and incorporated (250K sentences) https://github.com/mozilla/voice-web/pull/2643


As I commented on the PR, it would be good if everyone who has been cleaning up their Europarl corpus worked together to publish a guide for other communities to do the same.

  • What are the steps you took to get it done?
  • What tools did you use and how?
  • What other tips and tricks did you learn through the process?

Thanks!

Update: found 2 errors in the filter.sh script.

I have extracted a bunch of Danish sentences from the Europarl dataset.
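
In case it helps others, this is roughly how the input file can be fetched, assuming the v7 parallel-corpus layout on statmt.org (adjust the URL if the corpus has moved):

wget https://www.statmt.org/europarl/v7/da-en.tgz  # assumed download location for the Danish-English pair
tar -xzf da-en.tgz  # should yield europarl-v7.da-en.da and europarl-v7.da-en.en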

cat europarl-v7.da-en.da | source filter.sh >europarl-filter-da.txt
shuf europarl-filter-da.txt | ruby uniq_stem.rb >europarl-v7-da.txt

where the filter.sh script is

grep -v '[^a-zæøåA-ZÆØÅ,.?! ]' | # whitelist of allowed letters and symbols
grep -v '.[A-ZÆØÅ]' | # remove lines with an uppercase letter inside the sentence (mostly names)
awk 'length<102' | # maximum line length
awk 'length>25' | # minimum line length
awk 'NF<=14' | # at most 14 words
grep -P '^[A-ZÆØÅ]' | # only lines that start with an uppercase letter
grep -P '[a-zæøå0-9][.?!…]*$' | # only lines that end with a word, optionally followed by end punctuation
grep -P '[a-zæøå][a-zæøå][a-zæøå]' | # require at least one run of three lowercase letters
grep -v '[a-zæøå]\..' | # remove abbreviations like hr. and stk. inside the sentence
grep -v -P ' (hr|nr)\.$' | # remove abbreviations at the end of the line
sed -r 's/  +/ /g' | # collapse multiple spaces
sed -r 's/ ?'"'"' s /'"'"'s /g' | # normalize spacing around apostrophe-s
grep -v -P '(bomb|dræb|drab|tortur|terror|myrd)' | # remove sentences about violence and war
grep -v ' dør[ ,.]' | # remove "dør", which can mean "dies"
grep -v ' kl\.' | # remove the time abbreviation kl.
hunspell -d da_DK -L -G | # spell check to filter out lines with misspellings
sort |
uniq

I started with the Dutch script and worked from there. The first line, grep -v '[^a-zæøåA-ZÆØÅ,.?! ]', is a whitelist that tells which symbols are allowed in this collection of sentences. This is super effective in removing all strange symbols from the lines. I also removed numbers, since the file has a lot of case numbers and similar.
I also removed abbreviations and words related to war.
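
To illustrate, here is a quick hypothetical check of the whitelist line on two made-up sentences; only the first survives, because the second contains digits and a slash:

printf '%s\n' 'Det er en god plan.' 'Sag nr. 1999/0056 blev vedtaget.' |
grep -v '[^a-zæøåA-ZÆØÅ,.?! ]'
# prints only: Det er en god plan.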

Next I applied a Ruby script called ‘uniq_stem.rb’, which looks like this:

# coding: utf-8

stemcount = Hash.new(0)
while gets
  t = $_
  s = t.gsub(/[^\s\w\dæøåÆØÅ]/, ' ') # turn unusual symbols into spaces
       .gsub(/\s+/, ' ')             # collapse runs of whitespace
       .downcase                     # make all letters lowercase
       .split(' ')
       .delete_if { |w| w.to_i > 0 } # remove numbers
       .collect { |w| w[0..3] }      # keep only the first four characters as a crude stem

  mc = s.collect { |w| stemcount[w] }.min
  #  puts [mc, s].inspect
  if mc && mc < 1 # min is nil for empty lines, so guard against that
    puts t
    s.each { |w| stemcount[w] += 1 }
  end
end

# puts [stemcount.length, stemcount].inspect

It lets a sentence pass through if it contains a stem (the first 4 characters of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
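
For example, in this hypothetical run the second sentence shares all of its stems (form, takk, medl) with the first and is dropped, while the third introduces new stems and passes:

printf '%s\n' 'Formanden takkede medlemmerne.' 'Formanden takkede medlemmet.' 'Kommissionen svarede hurtigt.' |
ruby uniq_stem.rb
# prints only the first and third sentence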

Hope it is useful.


@stergro Can we update the first message on this topic to note all the information we need when other languages do the extraction?

  • Total number of sentences
  • Source of the sentences
  • What clean-up was applied? (including our general criteria for sentences)
  • How many sentences were reviewed as part of the QA (4000 in this case)
  • Link to the spreadsheet with the QA.
  • Error rate of the QAed sentences.

Example of this information on the German PR

Any links to the tools you have been using for the extraction and clean-up.

Thanks!

Hi,

There is no Catalan data in the Europarl dataset, but months ago we translated the Spanish data to Catalan using Apertium (an open-source machine translation engine). From 2,123,835 Spanish source strings, we got 1,965,734 Catalan translated strings.

I wonder: after filtering them and doing QA, would it be possible to import these machine-translated strings from the Europarl dataset into Common Voice?
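
For anyone curious, the basic invocation looks roughly like this; a minimal sketch assuming the apertium CLI and the es-ca language pair are installed (e.g. from the apertium-es-ca package), with hypothetical file names:

echo 'La comisión aprobó el informe.' | apertium es-ca  # Spanish to Catalan
apertium es-ca <europarl-v7.es-en.es >europarl-apertium-ca.txt  # batch run over the Spanish side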

I just updated the first post with a table of all languages that have already imported the corpus, plus some information about the QA process.


We have been avoiding direct translation of text from other languages because we can't ensure the quality and ease of reading will match the source.

I would say that we would first need the Spanish one QAed and incorporated, and then QA the quality of this machine translation, to make sure the quality of the source is good enough and doesn't impact your derived version.