Planning one more pass on the English wiki text and then I’ll start on Europarl for English.
Great, I won’t start with any of this before next week, so feel free to be quicker
Given that things seem settled on this, I went ahead and added a subset of the Dutch sentences, following the guidelines used for German and adding a couple of restrictions of my own.
The pull request is there: https://github.com/mozilla/voice-web/pull/2643
Merging this is especially relevant for Dutch, because Common Voice by now has multiple recordings of all existing sentences. More diversity would really be nice, and this is a good way to get there while the Wikipedia extraction gets sorted out.
I have been playing with the Danish part today. Let's see what I can do tomorrow.
Congratulations to the Dutch community for getting the EuroParl corpus cleaned up and incorporated (250K sentences) https://github.com/mozilla/voice-web/pull/2643
As I commented on the PR, it would be good if everyone who has been cleaning up their Europarl corpus worked together to publish a guide for other communities to do the same:
- What are the steps you took to get it done?
- What tools did you use and how?
- What other tips and tricks did you learn through the process?
Thanks!
Update: found and fixed 2 errors in the filter.sh script.
I have extracted a bunch of Danish sentences from the Europarl dataset.
cat europarl-v7.da-en.da | source filter.sh >europarl-filter-da.txt
shuf europarl-filter-da.txt | ruby uniq_stem.rb >europarl-v7-da.txt
where the filter.sh script is:
grep -v '[^a-zæøåA-ZÆØÅ,.?! ]' | # whitelist of allowed letters and symbols
grep -v '.[A-ZÆØÅ]' | # remove names (uppercase letters inside the sentence)
awk 'length<102' | # max 101 characters per line
awk 'length>25' | # min 26 characters per line
awk 'NF<=14' | # max 14 words per line
grep -P '^[A-ZÆØÅ]' | # only lines that start with an uppercase letter
grep -P '[a-zæøå0-9][.?!…]*$' | # only lines that end with a word and an end symbol
grep -P '[a-zæøå][a-zæøå][a-zæøå]' | # require at least three consecutive lowercase letters
grep -v '[a-zæøå]\..' | # remove abbreviations like hr. and stk.
grep -v -P ' (hr|nr).$' | # remove abbreviations at the end
sed -r 's/ +/ /g' | # collapse multiple spaces
sed -r 's/ ?'"'"' s /'"'"'s /g' | # reattach a split possessive 's
grep -v -P '(bomb|dræb|drab|tortur|terror|myrd)' | # remove violence-related words
grep -v ' dør[ ,.]' | # remove sentences containing "dør"
grep -v ' kl\.' | # remove the abbreviation "kl."
hunspell -d da_DK -L -G | # spell check to filter out misspellings
sort |
uniq # remove duplicate lines
I started with the Dutch script and worked from there. The first line, grep -v '[^a-zæøåA-ZÆØÅ,.?! ]', is a whitelist that says which symbols are allowed in this collection of sentences. This is super effective at removing all strange symbols from the lines. I also remove numbers, since the file has a lot of case numbers and similar.
I also removed abbreviations and words related to war.
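If it helps anyone who is more comfortable in Python than in a grep pipeline, the whitelist rule can be sketched like this (just the same idea as the first grep, not a replacement for the script above; the example sentences are made up):

import re

# Same idea as grep -v '[^a-zæøåA-ZÆØÅ,.?! ]': reject a sentence if it contains
# any character outside the whitelist (Danish letters, basic punctuation, space).
NOT_ALLOWED = re.compile(r"[^a-zæøåA-ZÆØÅ,.?! ]")

def passes_whitelist(sentence):
    return NOT_ALLOWED.search(sentence) is None

print(passes_whitelist("Det er en god plan."))        # True
print(passes_whitelist("Det er en god idé."))         # False: "é" is not whitelisted
print(passes_whitelist("Sag nr. 1234 blev afvist."))  # False: digits are not whitelisted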
Next I applied a Ruby script called ‘uniq_stem.rb’ that looks like this:
# coding: utf-8
stemcount = Hash.new(0)
while gets
  t = $_
  s = t.gsub(/[^\s\w\dæøåÆØÅ]/, ' ')  # turn unusual symbols into spaces
       .gsub(/[\s]+/, ' ')            # collapse whitespace runs into single spaces
       .downcase                      # make all letters lowercase
       .split(' ')
       .delete_if { |w| w.to_i > 0 }  # remove numbers
       .collect { |w| w[0..3] }       # only keep the stem (first 4 characters)
  next if s.empty?                    # skip lines with no usable words
  mc = s.collect { |w| stemcount[w] }.min
  # puts [mc, s].inspect
  if mc < 1                           # at least one stem has not been seen before
    puts t
    s.each { |w| stemcount[w] += 1 }
  end
end
# puts [stemcount.length, stemcount].inspect
It lets a sentence pass through if it contains a stem (the first 4 characters of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
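In case a Ruby-free version is useful, the same stem trick can be sketched in Python (reads stdin, writes stdout; a sketch, not verified against the Ruby output):

import re
import sys
from collections import defaultdict

# Keep a sentence only if at least one of its 4-character word stems is new.
stem_count = defaultdict(int)

for line in sys.stdin:
    words = re.sub(r"[^\w\s]", " ", line).lower().split()  # strip unusual symbols (\w is Unicode-aware, so æøå are kept)
    stems = [w[:4] for w in words if not w.isdigit()]       # drop numbers, keep stems
    if stems and min(stem_count[s] for s in stems) < 1:     # at least one unseen stem
        sys.stdout.write(line)
        for s in stems:
            stem_count[s] += 1

It would be used the same way as the Ruby script, e.g. something like shuf europarl-filter-da.txt | python3 uniq_stem.py > europarl-v7-da.txt.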
Hope it is useful.
@stergro Can we update the first message on this topic to note all the information we need if other languages are doing the extraction?
- Total number of sentences
- Source of the sentences
- What clean-up was applied? (including our general criteria for sentences)
- How many sentences were reviewed as part of the QA (4000 in this case)
- Link to the spreadsheet with the QA.
- Error rate of the QAed sentences.
See the German PR for an example of this information.
Also include any links to the tools you have been using for the extraction and clean-up.
Thanks!
Hi,
There is no Catalan data in the Europarl dataset, but months ago we translated the Spanish data to Catalan using Apertium (an open-source machine translation engine). From 2,123,835 Spanish source strings, we got 1,965,734 Catalan translated strings.
I wonder whether, after filtering them and doing QA, it would be possible to import these machine-translated strings from the Europarl dataset into Common Voice?
I just updated the first post with a table of all languages that have already imported the corpus and some information about the QA-process.
We have been avoiding direct translations of text from other languages because we can’t ensure the quality and ease of reading will be the same as in the source.
I would say that first we would need the Spanish one QAed and incorporated, and then QA the quality of this machine translation, to make sure the quality of the source is good enough and doesn’t impact your derived one.
What you ask for is already done. The Spanish Europarl corpus we use has more than 10,000 sentences fixed (spelling, grammar, typography, encoding errors). We can share these corrected Spanish sentences and then do the QA of the Spanish-Catalan sentences in parallel. Is that OK for you?
This looks good. How many sentences are we getting after clean-up for Spanish and Catalan?
I will complete the analysis and clean-up and will report the results.
Hey, I would like to find a way to import all 20 languages, or at least all languages with an existing rule file in the cv-sentence-extractor. I think I could start doing the work for English, but I would love to find a more general way to import the sentences, because I don’t want to filter a language I do not speak.
What are your thoughts on this? Should we push forward and import all existing languages, or should we stay in the current mode, where languages get imported when someone is motivated to do the work for one language?
If we can extract all languages, it would be good, because it avoids the technical knowledge requirement for some communities. Then we would just need to ask native speakers to review the result.
@mkohler thoughts?
Extracting all might get tricky, but if languages have an existing rules file for the Sentence Extractor I can see this working quite nicely. As the Sentence Collector would in theory support adding a new source, I’d say we should go down that road. More info on that here: https://github.com/Common-Voice/cv-sentence-extractor#adding-another-scrape-target
I haven’t looked yet at the data source structure, but if it’s straightforward, I’d be fine with adding the fetch and prepare code in the sentence-extractor as well, so that everything is in one place.
The other question then is how to trigger the extraction job. For now we only trigger it on merges, but don’t have a trigger for anything that wouldn’t have a PR. I’ll think about that.
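To make that concrete, the "all languages with a rules file" loop could look roughly like this (the rules path and the extraction command are assumptions on my side, not how the job is actually wired up):

import pathlib
import subprocess

RULES_DIR = pathlib.Path("src/rules")  # assumed location of the per-language rule files
languages = sorted(p.stem for p in RULES_DIR.glob("*.toml"))
print("Languages with an existing rules file:", languages)

for lang in languages:
    # Hypothetical invocation; replace with the real cv-sentence-extractor command.
    subprocess.run(["cargo", "run", "--", "extract", "--language", lang], check=True)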
hi, this is my first post here, so please, don’t crunch me
I started working on a Slovak dataset as I didn’t find much there, so I made an extract from the corpus. I used a 3-step process: first a bash script reusing and modifying some of the rules above, then a Python script for shuffling and sample extraction, and then I just copied the result to a spreadsheet. I may rewrite the first 2 steps in Python only, for easier/more robust re-use:
cat europarl-v7-sk.txt |
sort -u | # sort and remove duplicates
grep -v "[:()]" | # remove lines with colons or parentheses
grep "^[0123456789\"-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZáčďéíľĺňóôöŕšťúýžÁČĎÉÍĽĹŇÓÔŔŠŤÚÝŽ,.?! ]*$" | # whitelist of allowed characters
grep "^[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉÍĽĹŇÓŔŠŤÚÝŽ]" | # must start with an uppercase letter
awk "NF<=14" | # at most 14 words
awk "NF>=3" | # at least 3 words
grep -v " the " | # drop English leftovers
grep -v "The " |
grep -v "\.\.\." | # drop ellipses
grep -v ",$" | # drop lines ending with a comma
grep -v " -" | # drop lines with dashes
grep -v "[\-\']" | # drop hyphens and apostrophes
grep "[.?\!]$" | # must end with sentence punctuation
grep -v "[0-9\/&]" | # drop digits, slashes and ampersands
grep -v "[A-Z][A-Z]" > test.txt # drop acronyms (two consecutive uppercase letters)
python:
import numpy as np

sample_size = 4067
sentences = open("test.txt", 'r').readlines()
sentences = np.array(sentences)
for i in range(10):
    np.random.shuffle(sentences)
with open("shuffled.txt", 'w') as outfile:
    for line in sentences:
        _ = outfile.write(line)
# draw a random sample (indices chosen with replacement) for the QA spreadsheet
rnd = np.random.randint(0, len(sentences), size=sample_size)
with open("sample.txt", 'w') as sample:
    for idx in rnd:
        _ = sample.write(sentences[idx])
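For reference, a Python-only version of the first two steps might look roughly like this (the rules loosely mirror the bash pipeline above, the filenames are placeholders, and it is not verified against the bash output):

import random
import re

SAMPLE_SIZE = 4067

# Rules loosely mirroring the bash pipeline above (a sketch, not a 1:1 port).
WHITELIST = re.compile(r'^["a-zA-Záčďéíľĺňóôöŕšťúýž'
                       r'ÁČĎÉÍĽĹŇÓÔŔŠŤÚÝŽ,.?! ]*$')
STARTS_UPPER = re.compile(r'^[A-ZÁČĎÉÍĽĹŇÓŔŠŤÚÝŽ]')
ENDS_SENTENCE = re.compile(r'[.?!]$')

def keep(sentence):
    words = sentence.split()
    return (WHITELIST.match(sentence) is not None
            and STARTS_UPPER.match(sentence) is not None
            and ENDS_SENTENCE.search(sentence) is not None
            and 3 <= len(words) <= 14
            and " the " not in sentence and "The " not in sentence  # English leftovers
            and "..." not in sentence
            and not re.search(r"[-'0-9/&:()]", sentence)            # digits, symbols, hyphens
            and not re.search(r"[A-Z][A-Z]", sentence))             # acronyms

with open("europarl-v7-sk.txt", encoding="utf-8") as f:
    sentences = sorted({line.strip() for line in f if keep(line.strip())})

random.shuffle(sentences)
with open("shuffled.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sentences) + "\n")

# random.sample draws without replacement, unlike np.random.randint above.
sample = random.sample(sentences, min(SAMPLE_SIZE, len(sentences)))
with open("sample.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sample) + "\n")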
I haven’t started with the evaluation yet, so if there are any Slovaks here, I’d be happy for your help here: sample spreadsheet
Also, this is a first version/draft, so if you encounter many problems/issues, please let me know. I also noticed that quite a lot of sentences that could “easily” be fixed were removed by my process, but fixing them requires extra manual labor (re-writing numbers as text and such)…
Also, I found another source for some texts from our government (one can filter based on license & file type): CC0 datasets
But I saw only PDFs for the (somewhat) relevant documents, so it’s an open question how difficult it will be to extract text from these.
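If the PDFs have a proper text layer, a first pass could be as simple as this sketch (using pdfminer.six; the file name is just a placeholder):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("government_document.pdf")  # placeholder file name
# Split into rough line candidates for later filtering/sentence extraction.
candidates = [line.strip() for line in text.splitlines() if line.strip()]
print(len(candidates), "candidate lines")

Scanned PDFs without a text layer would need OCR instead, which is a much bigger job.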
Thank you!
I prepared a Polish dataset based on the Europarl parallel corpus. I wrote up the details in a new topic, Polish dataset from Europarl - help needed, since I need help with the QA. I think this dataset is somewhat specific, but most sentences are useful, I hope. Besides some “political” language, there are many common sentences in modern Polish and common geographical names.
Just a little reminder that this is still an option for 20 European languages.