Update: found 2 errors in the filter.sh script.
I have extracted a bunch of Danish sentences from the Europarl dataset.
cat europarl-v7.da-en.da | source filter.sh >europarl-filter-da.txt
shuf europarl-filter-da.txt | ruby uniq_stem.rb >europarl-v7-da.txt
where the filter.sh script is
grep -v '[^a-zæøåA-ZÆØÅ,.?! ]' | # whitelist of allowed letters and symbols
grep -v '.[A-ZÆØÅ]' | # remove names (words starting with an uppercase letter) inside the line
awk 'length<102' |
awk 'length>25' |
awk 'NF<=14' |
grep -P '^[A-ZÆØÅ]' | # only lines that start with an uppercase letter
grep -P '[a-zæøå0-9][.?!…]*$' | # only lines that end with a word and an end symbol
grep -P '[a-zæøå][a-zæøå][a-zæøå]' |
grep -v '[a-zæøå]\..' | # remove abbreviations like hr. and stk.
grep -v -P ' (hr|nr).$' | # remove abbreviations at the end of the line
sed -r 's/ +/ /g' |
sed -r 's/ ?'"'"' s /'"'"'s /g' |
grep -v -P '(bomb|dræb|drab|tortur|terror|myrd)' |
grep -v ' dør[ ,.]' |
grep -v ' kl\.' |
hunspell -d da_DK -L -G | # spell check to filter out misspellings
sort |
uniq
I started with the Dutch script and worked from there. The first line, grep -v '[^a-zæøåA-ZÆØÅ,.?! ]', is a whitelist that tells which symbols are allowed in this collection of sentences. This is very effective at removing all strange symbols from the lines. It also removes numbers, since the file has a lot of case numbers and similar.
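To see the whitelist filter in isolation, here is a small sketch with made-up sample lines (the sentences are hypothetical, chosen only for illustration):

```shell
# Lines with digits, slashes or parentheses are dropped; clean sentences pass.
printf 'Dette er en almindelig sætning.\nSag nr. 1234/2006 (KOM).\nHvad sker der?\n' |
grep -v '[^a-zæøåA-ZÆØÅ,.?! ]'
```

Only the first and last lines survive; the middle one contains digits and punctuation outside the whitelist.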
On top of that, I removed abbreviations and words related to war.
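The abbreviation filters can also be tried on their own. A small sketch with hypothetical lines, using two of the filters from the script above:

```shell
# Drop lines with a period inside the line (hr. etc.) and clock-time abbreviations.
printf 'Tak til hr. Hansen.\nMødet starter kl. ni i morgen.\nDet er en god ide.\n' |
grep -v '[a-zæøå]\..' |
grep -v ' kl\.'
```

The first two lines are removed (an inner "hr." and a "kl."); only the last line passes.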
Next I applied a Ruby script called ‘uniq_stem.rb’, which looks like this:
# coding: utf-8
stemcount = Hash.new(0)
while gets
  t = $_
  s = t.gsub(/[^\s\w\dæøåÆØÅ]/, ' ') # turn unusual symbols into spaces
       .gsub(/\s+/, ' ')             # collapse multiple spaces (gsub, not gsub!, which returns nil when nothing matches)
       .downcase                     # make all letters lowercase
       .split(' ')
       .delete_if { |w| w.to_i > 0 } # remove numbers
       .collect { |w| w[0..3] }      # keep only the stem part (first 4 characters)
  mc = s.collect { |w| stemcount[w] }.min
  # puts [mc, s].inspect
  if mc && mc < 1                    # nil-guard for lines with no stems
    puts t
    s.each { |w| stemcount[w] += 1 }
  end
end
# puts [stemcount.length, stemcount].inspect
It lets a sentence pass through if it contains a stem (the first 4 symbols of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
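The same idea can be sketched as an awk one-liner (a simplified sketch with made-up input that skips the symbol cleanup and number removal of the Ruby version):

```shell
# Keep a line only if it contains at least one 4-character stem not seen before.
printf 'en god dag\nen god nat\ndag og nat\nen god dag\n' |
awk '{
  keep = 0
  for (i = 1; i <= NF; i++)
    if (!(tolower(substr($i, 1, 4)) in seen)) keep = 1
  if (keep) {
    print
    for (i = 1; i <= NF; i++) seen[tolower(substr($i, 1, 4))] = 1
  }
}'
```

The first three lines each contribute a new stem and are printed; the fourth repeats only known stems and is dropped.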
Hope it is useful.