Hi, this is my first post here, so please go easy on me.
I started working on a Slovak dataset since I didn’t find much for Slovak, so I made an extract from the Europarl corpus. It’s a three-step process: first a bash script reusing and modifying some of the rules above, then a Python script for shuffling and sample extraction, and finally I copied the result into a spreadsheet. I may rewrite the first two steps in Python only, which I think would make them easier and more robust to reuse:
cat europarl-v7-sk.txt |
sort -u |
grep -v '[:()]' |
grep '^[0-9"A-Za-záčďéíľĺňóôöŕšťúýžÁČĎÉÍĽĹŇÓÔŔŠŤÚÝŽ,.?! -]*$' |
grep '^[A-ZÁČĎÉÍĽĹŇÓŔŠŤÚÝŽ]' |
awk 'NF>=3 && NF<=14' |
grep -v ' the ' |
grep -v 'The ' |
grep -v '\.\.\.' |
grep -v ',$' |
grep -v ' -' |
grep -v "[-']" |
grep '[.?!]$' |
grep -v '[0-9/&]' |
grep -v '[A-Z][A-Z]' > test.txt
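As a sketch of what the Python-only rewrite of the filtering step could look like (untested on the full corpus; the regexes below are my rough translation of the grep/awk rules above, and file names are carried over from the pipeline):

```python
import re
from pathlib import Path

# characters allowed anywhere in a sentence (mirrors the big grep whitelist)
ALLOWED = re.compile(r'^[0-9"A-Za-záčďéíľĺňóôöŕšťúýžÁČĎÉÍĽĹŇÓÔŔŠŤÚÝŽ,.?! -]*$')

def keep(line):
    """Return True if a line passes roughly the same filters as the pipeline."""
    s = line.strip()
    words = s.split()
    return (
        ALLOWED.match(s) is not None
        and s[:1].isupper()                        # must start with a capital
        and 3 <= len(words) <= 14                  # awk 'NF>=3 && NF<=14'
        and " the " not in s and "The " not in s   # drop English sentences
        and "..." not in s
        and re.search(r"[-'():/&0-9]", s) is None  # no digits, dashes, brackets
        and re.search(r"[.?!]$", s) is not None    # must end like a sentence
        and re.search(r"[A-Z]{2}", s) is None      # no acronyms
    )

if Path("europarl-v7-sk.txt").exists():
    with open("europarl-v7-sk.txt", encoding="utf-8") as f:
        unique = sorted(set(f))                    # mirrors `sort -u`
    with open("test.txt", "w", encoding="utf-8") as out:
        out.writelines(line for line in unique if keep(line))
```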
python:
import numpy as np

sample_size = 4067

with open("test.txt") as infile:
    sentences = np.array(infile.readlines())

np.random.shuffle(sentences)  # one shuffle is enough

with open("shuffled.txt", "w") as outfile:
    for line in sentences:
        outfile.write(line)

# sample without replacement so no sentence appears twice
rnd = np.random.choice(len(sentences), size=sample_size, replace=False)
with open("sample.txt", "w") as sample:
    for idx in rnd:
        sample.write(sentences[idx])
I haven’t started with evaluation yet, so if there are any Slovaks here, I’d be happy for your help: sample spreadsheet
This is a first version/draft, so if you run into problems or issues, please let me know. I also noticed that my process removed quite a lot of sentences that could “easily” be fixed, but fixing them requires extra manual labor (rewriting numbers as words and such)…
I also found another source of texts from our government (one can filter by license & file type): CC0 datasets
But I only saw PDFs among the (somewhat) relevant documents, so it’s questionable how difficult it will be to extract text from them.
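For digital (not scanned) PDFs, something like pypdf might be enough; a minimal sketch, assuming pypdf as a dependency (it is not part of anything above, just my suggestion):

```python
from pathlib import Path

def extract_pdf_text(path):
    """Extract plain text from a PDF. Returns "" if the file is missing
    or pypdf (an assumed dependency) is not installed."""
    p = Path(path)
    if not p.exists():
        return ""
    try:
        from pypdf import PdfReader
    except ImportError:
        return ""
    reader = PdfReader(str(p))
    # extract_text() may return None for pages without a text layer
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Scanned (image-only) PDFs have no text layer, so they would need OCR instead.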
Thank you!