For the past few days I have been working on acquiring unlicensed (public-domain) texts from Polish literature hosted on wolnelektury.pl. I have downloaded around 2700 public-domain works from, let's say, the modern periods. This is a lot of text to process.
I am not a natural language processing expert, and my initial idea was to process it this way:
- split the texts into sentences using an NLP segmenter
- filter out sentences that are too long
- for each sentence, find its least common word; if that word falls below some rarity threshold, rule out the sentence as well
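The filtering steps above can be sketched roughly like this. Everything here is an assumption on my part: the frequency table, the thresholds, and the naive tokenizer are all placeholders, and a real pipeline would use a proper Polish tokenizer/lemmatizer and a genuine corpus frequency list.

```python
import re

# Hypothetical word-frequency table (word -> corpus hit count);
# in practice this would come from a real corpus frequency list.
FREQ = {"kot": 5000, "i": 100000, "pies": 4800, "abrakadabra": 2}

MAX_WORDS = 12   # drop sentences longer than this (assumed threshold)
MIN_HITS = 100   # drop sentences whose rarest word has fewer hits

def tokenize(sentence):
    # Naive tokenizer: lowercased word characters only.
    return re.findall(r"\w+", sentence.lower())

def keep(sentence):
    words = tokenize(sentence)
    if not words or len(words) > MAX_WORDS:
        return False
    # Unknown words get 0 hits, so they fail the rarity check.
    rarest = min(FREQ.get(w, 0) for w in words)
    return rarest >= MIN_HITS

sentences = ["Kot i pies.", "Abrakadabra kot."]
kept = [s for s in sentences if keep(s)]  # only the first survives
```

The key design point is that a single rare word is enough to reject the whole sentence, which matches the "least common word" rule.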
For splitting I used an external program, which did quite well but still left some broken sentences (hard to tell how many exactly). At the moment I am in the middle of the last step. To measure a word's rarity, I search for it in an online Polish language corpus database and check how many hits it has; if it is too rare, I drop the sentence. So far I think I have made some good progress, but it is not good enough. I had an idea to pick some portion of those sentences, drop them into a collection tool, and count on the review phase to do the job, but I am somewhat hesitant about this. There are a lot of sentences that are wrong at first sight, but filtering them all out with a set of rules seems very time-consuming, and I don't have much spare time.
To give some numbers: I have made it through roughly 1/6 of the texts and have 70k sentences, with about 25k filtered out due to rarity and another 25-30k as too long. I assume that out of these 70k, even 30-50% may be wrong due to formatting, context, or other reasons, and there is still a lot of text to review by hand. And that's only about 1/6 of the whole text base I have.
Can anyone help me with this? Any ideas or suggestions are welcome. I am planning to push the files to GitHub sometime next week, once I complete my initial processing.