Help needed processing a large Polish text base

Hello all,
For the past few days I have been working on acquiring non-licensed texts from Polish literature hosted on wolnelektury.pl. I have downloaded around 2,700 public-domain works from, let's say, modern periods. This is a lot of text to process.
I am not a natural language processing expert and my initial idea was to process it this way:

  • split the texts into sentences using an NLP segmenter
  • filter out sentences that are too long
  • for each sentence, find its least common word; if that word falls below some rarity threshold, rule out the sentence as well

For splitting I used an external program, which did quite well but still left some broken sentences (hard to tell exactly how many). At the moment I am in the middle of the last point. To measure a word's rarity I look it up in an online Polish language corpus database and check how many hits it has. If it is too rare, I drop the sentence. I think I have made some good progress so far, but it is not good enough yet. I had an idea to pick some portion of those sentences, drop them into the collection tool, and count on the review phase to do the job, but I am somewhat hesitant about that. A lot of sentences are obviously wrong at first sight, but filtering them all out with a set of rules seems very time-consuming, and I don't have much spare time.
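
The pipeline above could be sketched roughly like this. This is only an illustration: the frequency lookup here reads a hypothetical local word-count file instead of querying the online corpus, and the cutoff values are made up.

```python
# Sketch of the filtering pipeline (illustrative only).
# Assumes the sentences are already segmented, and that word frequencies
# have been dumped to a local "word<TAB>count" file (hypothetical format).

MAX_WORDS = 14   # length cutoff (made-up value)
MIN_COUNT = 100  # rarity threshold (made-up value)

def load_frequencies(path):
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").rsplit("\t", 1)
            freqs[word] = int(count)
    return freqs

def keep_sentence(sentence, freqs, min_count=MIN_COUNT):
    words = sentence.split()
    if not words or len(words) > MAX_WORDS:
        return False  # empty or too long
    # the least common word decides the fate of the whole sentence
    rarest = min(freqs.get(w.strip('.,!?…"').lower(), 0) for w in words)
    return rarest >= min_count
```

In practice it would probably pay to cache the corpus hit counts locally anyway, since querying an online corpus once per word is slow.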

Also, to give some numbers: I have made it through roughly 1/6 of the texts and have 70k sentences, with about 25k filtered out due to rarity and another 25–30k as too long. I assume that even 30–50% of this 70k may be wrong due to formatting, context, or other reasons, and that still leaves a lot of text to review by hand. And that is only about 1/6 of the whole text base that I have.

Can anyone help me with this? Any ideas or contributions are welcome. I am planning to push the files to GitHub sometime next week, once I complete my initial processing.

I know @dabinat has been doing some code work to filter sentences that might be useful here.

I should be able to add some volunteer thoughts when I get back this evening, as well.

It started out as something more general-purpose but turned into something a lot more specific to the English wiki import. However, some of the more general rules like word and sentence lengths may still be useful.

I’m not multilingual so I can’t help with the Polish import but feel free to fork my script for other languages: https://github.com/dabinat/cvtools/blob/master/sentence_validator.py

Hi @jakub.wrobel7, welcome!

Just as an initial comment, the team don’t accept new sentences posted directly to GitHub: they have to be uploaded to the Sentence Collector for volunteer checking and verification first. Once each sentence has been approved by two separate volunteers, it will be added to the list ready for people to read. There is more info at https://common-voice.github.io/sentence-collector/#/how-to, including details of words that won’t be accepted and that you should remove by script.

When uploading sentences, you have to specify their source. Whatever you specify for a single upload is permanently stored as the source record against every sentence of that upload. So it seems the expectation is that each PD source should be a separate upload. I’m not sure whether it’s OK to mix 2,700 public-domain works into a single upload.

Thank you @dabinat, I will take a look at it and maybe find something applicable to Polish as well.

Hello @Michael_Maggs, thanks for pointing all this out. I have already read the #howto and have made a few manual sentence donations to try out the collector tool. By posting sentences to GitHub I meant to get some other contributors (preferably Polish speakers, because of the language) to help improve sentence quality before adding them to the Sentence Collector. This way I wanted to avoid manual labor during review, as many sentences will not really be usable and could be refined with a script.
As for the 2,700 works, I still have them separated and will upload them that way, but that is an important point you mentioned.

I’ve been doing a similar exercise for PD novels in English from Project Gutenberg. I don’t find I need to filter out rare words, but if your sources are more specialist you may find that helps. After sentence tokenising, what I do is this:

  1. Reject sentences that will fail the Sentence Collector rules (eg numerals, words with multiple capital letters etc)
  2. Reject sentences that are too long. Up to 14 words are allowed, but as long sentences are much more likely to be read incorrectly, I prefer to select only up to 11 or 12.
  3. Massage quotation marks into standard form, and fix non-matching quote marks where possible
  4. Remove some of the part-sentences that result from errors by the sentence tokeniser, such as non-matching quotation marks that can’t be fixed automatically, or sentences that start with a lower-case letter or end with an upper-case letter
  5. Update a few old-fashioned expressions (“to-day” -> “today” etc)
  6. Randomise the selected sentences before uploading. If you don’t randomise them, they’ll be reviewed in the Sentence Collector in the original order which can mean that they tend to look rather repetitive to the reviewers.
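
A condensed sketch of steps 1, 2, 5 and 6 might look like this. The rule details (the regexes, the 12-word limit, the modernisation table) are illustrative guesses, not the actual Sentence Collector validator:

```python
import random
import re

MAX_WORDS = 12  # stricter than the allowed 14, per step 2

def looks_valid(sentence):
    if re.search(r"\d", sentence):            # step 1: no numerals
        return False
    if re.search(r"[A-Z]{2,}", sentence):     # step 1: no multi-capital words
        return False
    if len(sentence.split()) > MAX_WORDS:     # step 2: too long
        return False
    if sentence[:1].islower():                # likely a tokeniser fragment
        return False
    return True

MODERNISE = {"to-day": "today", "to-morrow": "tomorrow"}  # step 5 (sample)

def modernise(sentence):
    # case-sensitive for simplicity; a real script would handle "To-day" too
    for old, new in MODERNISE.items():
        sentence = re.sub(rf"\b{re.escape(old)}\b", new, sentence)
    return sentence

def select(sentences, seed=None):
    kept = [modernise(s) for s in sentences if looks_valid(s)]
    random.Random(seed).shuffle(kept)         # step 6: randomise before upload
    return kept
```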

If anything is doubtful I remove it, as I think that we ought to be conserving volunteer reviewing time as much as possible.

I also have a script to replace regularly-repeated first names and surnames with a rotating list of several hundred common English names. The purpose is to seed the corpus with a good range of modern names that don’t often occur otherwise in older novels.
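
In case it helps, that name-rotation idea might look something like this. The name lists here are tiny, made-up stand-ins for the several-hundred-name lists described above:

```python
import itertools

# Illustrative stand-ins; the real lists would hold hundreds of names.
OLD_NAMES = ["Bartholomew", "Euphemia"]
MODERN_NAMES = ["Olivia", "Noah", "Amelia", "Liam"]

def replace_names(sentences, old_names=OLD_NAMES, modern_names=MODERN_NAMES):
    cycle = itertools.cycle(modern_names)   # rotate through the modern names
    out = []
    for s in sentences:
        for old in old_names:
            if old in s:
                s = s.replace(old, next(cycle))
        out.append(s)
    return out
```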

Those are some good ideas. A few of them are already implemented in my script, which reassures me that I am moving forward. Thank you.
Reducing sentence length to 11–12 words seems very reasonable; I probably wouldn’t have thought of it.
As for quotation marks, I thought that with so many sentences it is OK to simply drop any sentence where something looks suspicious.
Old-fashioned expressions are the reason I am doing the rarity check. I think it would be quite hard to correct them automatically, given how the Polish language works/worked.
The trick with names is brilliant.
For future reference, I may add the things I thought of that were not mentioned by @Michael_Maggs:

  1. Remove sentences ending with anything other than . ! ? … as this deals with many broken sentences.
  2. Remove sentences that do not start with a capital letter, or with a dialogue dash followed by a capital ( - I would never… )
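
Those two checks are cheap to express as regexes; a rough sketch (the Polish capital letters in the character class and the dash variants are my guesses at what you would need):

```python
import re

# Rule 1: sentence must end with ., !, ? or …
ENDS_OK = re.compile(r"[.!?…]$")
# Rule 2: sentence must start with a capital letter, or with a dialogue
# dash followed by a capital (covers -, – and —)
STARTS_OK = re.compile(r"^(?:[A-ZĄĆĘŁŃÓŚŹŻ]|[-–—]\s*[A-ZĄĆĘŁŃÓŚŹŻ])")

def passes(sentence):
    s = sentence.strip()
    return bool(ENDS_OK.search(s)) and bool(STARTS_OK.match(s))
```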

I agree that we should reduce the time needed to review each sentence; hence this topic.
Any more ideas welcome.

On your point 1, you may find valid sentences that end not only with those characters but also with one or more quotation marks of various types. Sometimes the final full stop will be inside the closing quotation mark, sometimes outside:
He said “Hello”.
He said “Hello.”
He said, “the word you’re thinking of is ‘aardvark’”.
He said, “the word you’re thinking of is ‘aardvark’.”
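
One way to handle all four patterns is to strip any closing quote marks before checking the terminator, e.g.:

```python
# Closing quote marks that may legitimately follow (or enclose) the
# final full stop; extend as needed for other typographic conventions.
QUOTES = "\"'”’“‘»«"

def ends_ok(sentence):
    core = sentence.rstrip(QUOTES)  # drop trailing quotation marks
    return core.endswith((".", "!", "?", "…"))
```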

Did you also see the English Sentence review guidelines? Discussion of new guidelines for uploaded sentence validation

Not until you mentioned it. Thanks a lot.