Hey @mkohler, I’m trying out cv-sentence-extractor for Turkish and have some more-or-less technical questions.
Q1. As far as I can see, when the broken_whitespace rule is set, sentences containing substrings like " !" are filtered out. Shouldn’t it fix them instead?
Q2. I could not find any construct (e.g. a regex) that replaces multiple spaces. Am I missing it (Rust is new to me)?
Q3. I’m having trouble with apostrophes. I used cvtools’s word_usage.py, as instructed, to get the word frequencies, with the --strip-apostrophes setting. In Turkish, suffixes on proper names are separated from the name by an apostrophe, as in [Bülent’in], [Bülent’e], [Bülent’lerden], etc. (Although there is a “clean” function that replaces possible Unicode alternatives, some pass through…)
This part is the most important, as I use dictionaries and stemmers to whitelist words and derive a blacklist (implemented in an external repo).
How can I handle similar cases in cv-sentence-extractor (with respect to apostrophes)?
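
To make Q2 and Q3 concrete, here is a rough Python sketch of the behaviour I am after. It is not code from cv-sentence-extractor or cvtools: the function names normalize and strip_suffix are hypothetical, and the list of apostrophe look-alikes is my own assumption and surely incomplete.

```python
import re

# Apostrophe-like characters seen in scraped text (assumed list, not exhaustive):
# right/left single quotes, modifier letter apostrophe, acute accent, backtick, prime.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u00b4\u0060\u2032"

# Translation table mapping every variant to the plain ASCII apostrophe.
_TRANSLATION = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize(text: str) -> str:
    """Map apostrophe look-alikes to ASCII ' and collapse whitespace runs."""
    text = text.translate(_TRANSLATION)
    # Replace any run of whitespace with a single space, then trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def strip_suffix(token: str) -> str:
    """Return the part before the first ASCII apostrophe (the bare name)."""
    return token.split("'", 1)[0]
```

With this, normalize would turn "Bülent’in   evi" into "Bülent'in evi", and strip_suffix would reduce "Bülent'lerden" to "Bülent", which is the form I would then look up in the dictionary.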