Question about CV Sentence Extractor quality and your experience

I think with German we're below a 2% error rate at this point. You can do quite a lot with the rules and the blocklist. It's probably possible to get even lower, but that might start to cost valid sentences, and it obviously takes time to implement. In the end, no process is perfect.

You can download and run the WikiExtractor once, then change the rules and rerun only the Sentence Extractor from then on. That definitely saves time. You can also run the Sentence Extractor for just a few seconds. I'd suggest looking for patterns you can eliminate rather than going sentence by sentence, which would take too much time. Most patterns become clear after the first run, and after that it comes down to fine-tuning.

In the GitHub Action (official script) the sentence selection is randomized. Locally, however, you can change this line here to be std::usize::MAX instead of 3 to get all sentences, if that helps with the rules. Then it should be deterministic.
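
To illustrate why raising the cap removes the randomness, here is a minimal sketch. It is not the extractor's actual code: `pick_sentences`, `max_per_article`, the `seed` parameter and the tiny `Lcg` generator are all hypothetical stand-ins, assuming the tool shuffles the candidate sentences and then keeps at most a fixed number of them.

```rust
/// Tiny linear congruential generator, used here only as a stand-in
/// for a real RNG so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Hypothetical helper: shuffle the candidates, then keep at most
/// `max_per_article` of them.
fn pick_sentences(mut candidates: Vec<String>, max_per_article: usize, seed: u64) -> Vec<String> {
    let mut rng = Lcg(seed);
    // Fisher-Yates shuffle driven by the stand-in generator.
    for i in (1..candidates.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        candidates.swap(i, j);
    }
    // With a small cap, which sentences survive depends on the shuffle.
    // With usize::MAX nothing is cut off, so every sentence is kept.
    candidates.truncate(max_per_article);
    candidates
}
```

With a cap of 3, different seeds can return different sentences. With `usize::MAX` the same set comes back for every seed, although in this sketch the order can still differ.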

It basically counts how often each word occurs in the full text and then adds any word that occurs fewer times than your defined threshold to the blocklist. Try it out, but there is certainly a chance this won't work for this case. I'd love to hear how it goes.
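
The idea can be sketched roughly like this. This is only an illustration under my own assumptions, not the real script: `build_blocklist` is a hypothetical name, and lowercasing the words and trimming surrounding punctuation are choices I made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical helper: count every word in `text` and return the words
/// that occur fewer than `threshold` times, sorted alphabetically.
fn build_blocklist(text: &str, threshold: usize) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for raw in text.split_whitespace() {
        // Assumption: ignore case and punctuation around the word.
        let word = raw
            .trim_matches(|c: char| !c.is_alphanumeric())
            .to_lowercase();
        if !word.is_empty() {
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    let mut blocklist: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n < threshold)
        .map(|(word, _)| word)
        .collect();
    // Sort so the output is stable between runs.
    blocklist.sort();
    blocklist
}
```

For example, calling `build_blocklist("The cat and the dog and the xylophone.", 2)` keeps "the" and "and" out of the list and blocks "cat", "dog" and "xylophone". On a real dump the threshold needs tuning, since a low-frequency word is not necessarily a bad one.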
