Question about CV Sentence Extractor quality and your experience

mkohler · August 18, 2022, 5:32pm

I think with German we’re below 2% at this point. You can do quite a lot with the rules and the blocklist. I think it’s possible to get even further down, but that might start to cost valid sentences and obviously takes time to implement. In the end, no process is perfect.

You can download and run the WikiExtractor once, and from then on change the rules and run only the Sentence Extractor. That definitely saves time. You can also run the Sentence Extractor for just a few seconds. I would suggest looking for patterns you can eliminate rather than going sentence by sentence, which would take too much time. Most of the patterns become clear with the first run already, and after that it comes down to fine-tuning.

In the GitHub Action (official script) it’s randomized. Locally, however, you can change this line here to be std::usize::MAX instead of 3 to get all sentences, if that helps with the rules. Then it should be deterministic.
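
To illustrate why raising the limit makes the output deterministic, here is a minimal Python sketch. It is not the extractor's actual Rust code: it assumes the value 3 is a cap on how many sentences are picked per article, and the function name pick_sentences and its parameters are hypothetical.

```python
import random
import sys


def pick_sentences(sentences, max_per_article, rng):
    """Pick at most max_per_article sentences from one article.

    Hypothetical helper: assumes the limit is a per-article cap.
    """
    if max_per_article >= len(sentences):
        # Nothing to choose between: return everything in original order,
        # so the result no longer depends on the random generator.
        return list(sentences)
    # Otherwise draw a random subset, which differs from run to run.
    return rng.sample(sentences, max_per_article)


article = ["Satz eins.", "Satz zwei.", "Satz drei.", "Satz vier.", "Satz fünf."]

# With a cap of 3 the selection depends on the random generator.
limited = pick_sentences(article, 3, random.Random())

# With an effectively unlimited cap (the Python analogue of std::usize::MAX)
# every sentence is returned, in order, on every run.
everything = pick_sentences(article, sys.maxsize, random.Random())
```

With the cap removed you see every sentence the rules let through, which makes it easier to compare two runs after changing a rule.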

It basically counts how many times each word occurs in the full text and then adds every word below your defined threshold to the blocklist. Try it out, but there is certainly a chance this won’t work for this case. I would love to hear the result.
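
The idea can be sketched in a few lines of Python. This is only an illustration of the frequency threshold described above, not the actual script: the function name build_blocklist, the tokenization, and the sample text are assumptions.

```python
import re
from collections import Counter


def build_blocklist(text, min_frequency):
    """Return all words that occur fewer than min_frequency times in text."""
    # Lowercase the text and keep runs of letters only (umlauts included),
    # so digits and punctuation are not counted as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Rare words go onto the blocklist; sorting keeps the output stable.
    return sorted(word for word, count in counts.items() if count < min_frequency)


sample = "Der Hund läuft. Der Hund schläft. Der Kater miaut."
blocklist = build_blocklist(sample, 2)
print(blocklist)
```

On a real extraction the threshold would be much higher than 2. A higher threshold blocks more rare words, at the price of dropping more sentences that were actually fine.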
