Question about CV Sentence Extractor quality and your experience

The initial extract was done back in 2019: https://github.com/common-voice/common-voice/commit/e4a685eab1f90d39e64e7fba39826be2f7a738cd

Back then, the Sentence Extractor had only rudimentary rules, so the many improvements made since are not reflected in the initial extract. We have learned a lot in the meantime, including the block list and other, more granular rules that are now possible. What we could do is run a new extract of the articles added since then, but that would first require an update to the rules file in the Sentence Extractor: src/rules/fr.toml on the main branch of the common-voice/cv-sentence-extractor repository on GitHub.
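To make the idea of rules plus a block list a bit more concrete, here is a rough Python sketch of that kind of filtering. It is not how the Sentence Extractor is implemented, and the names used here (`Rules`, `min_words`, `max_words`, `disallowed_symbols`, `blocklist`) are made up for illustration; they are not the actual keys of fr.toml.

```python
from dataclasses import dataclass, field


@dataclass
class Rules:
    # Hypothetical rule names for illustration only; not the fr.toml schema.
    min_words: int = 3
    max_words: int = 14
    disallowed_symbols: frozenset = frozenset("<>{}[]=@#")
    blocklist: frozenset = field(default_factory=frozenset)


def is_acceptable(sentence: str, rules: Rules) -> bool:
    """Return True if the sentence passes every rule."""
    words = sentence.split()
    # Reject sentences that are too short or too long.
    if not rules.min_words <= len(words) <= rules.max_words:
        return False
    # Reject sentences containing any disallowed symbol.
    if any(ch in rules.disallowed_symbols for ch in sentence):
        return False
    # Compare lowercased words, stripped of surrounding punctuation,
    # against the block list.
    normalized = {w.strip(".,;:!?\"'()").lower() for w in words}
    return normalized.isdisjoint(rules.blocklist)


def filter_sentences(sentences, rules):
    """Keep only the sentences that satisfy all rules."""
    return [s for s in sentences if is_acceptable(s, rules)]
```

Tightening a rule or extending the block list in a setup like this only affects sentences that are filtered afterwards, which is why an extract made with the older rules does not benefit from later improvements.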

Well, not in its current state. The WikiExtractor does not give us much information about each article, so being selective there won't be possible without implementing additional API calls.