Question about CV Sentence Extractor quality and your experience

mkohler · September 13, 2022, 9:45pm

The initial extract was done back in 2019: https://github.com/common-voice/common-voice/commit/e4a685eab1f90d39e64e7fba39826be2f7a738cd

Back then, the Sentence Extractor had only rudimentary rules, so many of the improvements made since are not reflected in the initial extract. We have learned a lot since then, including the block list and other, more granular rules that are now possible. What we could do is run a new extract covering only the articles added since then, but that would first require an update to the French rules file in the Sentence Extractor: cv-sentence-extractor/src/rules/fr.toml in the common-voice/cv-sentence-extractor repository on GitHub.
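To give a rough idea of what a block list and more granular rules do to an extract, here is a minimal Python sketch. It is not the Sentence Extractor's actual code, and the names used (`Rules`, `min_word_count`, `max_word_count`, `blocked_words`, `disallowed_symbols`, `is_valid`) are assumptions chosen for illustration, not a copy of the keys in fr.toml.

```python
from dataclasses import dataclass, field


@dataclass
class Rules:
    # Hypothetical rule names for illustration; the real rules file may differ.
    min_word_count: int = 3
    max_word_count: int = 14
    blocked_words: set = field(default_factory=set)
    disallowed_symbols: str = "<>#@0123456789"


def is_valid(sentence: str, rules: Rules) -> bool:
    """Return True if the sentence passes every rule in this sketch."""
    trimmed = sentence.strip()
    words = trimmed.split()

    # Rule 1: reject sentences that are too short or too long.
    if not rules.min_word_count <= len(words) <= rules.max_word_count:
        return False

    # Rule 2: reject sentences containing digits or other unwanted symbols.
    if any(ch in rules.disallowed_symbols for ch in trimmed):
        return False

    # Rule 3: reject sentences containing a word from the block list.
    # Surrounding punctuation is stripped and case is ignored before comparing.
    normalized = {w.strip(".,;:!?\"'()«»").lower() for w in words}
    return normalized.isdisjoint(rules.blocked_words)


if __name__ == "__main__":
    rules = Rules(blocked_words={"hab"})
    candidates = [
        "Le chat dort sur le canapé.",
        "Né en 1987 à Paris.",
        "La commune compte mille hab.",
    ]
    for sentence in candidates:
        print(is_valid(sentence, rules), sentence)
```

A sentence extracted under the rudimentary 2019 rules could pass a check like the first one and still fail the later, stricter ones, which is why the initial extract does not reflect the newer improvements.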

Well, not in its current state. The WikiExtractor does not give us much information about each article, so being selective there won’t be possible without implementing additional API calls.
