[Technical feedback needed] Wikipedia extractor script beta

nukeador · September 4, 2019, 12:52pm

These are great advances @txopi

You might be able to tweak the rules file and the blacklist to filter out Roman ordinals. Other people in this topic may be able to help with the regex.

Once you have a set of rules and a blacklist that produce output rated at an error rate below 7% by 2-3 native speakers, feel free to open a PR that includes the following information:

How many sentences are you getting?
How did you create the blacklist? (Specify the criteria, e.g. words with fewer than 80 repetitions.)
Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share with them a few samples of 500 random sentences from your output.
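To make the blacklist criteria and the review samples easy to reproduce, something along these lines might help. It is only a sketch in plain Python, not the extractor's own tooling: the function names are made up, words are split with a simple `\w+` pattern and lowercased (which may not suit every language), and the threshold of 80 is just the example figure from above.

```python
import random
import re
from collections import Counter

def build_blacklist(sentences, min_count=80):
    """Return words that occur fewer than min_count times in total."""
    counts = Counter(
        word
        for sentence in sentences
        for word in re.findall(r"\w+", sentence.lower())
    )
    return sorted(word for word, n in counts.items() if n < min_count)

def sample_for_review(sentences, k=500, seed=None):
    """Pick up to k sentences at random, without repeating a position."""
    rng = random.Random(seed)  # pass a seed to get the same sample again
    pool = list(sentences)
    return rng.sample(pool, min(k, len(pool)))

def error_rate(flags):
    """flags: one boolean per reviewed sentence, True if marked as wrong."""
    return sum(flags) / len(flags)
```

Each reviewer could mark the sentences in their sample as right or wrong, and `error_rate` on those marks gives the figure to compare against the 7% target.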

Cheers.
