[Technical feedback needed] Wikipedia extractor script beta

I’m aware that tokenization can be pretty bad. This is how I solved it for German: https://github.com/Common-Voice/common-voice-wiki-scraper/blob/master/src/rules/german.toml#L22. If you have any good ideas for this, please add them to issue #11 (“Improve sentence separation”) in common-voice/cv-sentence-extractor.
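
To make the sentence separation problem concrete, here is a minimal Python sketch of one common idea: do not break after a period when the token in front of it is a known abbreviation. This is only an illustration and not what the linked German rule or the extractor itself does; the `ABBREVIATIONS` set and the `split_sentences` function are hypothetical names invented for this example.

```python
import re

# Hypothetical abbreviation list for illustration only; not taken from german.toml.
ABBREVIATIONS = {"z.B.", "bzw.", "ca.", "Dr.", "Nr."}


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace,
    unless the token ending at that punctuation is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Candidate sentence runs from the last break up to and including the punctuation.
        candidate = text[start:match.start() + 1]
        words = candidate.split()
        last_token = words[-1] if words else ""
        if last_token in ABBREVIATIONS:
            continue  # not a real sentence end, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last break is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Das kostet ca. 5 Euro. Dr. Meier kommt z.B. morgen! Gut."
    for sentence in split_sentences(sample):
        print(sentence)
```

A fixed list like this will never be complete, so it would only ever be one of several ideas worth comparing.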

It’s hard to get people to do that, so let’s avoid anything like that as far as possible.

Let’s do our best to avoid that.

IMHO it’s fine if you stop it after some time. I don’t think we need the full output for verification before we merge and run it officially.

Given the above, I don’t think this is needed. There is, however, issue #18 (“Automatically extract sentences for validation for locales config files in a pipeline”) in common-voice/cv-sentence-extractor, but I doubt we can run that on the full export and still provide timely feedback. For smaller chunks we probably can, and IMHO that should be enough feedback for validation.

I’ll put this on my list again. It will be done after issue #9 (“Convert symbols / abbreviations to words”) and issue #14 (“Dupe detection”) in common-voice/cv-sentence-extractor.