Thank you for taking the time to answer my question, Bodzen. It does help!
I was a bit confused by how MCV gives Wikipedia (which is CC-BY-SA, I think) as a possible source while still enforcing the CC0 requirement. I wasn't aware of the special agreement between the two, nor of the 3-sentences-per-article rule. Thanks for clarifying!
Does that mean that any contribution from Wikipedia should go through the cv-sentence-extractor exclusively? (I guess the 3-sentence rule could easily be broken otherwise, as the sentence extractor doesn't keep track of the source article for each sentence.)
I’ve been developing my own tools for parsing and cleaning Breton corpora for a couple of years. I’ve spent a few days polishing a script that filters sentences from the Breton Wikipedia, quite severely, to make the resulting corpus sound as little like Wikipedia as possible. I use a whitelist approach (covering common words, proper nouns, foreign words, acronyms and named entities), which IMO tends to give better results for low-resource languages like Breton than a blocklist approach. I’m down to 35k sentences right now, favoring precision over recall.
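In spirit, the whitelist pass boils down to something like this (a minimal sketch, not my actual script; the file layout, tokenizer and function names are illustrative):

```python
# Minimal sketch of a whitelist filter: keep a sentence only if every
# token appears in one of the merged whitelists (precision over recall).
import re

def load_whitelists(*paths):
    """Merge one whitelist file per category (common words, proper nouns, ...)."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words

TOKEN_RE = re.compile(r"[\w'-]+", re.UNICODE)

def keep_sentence(sentence, whitelist):
    """Reject the whole sentence as soon as a single token is unknown."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return bool(tokens) and all(t in whitelist for t in tokens)
```

The strictness is the point: one out-of-vocabulary token and the sentence is dropped, which is what pushes the corpus toward precision rather than recall.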
I guess some of the filtering and cleaning rules I use could be adapted and merged into the cv-sentence-extractor, but others would be much trickier.
Take number normalization, for example. For Breton, you cannot simply use a substitution dictionary, as the order of the words can change depending on the quantity when the number is followed by a noun:
“22 labous” (22 birds) would normalize to “daou labous warn-ugent” (two birds and twenty), for example.
The correct wording depends on the gender of the noun as well, so “22 wezenn” (22 trees) would normalize to “div wezenn warn-ugent”.
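To make the gender dependency concrete, here is a toy sketch covering only the two “22” examples above (the gender table and number forms are deliberately tiny; a real normalizer needs full tables and handles far more than the 21–29 range):

```python
# Toy sketch of gender-aware number normalization for a number in the
# twenties followed by a noun. Only the data for the "22" examples is
# included; a real implementation needs complete lexicons.
NOUN_GENDER = {"labous": "m", "wezenn": "f"}   # bird (m), tree (f)
UNITS = {2: {"m": "daou", "f": "div"}}         # "two" agrees in gender

def normalize_twenties(number, noun):
    """22 + noun -> '<gendered two> <noun> warn-ugent' (unit, noun, +20)."""
    unit = number - 20                         # 22 -> 2
    gender = NOUN_GENDER[noun]
    return f"{UNITS[unit][gender]} {noun} warn-ugent"

normalize_twenties(22, "labous")  # -> "daou labous warn-ugent"
normalize_twenties(22, "wezenn")  # -> "div wezenn warn-ugent"
```

Note how the unit word lands *before* the noun and the twenty *after* it, which is exactly why a flat digit-to-word substitution table cannot work here.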
Integrating those tools into cv-sentence-extractor would be a substantial endeavor, if possible at all.
On the other hand, it would be much simpler for me to adapt my scripts so that they take at most 3 sentences per article, and add the article’s URL as a citation for every sentence (which would provide a means to check compliance with the legal terms).
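That adaptation could be as simple as the following sketch (the data shapes and names are illustrative, not my actual pipeline):

```python
# Sketch: cap extraction at 3 sentences per article and keep the source
# URL alongside each sentence for compliance checks.
import random

MAX_PER_ARTICLE = 3

def sample_with_citations(articles, seed=0):
    """articles: iterable of (url, filtered_sentences) pairs.
    Returns (sentence, url) rows, at most MAX_PER_ARTICLE per article."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    rows = []
    for url, sentences in articles:
        picked = rng.sample(sentences, min(MAX_PER_ARTICLE, len(sentences)))
        rows.extend((sentence, url) for sentence in picked)
    return rows
```

Since every output row carries its article URL, verifying the 3-sentence cap later is just a matter of counting rows per URL.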
I can definitely ask other native speakers (maybe a linguist) to review samples of the filtered corpus and measure the error rate, though.