Thank you for taking the time to answer my question, Bodzen. It does help!
I was a bit confused by how MCV gives Wikipedia (which is CC-BY-SA, I think) as a possible source while still enforcing the CC0 requirement. I wasn't aware of the special agreement between the two, nor of the 3-sentences-per-article rule. Thanks for clarifying!
Does that mean that any contribution from Wikipedia should go through the cv-sentence-extractor exclusively? (I guess the 3-sentence rule could easily be broken otherwise, as the sentence extractor doesn't keep track of the source article for each sentence.)
I’ve been developing my own tools for parsing and cleaning Breton corpora for a couple of years. I’ve spent a few days polishing a script that filters sentences from the Breton Wikipedia, quite severely, to make the resulting corpus sound as little like Wikipedia as possible. I use a whitelist approach (covering common words, proper nouns, foreign words, acronyms and named entities), which IMO tends to give better results for low-resource languages like Breton than a blocklist approach. I’m down to 35k sentences right now, favoring precision over recall.
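In spirit, the whitelist pass boils down to something like this (a minimal sketch, not my actual script; the file layout, tokenizer and function names are illustrative):

```python
# Minimal sketch of a whitelist filter: keep a sentence only if every
# token appears in one of the merged whitelists (precision over recall).
import re

def load_whitelists(*paths):
    """Merge one whitelist file per category (common words, proper nouns, ...)."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words

TOKEN_RE = re.compile(r"[\w'-]+", re.UNICODE)

def keep_sentence(sentence, whitelist):
    """Reject the whole sentence as soon as a single token is unknown."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return bool(tokens) and all(t in whitelist for t in tokens)
```

The strictness is the point: one out-of-vocabulary token and the sentence is dropped, which is what pushes the corpus toward precision rather than recall.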
I guess some of the filtering and cleaning rules I use could be adapted and merged into the cv-sentence-extractor, but others would be much trickier.
Take number normalization, for example. For Breton, you cannot simply use a substitution dictionary, as the order of the words can change depending on the quantity when the number is followed by a noun:
“22 labous” (22 birds) would normalize to “daou labous warn-ugent” (two birds and twenty), for example.
The correct wording depends on the gender of the noun as well, so “22 wezenn” (22 trees) would normalize to “div wezenn warn-ugent”.
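To make the gender dependency concrete, here is a toy sketch covering only the two “22” examples above (the gender table and number forms are deliberately tiny; a real normalizer needs full tables and handles far more than the 21–29 range):

```python
# Toy sketch of gender-aware number normalization for a number in the
# twenties followed by a noun. Only the data for the "22" examples is
# included; a real implementation needs complete lexicons.
NOUN_GENDER = {"labous": "m", "wezenn": "f"}   # bird (m), tree (f)
UNITS = {2: {"m": "daou", "f": "div"}}         # "two" agrees in gender

def normalize_twenties(number, noun):
    """22 + noun -> '<gendered two> <noun> warn-ugent' (unit, noun, +20)."""
    unit = number - 20                         # 22 -> 2
    gender = NOUN_GENDER[noun]
    return f"{UNITS[unit][gender]} {noun} warn-ugent"

normalize_twenties(22, "labous")  # -> "daou labous warn-ugent"
normalize_twenties(22, "wezenn")  # -> "div wezenn warn-ugent"
```

Note how the unit word lands *before* the noun and the twenty *after* it, which is exactly why a flat digit-to-word substitution table cannot work here.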
Integrating those tools into cv-sentence-extractor would be a substantial endeavor, if possible at all.
On the other hand, it would be much simpler for me to adapt my scripts so that they take at most 3 sentences per article, and add the article’s URL as a citation for every sentence (which would provide a means to check compliance with the legal terms).
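That adaptation could be as simple as the following sketch (the data shapes and names are illustrative, not my actual pipeline):

```python
# Sketch: cap extraction at 3 sentences per article and keep the source
# URL alongside each sentence for compliance checks.
import random

MAX_PER_ARTICLE = 3

def sample_with_citations(articles, seed=0):
    """articles: iterable of (url, filtered_sentences) pairs.
    Returns (sentence, url) rows, at most MAX_PER_ARTICLE per article."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    rows = []
    for url, sentences in articles:
        picked = rng.sample(sentences, min(MAX_PER_ARTICLE, len(sentences)))
        rows.extend((sentence, url) for sentence in picked)
    return rows
```

Since every output row carries its article URL, verifying the 3-sentence cap later is just a matter of counting rows per URL.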
I can definitely ask other native speakers (maybe a linguist) to review samples of the filtered corpus and measure the error rate, though.