Question about CV Sentence Extractor quality and your experience

To be fair, the Wiki extract for French was done a long time ago, before there were further possibilities to create more granular rules.

Of course! 2 would be totally fine as well, though it basically means fewer sentences overall. Any post-removal also reduces the count: the extractor tries to get 3 sentences per article that match all the rules, and keeps trying until there are no more sentences left for that article. A sentence removed afterwards is not replaced by another one from the same article.
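To illustrate why rules at extraction time keep more sentences than removal afterwards, here is a rough Python sketch. It is not the actual extractor code: `passes_rules`, `extract`, the word-count bounds, and the blocked word `xyzzy` are all made up for the example, and sentences are tried in list order to keep the result reproducible.

```python
MAX_PER_ARTICLE = 3  # per-article limit mentioned above


def passes_rules(sentence, blocklist):
    """Hypothetical stand-in for the rule set: word-count bounds plus a blocklist."""
    words = sentence.lower().split()
    if not 3 <= len(words) <= 14:  # assumed bounds, for illustration only
        return False
    return not any(word.strip(".,!?") in blocklist for word in words)


def extract(article_sentences, blocklist):
    """Keep trying sentences until 3 pass the rules or none are left."""
    picked = []
    for sentence in article_sentences:
        if len(picked) == MAX_PER_ARTICLE:
            break
        if passes_rules(sentence, blocklist):
            picked.append(sentence)
    return picked


article = [
    "The river flows through the old town.",
    "The xyzzy was founded in the valley.",  # contains the blocked word
    "Many people visit the museum every year.",
    "The bridge was rebuilt after the flood.",
    "Local markets open early on Saturdays.",
]
blocklist = {"xyzzy"}

# Blocklist applied during extraction: the bad sentence is skipped
# and the next valid sentence takes its slot.
with_rules = extract(article, blocklist)

# No blocklist during extraction, removal afterwards: the slot is lost.
unfiltered = extract(article, set())
post_removed = [s for s in unfiltered if passes_rules(s, blocklist)]

print(len(with_rules), len(post_removed))
```

With the blocklist in place during extraction, the article still yields 3 sentences. With removal afterwards it yields only 2, because nothing backfills the removed one.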

I would still recommend doing the rules/blocklist first and seeing where this gets you. Generally your approach sounds doable to me, though typo fixes would make it very hard to review whether we stay within the legal limit. If it’s only removals, this should be easy to review, and might be something we could do. But I’m not a lawyer, and @heyhillary would need to jump in to coordinate this. Overall, if legal says it’s ok, I’m ok with it, but I’m really not a fan of that approach, because it means that future exports of new articles can’t be done automatically.

Does that make sense?
