Question about CV Sentence Extractor quality and your experience

Until now I kept myself away from Wiki* and thus the Sentence Extractor, but I’m running out of resources (sigh)…

I scanned some random samples in the Turkish “Vikipedi” and found that many of them are off-topic for Common Voice: many foreign names, chemical substance names, short entries that just give a list (e.g. a football player’s matches), etc.

Here, I see many tools such as blacklists and/or vocabulary lists, but as far as I can see, producing good results with them would need a considerable time investment and trial and error.

We have around 500k entries on “Vikipedi”, which could yield 1.5M sentences, but scanning them is nearly impossible with our current manpower… If quality sentences come out of this, it would solve half of our problems for years to come.

I want to hear from those who used this process:

  • Do you get good results if you invest the time?
  • Which parts of the rules are most important?
  • Can you replace a “bad sentence” with a better one and/or exclude that sentence?
  • Any other advice?

PS: @mkohler forwarded me to ask the question here…


Thanks for bringing this up and creating a Discourse post instead of an issue.

I would suggest having a look at the DE rules file; we spent quite some time improving it recently and I think it’s in quite a good state these days. It also uses the current best practices.

No, because we need to guarantee the legal constraints around this are met. The official script will be run once the rules file is merged, so no changes are possible after that.

And overall we want < 5% error rate, so quality should not be too bad. There will be complicated sentences slipping through, but in most cases this is outweighed by the benefit of getting quite a lot of new sentences. This can also be tweaked with the blocklist, depending on how many occurrences you set as the threshold. It needs quite some time to get right, but it’s worth it.

1 Like

Thanks again… And sorry for posting on GitHub without seeing the question template.

And overall we want < 5% error rate …

I find this value too high. For me, even 1% is high: that would add 15k bad sentences to our current 50k… Many of our contributors come from higher education and would complain about it.

So, to reach the ideal, you need to run a clone locally many times, playing with the rules until you are satisfied… That would mean checking the 1.5M results (a sample, of course) multiple times.

Some follow-up questions:

  • Are the results deterministic in consecutive runs - or randomized?
  • Turkish is an agglutinative language, which means a single root can produce millions of generated words. Does this software work on root words (i.e. does it use a stemmer?) or just on strings of characters (e.g. in the vocabulary, blacklists, etc.)?

I think with German we’re below 2% at this point. In the end you can do quite a lot with the rules and the blocklist. I think it’s possible to get even further down, but that might start to cost valid sentences and obviously needs time to implement. In the end, no process is perfect.

You can download and run the WikiExtractor once, and from then on change the rules and only re-run the Sentence Extractor. That definitely saves time. You can also run the Sentence Extractor for just a few seconds. In the end I would suggest looking at patterns you can eliminate rather than going sentence by sentence; that would take too much time. Most of the patterns become clear with the first run already, and then it comes down to fine-tuning.

In the GitHub Action (official script) it’s randomized. However locally you can change this line here to be std::usize::MAX instead of 3 to get all sentences, if that helps with the rules. Then it should be deterministic.

It basically checks how many times a certain word exists in the full text and then adds any words below your defined threshold to the block list. Try it out, but there certainly is a chance this won’t work for this case. Would love to hear the result of this.
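The frequency-threshold idea described above can be sketched in a few lines. This is an illustrative Python version, not the actual extractor code (which is written in Rust); the threshold value is an assumption you would tune per language:

```python
from collections import Counter
import re

def build_blocklist(text, threshold=80):
    """Collect words that occur fewer than `threshold` times in the
    full extracted text; rare words are often typos or foreign names."""
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n < threshold)

corpus = "the the the cat cat xylograph"
print(build_blocklist(corpus, threshold=2))  # → ['xylograph']
```

Raising the threshold blocks more foreign or misspelled words, but also starts to remove valid low-frequency vocabulary, which is the tradeoff mentioned later in the thread.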

1 Like

Thank you for all this info @mkohler, I’ll give it a shot in the following weeks…

Actually, many agglutinative languages have the same problem: a single word here can become a long sentence in English, for example, similar to German’s compound words. That is also a main hindrance for n-gram-based language models once they start getting big. Also, to measure our “vocabulary” we need to use a stemmer, e.g. while checking how much of the dictionary we cover in CV/Sentence Collector.

haha, it’s been years since I’ve last heard that. In all seriousness, of course the block list will filter out valid words, but it will also filter out hard to pronounce foreign words. In the end it’s a tradeoff.

@mkohler, the following issue and related PR gave me an idea on the workflow.
The problem they are facing comes from a Wikipedia import, as far as I can see…

Suggested workflow

  • We work on the rules/blacklists etc. and make them as good as possible.
  • Wikipedia already gives permission for 3 sentences/article max; I’d say they would allow 2 in some :slight_smile: These get extracted, but instead of being imported automatically, they are put into a wiki-drafts repo.
  • The community can scan these and make deletion PRs, and maybe some typo-correction PRs. No addition PRs will be allowed. We can even write some additional validation scripts to highlight possibly problematic sentences (e.g. by locating proper names).
  • When the checks are complete, that file gets moved to the main repo so the sentences can be recorded.
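The validation-script idea in the third bullet could start as small as this: a hedged Python sketch using one naive heuristic (a capitalized token that is not sentence-initial may be a proper name; Turkish uppercase letters are included in the character class). A real script would need more rules:

```python
import re

def flag_proper_names(sentence):
    """Return capitalized tokens that are not sentence-initial --
    a rough heuristic for highlighting possible proper names."""
    tokens = sentence.split()
    return [t for t in tokens[1:] if re.match(r"[A-ZÇĞİÖŞÜ]", t)]

# Sentences with hits would be surfaced to reviewers:
print(flag_proper_names("Bu oyuncu Manchester United takımında oynadı."))
# → ['Manchester', 'United']
```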

What do you think? Probably you already thought/tried similar stuff over the years…

But IMHO no blacklist or regex rule can replace a human, and I think the community should be able to intervene…

A 5% error rate in 1.5M sentences makes 75,000, which is 50% more than I was able to add over the whole last year, reading line by line multiple times to keep the quality high. And it seems from the discussions that once sentences get in, they cannot come out (except perhaps for legal issues)…

1 Like

To be fair, the Wiki extract for French was done a long time ago, before there were further possibilities to create more granular rules.

Of course! 2 would be totally fine as well, though it’s basically a loss in sentences overall. Any post-removal will mean fewer sentences, as the extractor tries to get 3 per article that match all the rules and keeps trying until there are no more sentences left for that article.

I would still recommend doing the rules/blocklist and see where this brings you. Generally your approach sounds doable to me, though typo fixes would make it very hard to review while guaranteeing the legal limit. If it’s only removals, this should be easy to review and might be something we could do. But I’m not a lawyer, and @heyhillary would need to jump in to coordinate this. Overall, if legal says it’s ok, I’m ok with it, but generally I’m really not a fan of that approach because that means that future exports on new articles can’t be done automatically.

Does that make sense?

1 Like

I’m aware that the import is from long ago and has nothing to do with the tool. But it is a good example of how things can go bad.

I would still recommend doing the rules/blocklist and see where this brings you.

That’s for sure, you need to get a good set to review in any case.

But generally I’m really not a fan of that approach because that means that future exports on new articles can’t be done automatically.

I think they can go through the same workflow, with something like wiki_delta_yyyymmdd.txt.

Additionally, it can be made optional. If a community thinks 1-2% bad sentences is acceptable, they can import automatically.

The problem with languages that have a low number of Wikipedia articles (i.e. less volunteerism in their culture, less computer literacy, etc., which also means they are low on CC0 corpora and would need the extractor the most) is that the existing articles are mostly translated versions of English ones.

Besides proper names: as English is the common language in most technical areas, related terminology is either copied verbatim from the English wording or badly replaced by invented words in the local language. IMHO, those sentences should not be part of the CV text corpus… I’m not sure if Latin names of chemical/biological compounds in an article can pass the provided filters, but many foreign proper names will pass.

…Unless CV introduces domain-specific corpora as planned in the 2022 roadmap. And if domain-specific corpora become part of CV, we would need to find a way of tagging existing sentences/recordings, probably also affecting the whole toolset, from S.C. to S.Ex.

Until then, those kinds of sentences should not be in there, IMHO.


So better workflow suggestion:

This could easily be an extension to SC, I think… a “Wiki QC” menu item… If two people say OK, then it is OK… Exports can go automatically into it… (sorry that the suggestion means more work for you)…

But that would not be automatic :wink:

Removals are easy. If the diff is all red, i.e. removals only, then I don’t see any legal issue at all and it doesn’t need a full review.

The problem arises with typo fixes, then every single change needs to be reviewed from a legal perspective to make sure it’s really just a typo fix.

Neither of those two would fit into the manual review flow through Sentence Collector, IMHO. So, as you suggest, one solution could be to review the whole extract. That could indeed be done through Sentence Collector, but really, I don’t think this scales well.

@mkohler : I’m interested in any information you have about the list of WP articles used for the French dataset and their revisions. This information seems unrecoverable from the repositories alone.

Regarding French, I made some statistics on the “proper nouns / common words” ratio, with a baseline between 9 and 15%:

  • 10.1% for the deputies transcripts
  • 11.4% for wikisource
  • 14% for theater
  • 10.6% for Gutenberg average

In the case of Gutenberg, almost 10% of the rarest words came from a limited set of books (old French or biology dictionaries, for example).

I don’t have the number for Wikipedia, but the percentage was not only higher; the proper nouns were also far stranger/more foreign than in the above datasets, which tend to contain Latin-derived ones.

I still think the overall “proper nouns / common words” ratio is quite high.

My suggestions when it comes to whitelists and filtering:

  • Select new books with care (the risks are lexicons, specialized dictionaries, …)
  • Select Wikipedia articles with care (preferably use the ones that are not mere translations; otherwise you may end up with a lot of Pokémon, Dumbledore, or Counter-Strike, but fewer elements representative of the Turkish language)
  • Get a good extended Turkish whitelist of toponyms
  • Get a Turkish lemma dictionary containing all the possible “conjugations” you’d have to preserve. For French we have Morphalou, containing 159,271 lemmas and 976,570 inflected forms of said lemmas.
1 Like

@drzraf, thank you… We have online dictionaries and dumps of them to work with, but only the roots of words, of course. It is nearly impossible to get a list of all possible “valid/in-use” words with all possible suffixes. We usually go the other way: take a word, use a stemmer to find the stem, and check it against the dictionary/whitelist. This is the only meaningful way for Turkish.
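That reverse approach can be illustrated with a hedged Python sketch. The `stem` function here is a placeholder (a real Turkish stemmer must handle vowel harmony, consonant mutation, and chained suffixes), and the dictionary is a toy example:

```python
DICTIONARY = {"ev", "kitap", "git"}  # root words from a dictionary dump

def stem(word):
    # Placeholder: strips at most one suffix from a short, made-up list;
    # real stemmers are far more involved.
    for suffix in ("lerde", "larda", "ler", "lar", "de", "da"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def is_known(word):
    """Reduce a surface form to its stem, then check the whitelist."""
    return stem(word.lower()) in DICTIONARY

print(is_known("evlerde"))   # ev + ler + de → True
print(is_known("zxqword"))   # not derivable from any root → False
```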

Can we even do that?

The initial extract was done back in 2019:

Back then, the Sentence Extractor only had rudimentary rules, so many improvements made since then are obviously not reflected in the initial extract. There have been a lot of learnings since, including the blocklist and other more granular rules that are now possible. What we could do is a new extract of the articles added since then, but that would first require an update to the rules file in the Sentence Extractor:

Well, not at the current state. The WikiExtractor does not give us much information about the article, so being selective there won’t be possible without implementing additional API calls.

In general I would be very careful about mass-adding sentences. Adding 1.5M sentences might seem good, but there is really no way to quality-control them, and then you end up with issues around things that are unpronounceable, boring, repetitive, or unnatural-sounding.

Probably it is better to add them in batches of at most 10% more than the existing amount; this way you won’t dilute the ones that are already there. Practically and theoretically speaking, it should not be necessary to have tens of millions of unique sentences to make a good ASR model. This partly comes from the “single recording per sentence” limit, which is problematic for other reasons.

Well, not at the current state. The WikiExtractor does not give us much information about the article, so being selective there won’t be possible without implementing additional API calls.

WikiExtractors (at least the ones I am familiar with, which work from the pages-articles.xml.bz2 dumps) give you access to the article title, so this should be possible.
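For example, a streaming pass over such a dump can pull out titles without loading everything into memory. A sketch assuming a standard MediaWiki pages-articles XML layout (real dumps use an XML namespace, hence the `endswith` check):

```python
import bz2
import xml.etree.ElementTree as ET

def iter_titles(path):
    """Stream article titles out of a pages-articles.xml.bz2 dump,
    so articles could be filtered by title before extraction."""
    with bz2.open(path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.endswith("title"):
                yield elem.text
                elem.clear()  # free memory as we go
```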

1 Like

As you know, until now I did this very carefully, checking the repeated-recording statistics (keeping it around 2x) and adding text in 5k-10k batches, thanks to Sabahattin Ali :slight_smile: I have about 25-30k more, but the next suitable writer here won’t enter the public domain until 2024, so it will not suffice.

The main problem is vocabulary. We need some alternative texts to obtain it, and Wikipedia is one of the options.

As the preference is “everyday conversations”, I think the only other long-lasting resource option is to create a multi-domain chat application for this purpose and run campaigns around it. I was waiting for the “domain-specific” feature in CV, but it seems it will take some time.

We managed to double/triple our resources in the last year (2x voices, 3-4x sentences, 3x recordings), but because of the asymptotic effect, the training results are not improving as they did a year before. A linear increase is not good enough; you need to double the resources in each time period to get near the goal (WER < 10%)… So, if I want double the existing amount, the sentences should also double.

Yes, you’re right. However, this sounds to me like it could have legal implications (we would somehow need to keep track of which articles were exported and which were not, at least). And we’d probably need to use IDs rather than titles to make sure we do not re-extract an article when its title changes. I also wonder how maintainable this would be. Something or someone would still need to come up with a list of articles, rather than relying on categories or similar, which is what I was initially thinking about (sorry!).

Sorry for reviving a year-old topic, but I wanted to thank you all for your input.

It took me about 6-7 weeks to properly form a blacklist and fine-tune the rules.
To implement what @drzraf suggested, I had to write a bunch of Python scripts; I mainly used the “whitelist before blacklist” approach and ran it iteratively 100+ times. As a side note: stemmers and lemmatizers also produce ~15-20% wrong results, so I had to re-re-re-clean those as well.
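A hedged sketch of what “whitelist before blacklist” means in practice (hypothetical names; the actual scripts are more involved): an explicitly whitelisted word is kept even when the blacklist would otherwise catch it, and the leftovers feed the next review iteration:

```python
def classify(words, whitelist, blacklist):
    """Whitelist wins over blacklist; leftovers go to manual review."""
    kept, blocked, review = [], [], []
    for w in words:
        if w in whitelist:
            kept.append(w)        # explicitly approved
        elif w in blacklist:
            blocked.append(w)     # explicitly rejected
        else:
            review.append(w)      # decide in the next iteration
    return kept, blocked, review

kept, blocked, review = classify(
    ["ev", "pokemon", "zxq"],
    whitelist={"ev"},
    blacklist={"pokemon"},
)
print(kept, blocked, review)  # ['ev'] ['pokemon'] ['zxq']
```

Each iteration shrinks the review pile by promoting words into one of the two lists, which is why it can take 100+ passes to converge.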

Unfortunately, it resulted in only 347k sentences (I was expecting 1.5M). It seems most of the “articles” in Turkish Wikipedia are not proper articles, just lists of names, places, and such. The main article groups are Islam, sports, medicine, animals & plants, history, Turkic states, cities, and even small towns… There were too many Arabic- and Latin-based words (mostly misspelled), way too many typos, and “invented” words that appear to stem from poor language knowledge.

The random/non-deterministic selection forced me to work on the whole set of possible sentences (3 words or more, with a certain minimum sentence length). And due to the agglutinative nature of Turkish, I could not strip out low-frequency words, as most of the correct words have a frequency of 1.

I could not get the error rate below 1%; our rather extensive testing puts it between 2% and 3%. This is due to many sentences with bad grammar caused by bad translations, bad punctuation, incomplete sentences, many Google-translated articles where sentences make no sense, badly borrowed words from other languages, words unknown to the testers (because I whitelisted them to get more domains), etc.

One thing I did was to allow words from multiple domains, such as medicine, as Common Voice does not seem to be getting a domain-specific corpora feature in the foreseeable future.

So, the bulk-adding issue that @ftyers pointed out became less important. It will take a couple of years at our current speed, and I’ll be adding sentences from other sources as well to dilute these…

I documented the process rather extensively with the PR, for people who will be working on cv-sentence-extractor in the future.

Thank you again @mkohler, @drzraf, and @ftyers… It was a pain :rofl:

Don’t apologize for reopening a topic of conversation. This was so interesting and edifying to read about. I’ll dig into the PR documentation as well, thank you!

1 Like