This PR will add Turkish support to the Sentence Extractor - which can now be merged.
## TL;DR
It took me nearly six weeks to learn, analyze, and get some "acceptable" results. I had to run it iteratively many times (>100) to improve the results, over three months' worth of Wiki dumps. I'll leave rather detailed steps below as a record.
- Blocklist process: Explained below
- Sentence count (last example run): 347,441
- Error rate: 2.375%
The given error rate is the average of 14 samples (400 sentences each), a total of 5,600 sentences. This is about 1.62% of the total, giving a ~1.29% margin of error at a 95% confidence level ([see](https://www.surveymonkey.com/mp/sample-size-calculator/)).
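For reference, that margin can be approximated with the usual normal-approximation formula plus a finite-population correction (a quick sanity check, not necessarily the exact formula the linked calculator uses):

```python
import math

# 14 samples x 400 sentences, drawn from the 347,441-sentence final run,
# 95% confidence (z = 1.96), conservative p = 0.5.
N, n, z = 347_441, 5_600, 1.96

moe = z * math.sqrt(0.25 / n) * math.sqrt((N - n) / (N - 1))
print(f"margin of error ~ {moe:.2%}")  # ~1.30%, in line with the ~1.29% above
```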
---------
## Problem and Goal
Our main problem lies in the fact that Turkish is an agglutinative language, where a single stem can yield more than a million surface forms. Therefore, blacklisting low-frequency words does not work: those are mostly valid words with many suffixes appended, so dropping them would also drop important words and thus phonemes. Without extracting their stems and checking those against dictionaries, we could not whitelist the valid tokens so that we could concentrate on the invalid ones.
Our second problem: the stemmers. Due to the nature of the language, none of them works flawlessly, especially for non-dictionary words (places, nation names, etc.), and they often misfire (we measured an error rate of about 20%).
Our third problem was the default of 3 random sentences per article, which would not cover all possibilities. We had to work on all possible sentences, changing the code locally so that we could extract every sentence and use them to build a blacklist.
Our fourth problem was the content quality/nature of the Turkish Wikipedia. Most articles are much shorter than their English counterparts, many with little information or consisting mostly of lists. We had to try to get the most out of them with less blacklisting, e.g. by relaxing proper names.
As a result, given that the script selects sentences randomly, we worked on the whole set, mostly whitelisting tokens, in order to arrive at the best possible blacklist.
This would have meant a full scan of all sentences (>5.5M) and tokens (>7M). To make this humanly possible, we worked on the 3-word-minimum version and made use of dictionary checks (fortunately one open-source dictionary included inflected forms, although it is not complete). This, together with the automated process described below, lowered the number of tokens we had to check. Even so, forming a complete blacklist took more than a month.
Other goals we had during the process:
- We aim for a **near-zero error rate**, definitely less than 1% (we could not reach this objective because of the grammar quality of the originals and the added domain-specific jargon marked as "foreign").
- We want **longer sentences** as we already have many shorter ones (our final estimate is 5.8 sec average recording time).
- We want to **scrape as much as possible from this resource**, because CC0 sentences are rare unless we write them ourselves. From a public domain book, we can process 3-5k sentences in 2-3 days, so every sentence we can get from Wikipedia counts.
- We also wanted to keep common native and foreign proper names (people, toponyms, brands, commonly known speakable technical terms/jargon, etc.) so that we cover a large number of domains, thus increasing the vocabulary.
## Rules
We intended to extract longer sentences because as of Common Voice v14.0, the average sentence length (and thus recording duration) was low.
| Measure | v14.0 Data | Wiki-Extract From Our Last Run | Real Exported Result |
| :----- | --------: | -------: | ---------: |
| Average Recording Duration (s) | 3.595 | 5.839 | TBA |
| Median Recording Duration (s) | 3.264 | 5.600 | TBA |
| Average Characters/Sentence | 29.923 | 67.521| TBA |
| Median Characters/Sentence | 22 | 65 | TBA |
| Average Words/Sentence | 4.36 | 8.78 | TBA |
| Median Words/Sentence | 3 | 8 | TBA |
### Deciding on minimum words
> Please note: Values from different runs are not directly comparable, as the rule set and/or blacklists evolved over time.
To find the ideal point, we had to analyze the set multiple times with different limits and compare the results. We ran 2 initial round-ups for this purpose, near the middle of our black/white-listing process, also making use of dictionaries.
We found that a minimum of 3 or 4 words would be needed. We used 3 words while forming the blacklist, and finally played with `min_characters` to find a better point, maximizing the resulting recording duration.
**Decision: 3 words minimum, but the sentence should be at least 20 chars long.**
We relaxed `max_word_count` to `20` (later increased to `25`), aiming to limit the length with the newly added `max_characters` rule instead (we set it to `115` initially, but after recognizing that it counts alphabetical characters - not the string length - we dropped it to `105`; above that, the 10 s recording limit can be missed if the sentence is spoken more slowly).
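To illustrate how these limits interact, here is a rough Python sketch (for illustration only - not the extractor's actual Rust implementation; the 0.1 s/character estimate is the `char_dur` value used in the statistics further below):

```python
# Rough illustration of the length limits we settled on.
CHAR_DUR = 0.1  # estimated recording seconds per alphabetical character

def passes_length_rules(sentence: str) -> bool:
    words = sentence.split()
    alpha_len = sum(1 for ch in sentence if ch.isalpha())
    return (3 <= len(words) <= 25      # min/max word count
            and len(sentence) >= 20    # minimum sentence length (20 chars)
            and alpha_len <= 105)      # max_characters counts letters only

def estimated_duration(sentence: str) -> float:
    """0.1 s per letter: 105 letters correspond to the 10.5 s upper bound."""
    return CHAR_DUR * sum(1 for ch in sentence if ch.isalpha())
```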
## Blacklist formation
In the beginning, we started with the whole set and quickly recognized that checking 1.5M different words would be impossible. So we started working on the min_words = 3 version, also incorporating dictionaries and stemmers to eliminate known words. After this point, we had two phases:
- Use of the `Blacklist Builder` (below) iteratively to get an intermediate blacklist (work most frequent words from top to bottom)
- After the blacklist reached a size of ~130k, we scanned it manually to form the final blacklist (manually checking dictionaries, the Wikipedia pages themselves, and sometimes English-Turkish dictionaries).
(As this lengthy process took us past the one-month mark, we had to repeat it for the August 2023 dump to check an additional 3,800 words.)
### Blacklist Builder Details
The Blacklist Builder is a set of simple Python scripts we created. It runs on the whole token list (using multiprocessing and chunking) and works with the following data items:
- **Dictionary files**: Open-source dictionaries, combined in the script (we started with TDK - the Turkish Language Authority - and the Zemberek NLP toolset, and added more in the process)
- **Forced Whitelist files - base**: For forced whitelisting (extended over time). Files containing:
- Abbreviations/acronyms we expanded using "replace"
- Common male/female names and surnames in the language
- Cities and towns in the country
- Other toponyms (continents, countries, capital cities), mostly in the language
- **Forced Whitelist files - added**: For forced whitelisting. These are added during the iterative process:
- Common foreign names where local pronunciation is the same.
- Words that were correct but auto-blacklisted by the algorithm (below) because they are not in dictionaries (e.g. toponyms) or because of stemmer failures.
- **Forced Blacklist files - base**: For premature forced blacklisting.
- Single-character tokens (alphabet) which do not exist elsewhere
- Other cleaned abbreviations/acronyms (we removed valid words from TDK's official list as we work with lowercase)
- **Forced Blacklist files - added**: For forced blacklisting. Mainly collected during the iterative process to reach our goal.
We did not want words containing non-Turkish-alphabet characters to go into the blacklists, because we can already eliminate them with `allowed_symbols_regex`; this also dropped the size of the token list and the blacklist considerably.
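As an illustration of that filter (a minimal sketch - the character class below is our reading of what an `allowed_symbols_regex` restricted to the Turkish alphabet looks like, not the exact value in the rule file):

```python
import re

# The Turkish alphabet has 29 letters: no q, w, x, but ç, ğ, ı, ö, ş, ü.
# Tokens with any other letter are dropped by the extractor itself,
# so they never need to enter the blacklist.
TURKISH_TOKEN = re.compile(r"^[abcçdefgğhıijklmnoöprsştuüvyz]+$")

def is_turkish_alphabet_only(token: str) -> bool:
    return bool(TURKISH_TOKEN.match(token.lower()))

print(is_turkish_alphabet_only("ağaçlandırma"))  # True
print(is_turkish_alphabet_only("wikipedia"))     # False (contains 'w')
```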
### The Algorithm
- Prepare/combine data, remove duplicates
- Loop through `word_usage.<lc>.dict.txt` (the remaining, dictionary-filtered frequency list); a Python sketch of this loop follows the list:
- If the word is in the hard blacklist, it is already handled, else
- If the word is in the hard whitelist, it is OK, else
- If the word is in the dictionary, it is OK, else
- Get the stem word using snowball stemmer (not always possible - so later we added another stemmer and used combined results):
- Check it against the hard blacklist, if exists, blacklist it, or else
- Check it against the hard whitelist, if exists, OK, or else
- Check it against the dictionary, if exists, OK, or else
- Blacklist the word
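A minimal Python sketch of this loop, assuming the `snowballstemmer` package for the Snowball Turkish stemmer (the second stemmer we combined it with, and the actual file handling, are omitted):

```python
import snowballstemmer

# Snowball ships a Turkish stemmer; stemming is not always successful,
# which is why a second stemmer was added later (omitted here).
stemmer = snowballstemmer.stemmer("turkish")

def classify(word, dictionary, hard_whitelist, hard_blacklist, blacklist):
    """Decide whether one frequency-list word goes to the blacklist."""
    if word in hard_blacklist:
        return                        # already known as bad
    if word in hard_whitelist or word in dictionary:
        return                        # known good
    stem = stemmer.stemWord(word)     # may simply return the word unchanged
    if stem in hard_blacklist:
        blacklist.add(word)           # stem is known bad
    elif stem in hard_whitelist or stem in dictionary:
        return                        # stem is known good
    else:
        blacklist.add(word)           # nothing recognized the word

# Driver (most frequent words first):
#   for word in open("word_usage.tr.dict.txt", encoding="utf-8"):
#       classify(word.strip(), dictionary, hard_whitelist, hard_blacklist, blacklist)
```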
### Iterative process
- We used two imaginary breaking points for our iterative process:
- Part-1: Until the ~20,000th record (frequency 72), we worked more deeply, also collecting the foreign words
- Part-2: After that, until the ~100,000th record (frequency 12), we collected missed Turkish words and some important foreign ones
- Part-3: After that we let the algorithm run as it is automatically.
- During Part 1 & 2:
- We checked every word and distributed them between forced black-list and forced white-list files.
- The above algorithm was run several times, until about 10% of the words remained, to collect misfirings; we distributed the detections between hard blacklists and hard whitelists.
- For foreign words (usually proper nouns), we assessed their relevance to the country (e.g. historical figures) and their pronunciation. If a word is pronounced the same in Turkish, we whitelisted it; otherwise, the foreign word was blacklisted.
- Part-3:
- We ran the algorithm, got a blacklist file of the remaining unrecognized words, and sorted it to better visually recognize the same (missed) stem word with multiple suffixes. There were ~1M words...
- We passed through the words starting with "a" (~65k) in 5-6 hours and saw that about 18% of them were still extractable.
- **As that would result in more than anticipated time demand, we decided to change our approach (for the remaining - below freq 12, b-z)**
We decided on a more thorough analysis of the data because the list remained huge and contained many words with characters not in the alphabet.
**We should have used the rules from the start! We should not have used the `--no_check` parameter.**
So we ran the two round-ups mentioned above:
**Final Phase:** To *finally* get the full blacklist, check the results, and enhance some rules in the process.
### Results After 3rd Iteration with 2023-08 Wikipedia Dump
| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len (chars) | Tokens | Non-dict | Description |
| :----- | :-: | :---: | :--: | :--: | --------: | -------: | ---------: | ---------: | :---------- |
| max | :x: | :x: | 1 | ∞ | 5,618,313 | 110.13 | 1,953,986 | 1,541,266 | U1, no_check, no replace, no apostrophes split |
| maxs | :x: | :x: | 1 | ∞ | 5,607,441 | 111.25 | 1,586,532 | 1,172,797 | U2, no_check, +replace, +ap. split |
| maxsp | :x: | :x: | 1 | ∞ | 5,509,426 | 109.51 | 1,471,893 | 1,060,495 | U2, no_check, +replace, +bracket removal, +ap. split |
| maxr | :white_check_mark: | :x: | 1 | ∞ | 1,211,889 | 81.25 | 530,259 | 273,839 | U3, +rules (non-limiting) |
| maxrb | :white_check_mark: | :white_check_mark: | 1 | ∞ | 848,212 | 78.79 | 328,092 | 97,034 | U4, +rules (non-limiting), +BL (125k) |
| Z4r | :white_check_mark: | :x: | 4 | ∞ | 1,001,331 | 70.55 | 446,920 | 183,945 | No BL |
| Z4rb | :white_check_mark: | :white_check_mark: | 4 | ∞ | 706,169 | 69.15 | 269,451 | 31,089 | BL (126k) |
| Z4rb3s | :white_check_mark: | :white_check_mark: | 4 | 3 | 342,263 | 68.18 | 182,634 | 15,582 | BL (126k) |
| Z3r | :white_check_mark: | :x: | 3 | ∞ | 1,026,916 | 69.44 | 451,385 | 186,868 | No BL |
| Z3rb | :white_check_mark: | :white_check_mark: | 3 | ∞ | 725,908 | 67.96 | 272,042 | 31,993 | BL (126k) |
| Z3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 348,316 | 67.02 | 182,895 | 15,582 | BL (126k) |
After final adjustments to the minimum sentence length:
| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len (chars) | Tokens | Non-dict | Description |
| :----- | :-: | :---: | :--: | :--: | --------: | -------: | ---------: | ---------: | :---------- |
| F3r | :white_check_mark: | :x: | 3 | ∞ | 1,017,403 | 69.94 | 450,327 | 165,966 | +rules - No BL (all possible) |
| F3rb | :white_check_mark: | :white_check_mark: | 3 | ∞ | 720,819 | 68.54 | 267,912 | 7,204 | +BL (131k) (remaining possible) |
| F3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | **347,441** | 67.52 | 181,675 | 4,625 | +3 sentence/article |
### Simple statistics of the final run
```json
{
"lc": "tr",
"infile": "/home/bozden/GITREPO/data/wiki.tr.F3rb3s.txt",
"char_dur": 0.1,
"s_cnt": 347441,
"sentence_len": {
"tot": 23459807,
"min": 23,
"max": 129,
"avg": 67.52170008720906,
"med": 65.0,
"std": 24.050121831833685
},
"normalized_len": {
"tot": 23085498,
"min": 22,
"max": 128,
"avg": 66.44436897199812,
"med": 64.0,
"std": 24.019759366951778
},
"alpha_len": {
"tot": 20289265,
"min": 20,
"max": 105,
"avg": 58.39628886631112,
"med": 56.0,
"std": 20.94889008475279
},
"word_count": {
"tot": 3051270,
"min": 3,
"max": 25,
"avg": 8.782124159209765,
"med": 8.0,
"std": 3.2639047106477945
},
"duration": {
"tot": 563.5906944444444,
"min": 2.0,
"max": 10.5,
"avg": 5.839628886631111,
"med": 5.6000000000000005,
"std": 2.09488900847528
}
}
```
We expect 560-600 hours of single recordings from this set.
## Test sets
We first alpha-tested some samples and made a few corrections.
| No | Persona | Sample No: Error Rate (link) |
| :-: | :---------------------------------- | :------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [001: 1.00%](https://bit.ly/3NTiO3m) - [002: 3.00%](https://bit.ly/44gFLCG) |
Initial findings:
- Some proper names that are not pronounced the same way in Turkish passed the filters (they should be blacklisted).
- => We found out that we had not pulled the latest `stem_separator_regex` changes, so we had to repeat the X4rb3s run and the test generation.
- Constructs like "M-class planets" or some foreign names containing a dash might cause problems while reading.
- => Added `-` to `stem_separator_regex`
For a population size of ~350,000, a 95% confidence level and a 2% margin of error require a sample size of 2,385.
Rounding this up to 2,400, we created 6 non-intersecting sets of 400 sentences each.
For this, we used the 4-word-minimum, 3-sentences-per-article output as the population and offered the sets to volunteers via translated/enhanced Excel sheets.
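The sample-size figure above can be reproduced with the standard formula (a quick check using the usual normal approximation with a finite-population adjustment):

```python
import math

# 95% confidence (z = 1.96), 2% margin of error, worst-case p = 0.5,
# adjusted for a finite population of ~350,000 sentences.
N, z, e, p = 350_000, 1.96, 0.02, 0.5

n0 = z**2 * p * (1 - p) / e**2   # ~2401 for an infinite population
n = n0 / (1 + (n0 - 1) / N)      # finite-population adjustment
print(math.ceil(n))              # 2385 -> rounded up to 2,400 (6 x 400)
```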
| No | Persona | Sample No (link): Errors / Rate |
| :-: | :---------------------------------- | :---------------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [01](https://bit.ly/3NTiO3m) 7 / 1.75% & [02](https://bit.ly/44gFLCG) 5 / 1.25% |
| 2 | Ret. radio speaker (TR) | [03](https://bit.ly/44iYIVm) 4 / 1.00% & [09](https://bit.ly/443Vci0) 4 / 1.00% |
| 3 | Ret. pharmacist (TR/little EN) | [04](https://bit.ly/3NqfECF) 15 / 3.75% & [08](https://bit.ly/4301my9) 11 / 2.75% |
| 4 | High school student (TR/EN) | [05](https://bit.ly/441EIGW) 14 / 3.50% |
| 5 | AI expert (TR/EN) | [06](https://bit.ly/3PzdxyP) 6 / 1.50% |
| 6 | Art historian (TR/little EN) | [07](https://bit.ly/3NSgxFp) 16 / 4.00% |
| 7 | Computer Engineer (TR/EN) | [10](https://bit.ly/3py3KPd) 4 / 1.00% |
Total errors / Total Sentences - Error rate: 86 / 4,000 - 2.15%
- Many of the errors are because of the low language/editor quality of the articles themselves, which just cannot be prevented with a word-level blocklist.
- The second major source is that some volunteers did not use the Common Voice system, and we did not give enough information about "readability"; they evaluated the sentences like an editor preparing a text for print (e.g. "it would be better to put a comma here"). Therefore we re-scanned the results and corrected some.
- Also, some sentences become hard to understand when disconnected from their original context.
After more iterations & black/whitelisting, using samples from the min_words = 3 output:
| No | Persona | Sample No (link): Errors / Rate |
| :-: | :---------------------------------- | :---------------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [11](https://bit.ly/44MBOGh) 11/ 2.75% |
| 2 | Ret. radio speaker (TR) | [12](https://bit.ly/3PPpuAM) 7/ 1.75% |
| 3 | Ret. pharmacist (TR/little EN) | [13](https://bit.ly/44phVFe) 15/ 3.75% |
| 7 | Computer Engineer (TR/EN) | [15](https://bit.ly/3NRYHRx) 14/ 3.50% |
Total errors / Total Sentences - Error rate: 47 / 1,600 - 2.94%
Overall, the error rate becomes 133 / 5,600 = 2.375%.
## Code additions/fixes
During the course, we needed some code changes/additions:
- Fix the `--strip-apostrophes` code for non-standard apostrophes in [cvtools](https://github.com/dabinat/cvtools/pull/8)
- Add `max_characters` rule to [cv-sentence-extractor](https://github.com/common-voice/cv-sentence-extractor/pull/183)
- Add `stem_separator_regex` rule for enabling stem-word extraction from apostrophe suffixed words in [cv-sentence-extractor](https://github.com/common-voice/cv-sentence-extractor/pull/187)
- Add `bracket_removal_list` rule: To remove parentheses/brackets and the content inside them from a sentence
Suggested additions:
- Rule: `replace_unicode` to regex-replace identical-looking characters from other Unicode blocks - e.g. text typed on Cyrillic keyboards. We had a lot of these and had to use the replace list instead. Many Turkic countries use such keyboards, and some articles from other Turkic-language Wikipedias were translated via campaigns. We found ~75k such words affecting ~25k sentences.
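A minimal sketch of what such a replacement could look like (the mapping below covers only a handful of common Cyrillic homoglyphs and uses a simple translation table rather than a regex; it is our illustration, not a proposed complete rule):

```python
# Map a few Cyrillic letters that render identically to Latin ones
# (e.g. U+0430 'а' vs U+0061 'a') back to their Latin equivalents.
CYRILLIC_TO_LATIN = str.maketrans({
    "а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x", "у": "y",
    "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H",
    "О": "O", "Р": "P", "С": "C", "Т": "T", "Х": "X",
})

def normalize_homoglyphs(sentence: str) -> str:
    return sentence.translate(CYRILLIC_TO_LATIN)

# "Аnkara" written with a Cyrillic 'А' becomes plain "Ankara"
print(normalize_homoglyphs("Аnkara"))
```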