This PR will add Turkish support to the Sentence Extractor - which can now be merged.
## TL;DR
It took me nearly six weeks to learn, analyze, and get some "acceptable" results. I had to run it iteratively many times (>100) to improve the results, over three months' worth of Wiki dumps. I'll leave rather detailed steps below as a record.
- Blocklist process: Explained below
- Sentence count (last example run): 347,441
- Error rate: 2.375%
The given error rate is the average of 14 samples (400 sentences each), a total of 5,600 sentences. This is about 1.62% of the total, giving a ~1.29% margin of error at a 95% confidence level ([see](https://www.surveymonkey.com/mp/sample-size-calculator/)).
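For reference, that margin can be approximated with the usual normal-approximation formula plus a finite-population correction (a quick sanity check, not necessarily the exact formula the linked calculator uses):

```python
import math

# 14 samples x 400 sentences, drawn from the 347,441-sentence final run,
# 95% confidence (z = 1.96), conservative p = 0.5.
N, n, z = 347_441, 5_600, 1.96

moe = z * math.sqrt(0.25 / n) * math.sqrt((N - n) / (N - 1))
print(f"margin of error ~ {moe:.2%}")  # ~1.30%, in line with the ~1.29% above
```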
---------
## Problem and Goal
Our main problem lies in the fact that Turkish is an agglutinative language, where a single stem can yield more than a million surface forms. Therefore, blacklisting low-frequency words does not work: those are mostly valid words with many suffixes appended, so dropping them would also drop important words and thus phonemes. Without extracting their stems and checking those against dictionaries, we could not whitelist the valid tokens so that we could concentrate on the invalid ones.
Our second problem: the stemmers. Due to the nature of the language, none of them works flawlessly, especially for non-dictionary words (places, nation names, etc.), and they often misfire (we measured an error rate of about 20%).
Our third problem was the default of 3 random sentences per article, which would not cover all possibilities. We had to work on all possible sentences, changing the code locally so that we could extract every sentence and use them to build a blacklist.
Our fourth problem was the content quality/nature of the Turkish Wikipedia. Most articles are much shorter than their English counterparts, many with little information or consisting mostly of lists. We had to try to get the most out of them with less blacklisting, e.g. by relaxing proper names.
As a result, given that the script selects sentences randomly, we worked on the whole set, mostly whitelisting tokens, in order to arrive at the best possible blacklist.
This would have meant a full scan of all sentences (>5.5M) and tokens (>7M). To make this humanly possible, we worked on the 3-word-minimum version and made use of dictionary checks (fortunately one open-source dictionary included inflected forms, although it is not complete). This, together with the automated process described below, lowered the number of tokens we had to check. Even so, forming a complete blacklist took more than a month.
Other goals we had during the process:
- We aim for a **near-zero error rate**, definitely less than 1% (we could not reach this objective because of the grammar quality of the originals and the added domain-specific jargon marked as "foreign").
- We want **longer sentences** as we already have many shorter ones (our final estimate is 5.8 sec average recording time).
- We want to **scrape as much as possible from this resource**, because CC0 sentences are rare unless we write them ourselves. From a public domain book, we can process 3-5k sentences in 2-3 days, so every sentence we can get from Wikipedia counts.
- We also wanted to keep common native and foreign proper names (people, toponyms, brands, commonly known speakable technical terms/jargon, etc.) so that we cover a large number of domains, thus increasing the vocabulary.
## Rules
We intended to extract longer sentences because as of Common Voice v14.0, the average sentence length (and thus recording duration) was low.
| Measure | v14.0 Data | Wiki-Extract From Our Last Run | Real Exported Result |
| :----- | --------: | -------: | ---------: |
| Average Recording Duration (s) | 3.595 | 5.839 | TBA |
| Median Recording Duration (s) | 3.264 | 5.600 | TBA |
| Average Characters/Sentence | 29.923 | 67.521| TBA |
| Median Characters/Sentence | 22 | 65 | TBA |
| Average Words/Sentence | 4.36 | 8.78 | TBA |
| Median Words/Sentence | 3 | 8 | TBA |
### Deciding on minimum words
> Please note: Values from different runs are not directly comparable, as the rule set and/or blacklists evolved over time.
To find the ideal point, we had to analyze the set multiple times with different limits and compare the results. We ran 2 initial round-ups for this purpose, near the middle of our black/white-listing process, also making use of dictionaries.
We found that a minimum of 3 or 4 words would be needed. We used 3 words while forming the blacklist, and finally played with `min_characters` to find a better point, maximizing the resulting recording duration.
**Decision: 3 words minimum, but the sentence should be at least 20 chars long.**
We relaxed `max_word_count` to `20` (later increased to `25`), aiming to limit the length with the newly added `max_characters` rule instead (we set it to `115` initially, but after recognizing that it counts alphabetical characters - not the string length - we dropped it to `105`; above that, the 10 s recording limit can be missed if the sentence is spoken more slowly).
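To illustrate how these limits interact, here is a rough Python sketch (for illustration only - not the extractor's actual Rust implementation; the 0.1 s/character estimate is the `char_dur` value used in the statistics further below):

```python
# Rough illustration of the length limits we settled on.
CHAR_DUR = 0.1  # estimated recording seconds per alphabetical character

def passes_length_rules(sentence: str) -> bool:
    words = sentence.split()
    alpha_len = sum(1 for ch in sentence if ch.isalpha())
    return (3 <= len(words) <= 25      # min/max word count
            and len(sentence) >= 20    # minimum sentence length (20 chars)
            and alpha_len <= 105)      # max_characters counts letters only

def estimated_duration(sentence: str) -> float:
    """0.1 s per letter: 105 letters correspond to the 10.5 s upper bound."""
    return CHAR_DUR * sum(1 for ch in sentence if ch.isalpha())
```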
## Blacklist formation
In the beginning, we started with the whole set and quickly recognized that checking 1.5M different words would be impossible. So we started working on the min_words = 3 version, also incorporating dictionaries and stemmers to eliminate known words. After this point, we had two phases:
- Use of the `Blacklist Builder` (below) iteratively to get an intermediate blacklist (work most frequent words from top to bottom)
- After the blacklist reached a size of ~130k, we scanned it manually to form the final blacklist (manually checking dictionaries, the Wikipedia pages themselves, and sometimes English-Turkish dictionaries).
(As this lengthy process took us past the one-month mark, we had to repeat it for the August 2023 dump to check an additional 3,800 words.)
### Blacklist Builder Details
The Blacklist Builder is a set of simple Python scripts we created. It runs on the whole token list (using multiprocessing and chunking) and works with the following data items:
- **Dictionary files**: Open-source dictionaries, combined in the script (we started with TDK - the Turkish Language Authority - and the Zemberek NLP toolset, and added more in the process)
- **Forced Whitelist files - base**: For forced whitelisting (extended over time). Files containing:
- Abbreviations/acronyms we expanded using "replace"
- Common male/female names and surnames in the language
- Cities and towns in the country
- Other toponyms (continents, countries, capital cities), mostly in the language
- **Forced Whitelist files - added**: For forced whitelisting. These are added during the iterative process:
- Common foreign names where local pronunciation is the same.
- Words that were correct but auto-blacklisted by the algorithm (below) because they are not in dictionaries (e.g. toponyms) or because of stemmer failures.
- **Forced Blacklist files - base**: For premature forced blacklisting.
- Single-character tokens (alphabet) which do not exist elsewhere
- Other cleaned abbreviations/acronyms (we removed valid words from TDK's official list as we work with lowercase)
- **Forced Blacklist files - added**: For forced blacklisting. Mainly collected during the iterative process to reach our goal.
We did not want words containing non-Turkish-alphabet characters to go into the blacklists, because we can already eliminate them with `allowed_symbols_regex`; this also dropped the size of the token list and the blacklist considerably.
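As an illustration of that filter (a minimal sketch - the character class below is our reading of what an `allowed_symbols_regex` restricted to the Turkish alphabet looks like, not the exact value in the rule file):

```python
import re

# The Turkish alphabet has 29 letters: no q, w, x, but ç, ğ, ı, ö, ş, ü.
# Tokens with any other letter are dropped by the extractor itself,
# so they never need to enter the blacklist.
TURKISH_TOKEN = re.compile(r"^[abcçdefgğhıijklmnoöprsştuüvyz]+$")

def is_turkish_alphabet_only(token: str) -> bool:
    return bool(TURKISH_TOKEN.match(token.lower()))

print(is_turkish_alphabet_only("ağaçlandırma"))  # True
print(is_turkish_alphabet_only("wikipedia"))     # False (contains 'w')
```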
### The Algorithm
- Prepare/combine data, remove duplicates
- Loop through `word_usage.<lc>.dict.txt` (the remaining, dictionary-filtered frequency list); a Python sketch of this loop follows the list:
- If the word is in the hard blacklist, it is already handled, else
- If the word is in the hard whitelist, it is OK, else
- If the word is in the dictionary, it is OK, else
- Get the stem word using snowball stemmer (not always possible - so later we added another stemmer and used combined results):
- Check it against the hard blacklist, if exists, blacklist it, or else
- Check it against the hard whitelist, if exists, OK, or else
- Check it against the dictionary, if exists, OK, or else
- Blacklist the word
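A minimal Python sketch of this loop, assuming the `snowballstemmer` package for the Snowball Turkish stemmer (the second stemmer we combined it with, and the actual file handling, are omitted):

```python
import snowballstemmer

# Snowball ships a Turkish stemmer; stemming is not always successful,
# which is why a second stemmer was added later (omitted here).
stemmer = snowballstemmer.stemmer("turkish")

def classify(word, dictionary, hard_whitelist, hard_blacklist, blacklist):
    """Decide whether one frequency-list word goes to the blacklist."""
    if word in hard_blacklist:
        return                        # already known as bad
    if word in hard_whitelist or word in dictionary:
        return                        # known good
    stem = stemmer.stemWord(word)     # may simply return the word unchanged
    if stem in hard_blacklist:
        blacklist.add(word)           # stem is known bad
    elif stem in hard_whitelist or stem in dictionary:
        return                        # stem is known good
    else:
        blacklist.add(word)           # nothing recognized the word

# Driver (most frequent words first):
#   for word in open("word_usage.tr.dict.txt", encoding="utf-8"):
#       classify(word.strip(), dictionary, hard_whitelist, hard_blacklist, blacklist)
```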
### Iterative process
- We used two imaginary breaking points for our iterative process:
- Part-1: Until the ~20,000th record (frequency 72), we worked more deeply, also collecting the foreign words
- Part-2: After that, until the ~100,000th record (frequency 12), we collected missed Turkish words and some important foreign ones
- Part-3: After that we let the algorithm run as it is automatically.
- During Part 1 & 2:
- We checked every word and distributed them between forced black-list and forced white-list files.
- The above algorithm was run several times, until about 10% of the words remained, to collect misfirings; we distributed the detections between hard blacklists and hard whitelists.
- For foreign words (usually proper nouns), we assessed their relevance to the country (e.g. historical figures) and their pronunciation. If a word is pronounced the same in Turkish, we whitelisted it; otherwise, the foreign word was blacklisted.
- Part-3:
- We ran the algorithm, got a blacklist file of the remaining unrecognized words, and sorted it to better visually recognize the same (missed) stem word with multiple suffixes. There were ~1M words...
- We passed through the words starting with "a" (~65k) in 5-6 hours and saw that about 18% of them were still extractable.
- **As that would result in more than anticipated time demand, we decided to change our approach (for the remaining - below freq 12, b-z)**
We decided on a more thorough analysis of the data because the list remained huge and contained many words with characters not in the alphabet.
**We should have used the rules from the start! We should not have used the `--no_check` parameter.**
So we ran the two round-ups mentioned above:
**Final Phase:** To *finally* get the full blacklist, check the results, and enhance some rules in the process.
### Results After 3rd Iteration with 2023-08 Wikipedia Dump
| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len (chars) | Tokens | Non-dict | Description |
| :----- | :-: | :---: | :--: | :--: | --------: | -------: | ---------: | ---------: | :---------- |
| max | :x: | :x: | 1 | ∞ | 5,618,313 | 110.13 | 1,953,986 | 1,541,266 | U1, no_check, no replace, no apostrophes split |
| maxs | :x: | :x: | 1 | ∞ | 5,607,441 | 111.25 | 1,586,532 | 1,172,797 | U2, no_check, +replace, +ap. split |
| maxsp | :x: | :x: | 1 | ∞ | 5,509,426 | 109.51 | 1,471,893 | 1,060,495 | U2, no_check, +replace, +bracket removal, +ap. split |
| maxr | :white_check_mark: | :x: | 1 | ∞ | 1,211,889 | 81.25 | 530,259 | 273,839 | U3, +rules (non-limiting) |
| maxrb | :white_check_mark: | :white_check_mark: | 1 | ∞ | 848,212 | 78.79 | 328,092 | 97,034 | U4, +rules (non-limiting), +BL (125k) |
| Z4r | :white_check_mark: | :x: | 4 | ∞ | 1,001,331 | 70.55 | 446,920 | 183,945 | No BL |
| Z4rb | :white_check_mark: | :white_check_mark: | 4 | ∞ | 706,169 | 69.15 | 269,451 | 31,089 | BL (126k) |
| Z4rb3s | :white_check_mark: | :white_check_mark: | 4 | 3 | 342,263 | 68.18 | 182,634 | 15,582 | BL (126k) |
| Z3r | :white_check_mark: | :x: | 3 | ∞ | 1,026,916 | 69.44 | 451,385 | 186,868 | No BL |
| Z3rb | :white_check_mark: | :white_check_mark: | 3 | ∞ | 725,908 | 67.96 | 272,042 | 31,993 | BL (126k) |
| Z3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 348,316 | 67.02 | 182,895 | 15,582 | BL (126k) |
After final adjustments to the minimum sentence length:
| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len (chars) | Tokens | Non-dict | Description |
| :----- | :-: | :---: | :--: | :--: | --------: | -------: | ---------: | ---------: | :---------- |
| F3r | :white_check_mark: | :x: | 3 | ∞ | 1,017,403 | 69.94 | 450,327 | 165,966 | +rules - No BL (all possible) |
| F3rb | :white_check_mark: | :white_check_mark: | 3 | ∞ | 720,819 | 68.54 | 267,912 | 7,204 | +BL (131k) (remaining possible) |
| F3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | **347,441** | 67.52 | 181,675 | 4,625 | +3 sentence/article |
### Simple statistics of the final run
```json
{
"lc": "tr",
"infile": "/home/bozden/GITREPO/data/wiki.tr.F3rb3s.txt",
"char_dur": 0.1,
"s_cnt": 347441,
"sentence_len": {
"tot": 23459807,
"min": 23,
"max": 129,
"avg": 67.52170008720906,
"med": 65.0,
"std": 24.050121831833685
},
"normalized_len": {
"tot": 23085498,
"min": 22,
"max": 128,
"avg": 66.44436897199812,
"med": 64.0,
"std": 24.019759366951778
},
"alpha_len": {
"tot": 20289265,
"min": 20,
"max": 105,
"avg": 58.39628886631112,
"med": 56.0,
"std": 20.94889008475279
},
"word_count": {
"tot": 3051270,
"min": 3,
"max": 25,
"avg": 8.782124159209765,
"med": 8.0,
"std": 3.2639047106477945
},
"duration": {
"tot": 563.5906944444444,
"min": 2.0,
"max": 10.5,
"avg": 5.839628886631111,
"med": 5.6000000000000005,
"std": 2.09488900847528
}
}
```
We expect 560-600 hours of single recordings from this set.
## Test sets
We first alpha-tested some samples and made a few corrections.
| No | Persona | Sample No: Error Rate (link) |
| :-: | :---------------------------------- | :------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [001: 1.00%](https://bit.ly/3NTiO3m) - [002: 3.00%](https://bit.ly/44gFLCG) |
Initial findings:
- Some proper names that are not pronounced the same way in Turkish passed the filters (they should be blacklisted).
- => We found out that we had not pulled the latest `stem_separator_regex` changes, so we had to repeat the X4rb3s run and the test generation.
- Constructs like "M-class planets" or some foreign names containing a dash might cause problems while reading.
- => Added `-` to `stem_separator_regex`
For a population size of ~350,000, a 95% confidence level and a 2% margin of error require a sample size of 2,385.
Rounding this up to 2,400, we created 6 non-intersecting sets of 400 sentences each.
For this, we used the 4-word-minimum, 3-sentences-per-article output as the population and offered the sets to volunteers via translated/enhanced Excel sheets.
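The sample-size figure above can be reproduced with the standard formula (a quick check using the usual normal approximation with a finite-population adjustment):

```python
import math

# 95% confidence (z = 1.96), 2% margin of error, worst-case p = 0.5,
# adjusted for a finite population of ~350,000 sentences.
N, z, e, p = 350_000, 1.96, 0.02, 0.5

n0 = z**2 * p * (1 - p) / e**2   # ~2401 for an infinite population
n = n0 / (1 + (n0 - 1) / N)      # finite-population adjustment
print(math.ceil(n))              # 2385 -> rounded up to 2,400 (6 x 400)
```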
| No | Persona | Sample No (link): Errors / Rate |
| :-: | :---------------------------------- | :---------------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [01](https://bit.ly/3NTiO3m) 7 / 1.75% & [02](https://bit.ly/44gFLCG) 5 / 1.25% |
| 2 | Ret. radio speaker (TR) | [03](https://bit.ly/44iYIVm) 4 / 1.00% & [09](https://bit.ly/443Vci0) 4 / 1.00% |
| 3 | Ret. pharmacist (TR/little EN) | [04](https://bit.ly/3NqfECF) 15 / 3.75% & [08](https://bit.ly/4301my9) 11 / 2.75% |
| 4 | High school student (TR/EN) | [05](https://bit.ly/441EIGW) 14 / 3.50% |
| 5 | AI expert (TR/EN) | [06](https://bit.ly/3PzdxyP) 6 / 1.50% |
| 6 | Art historian (TR/little EN) | [07](https://bit.ly/3NSgxFp) 16 / 4.00% |
| 7 | Computer Engineer (TR/EN) | [10](https://bit.ly/3py3KPd) 4 / 1.00% |
Total errors / Total Sentences - Error rate: 86 / 4,000 - 2.15%
- Many of the errors are because of the low language/editor quality of the articles themselves, which just cannot be prevented with a word-level blocklist.
- The second major source is that some volunteers did not use the Common Voice system, and we did not give enough information about "readability"; they evaluated the sentences like an editor preparing a text for print (e.g. "it would be better to put a comma here"). Therefore we re-scanned the results and corrected some.
- Also, some sentences become hard to understand when disconnected from their original context.
After more iterations & black/whitelisting, using samples from the min_words = 3 output:
| No | Persona | Sample No (link): Errors / Rate |
| :-: | :---------------------------------- | :---------------------------------------- |
| 1 | Me (knows stuff) (TR/EN/DE) | [11](https://bit.ly/44MBOGh) 11/ 2.75% |
| 2 | Ret. radio speaker (TR) | [12](https://bit.ly/3PPpuAM) 7/ 1.75% |
| 3 | Ret. pharmacist (TR/little EN) | [13](https://bit.ly/44phVFe) 15/ 3.75% |
| 7 | Computer Engineer (TR/EN) | [15](https://bit.ly/3NRYHRx) 14/ 3.50% |
Total errors / Total Sentences - Error rate: 47 / 1,600 - 2.94%
Overall, the error rate becomes 133 / 5,600 = 2.375%.
## Code additions/fixes
During the course, we needed some code changes/additions:
- Fix the `--strip-apostrophes` code for non-standard apostrophes in [cvtools](https://github.com/dabinat/cvtools/pull/8)
- Add `max_characters` rule to [cv-sentence-extractor](https://github.com/common-voice/cv-sentence-extractor/pull/183)
- Add `stem_separator_regex` rule for enabling stem-word extraction from apostrophe suffixed words in [cv-sentence-extractor](https://github.com/common-voice/cv-sentence-extractor/pull/187)
- Add `bracket_removal_list` rule: To remove parentheses/brackets and the content inside them from a sentence
Suggested additions:
- Rule: `replace_unicode` to regex-replace identical-looking characters from other Unicode blocks - e.g. text typed on Cyrillic keyboards. We had a lot of these and had to use the replace list instead. Many Turkic countries use such keyboards, and some articles from other Turkic-language Wikipedias were translated via campaigns. We found ~75k such words affecting ~25k sentences.
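A minimal sketch of what such a replacement could look like (the mapping below covers only a handful of common Cyrillic homoglyphs and uses a simple translation table rather than a regex; it is our illustration, not a proposed complete rule):

```python
# Map a few Cyrillic letters that render identically to Latin ones
# (e.g. U+0430 'а' vs U+0061 'a') back to their Latin equivalents.
CYRILLIC_TO_LATIN = str.maketrans({
    "а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x", "у": "y",
    "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H",
    "О": "O", "Р": "P", "С": "C", "Т": "T", "Х": "X",
})

def normalize_homoglyphs(sentence: str) -> str:
    return sentence.translate(CYRILLIC_TO_LATIN)

# "Аnkara" written with a Cyrillic 'А' becomes plain "Ankara"
print(normalize_homoglyphs("Аnkara"))
```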