Opening a new topic to avoid adding noise to the other conversation, since this one is more practical.
I’ve run:

    python3 word_usage.py -i wiki.es.txt >> usage.es.txt
    awk '$2 ~ /^1$/' usage.es.txt >> uniques.es.txt
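For context, the awk filter above assumes word_usage.py emits one `word count` pair per line. A minimal sketch of such a frequency pass, assuming plain Unicode word tokenization (the real script may normalize differently):

```python
# Minimal sketch of a word-frequency pass like word_usage.py.
# Assumption: output format is "word count", one pair per line,
# which is what the awk filter on column $2 relies on.
import re
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        # \w+ matches Unicode word characters, so accented Spanish letters are kept.
        counts.update(re.findall(r"\w+", line.lower()))

for word, n in counts.most_common():
    print(word, n)
```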
According to this, there are 212388 words that appear only once, most of them obscure or non-native terms. Fixing the single-quote issue would reduce this number. I think it would be very safe to remove all sentences containing these words: from the samples I’ve seen, even the words with 2, 3, 4, or 5 repetitions appear only in complex or odd sentences. A sketch of the filter is below, after the table.
Additional math (sentences affected = repetitions × no. of words; an upper bound, since two occurrences of a word could fall in the same sentence):
Repetitions | No. of words | Sentences affected |
---|---|---|
1 | 212388 | 212388 |
2 | 55068 | 110136 |
3 | 26208 | 78624 |
4 | 15560 | 62240 |
5 | 10523 | 52615 |
Total | 319747 | 516003 |
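Here is the sketch of the proposed filter, assuming one sentence per line in wiki.es.txt, the `word count` format in usage.es.txt, and a hypothetical output file wiki.es.filtered.txt:

```python
# Drop every sentence containing a word that appears at most
# MAX_RARE_COUNT times in the corpus (input names as in the commands above;
# the output name is hypothetical).
import re

MAX_RARE_COUNT = 5

rare = set()
with open("usage.es.txt", encoding="utf-8") as f:
    for line in f:
        word, count = line.split()
        if int(count) <= MAX_RARE_COUNT:
            rare.add(word)

with open("wiki.es.txt", encoding="utf-8") as src, \
     open("wiki.es.filtered.txt", "w", encoding="utf-8") as dst:
    for sentence in src:
        # Keep the sentence only if none of its words are rare.
        if not set(re.findall(r"\w+", sentence.lower())) & rare:
            dst.write(sentence)
```

Keeping the cutoff in a single constant makes it cheap to retry with a lower threshold if dropping everything up to 5 repetitions turns out to be too aggressive.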
Removing all these sentences would leave us with 970574 out of the 1486577 extracted in total. Bearing in mind that, on average, each one takes 5 s to record, this would give us about 1348 hours (versus about 2064 hours if we used all of them).
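Spelled out, with the 5 s average as the only assumption:

```python
# Hour estimates derived from the table above.
total = 1486577                                      # sentences extracted
affected = 212388 + 110136 + 78624 + 62240 + 52615   # = 516003
remaining = total - affected                         # = 970574

SECONDS_EACH = 5  # average recording time per sentence
print(remaining * SECONDS_EACH / 3600)  # ≈ 1348 hours
print(total * SECONDS_EACH / 3600)      # ≈ 2064.7 hours
```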