Cleaning up sentence corpus

Moving to a new topic to avoid noise in the other conversation, since this is more practical.

I’ve run:

python3 word_usage.py -i wiki.es.txt >> usage.es.txt
awk '$2 ~ /^1$/' usage.es.txt >> uniques.es.txt

According to this, there are 212388 words that appear only once, and most of them are odd or non-native terms. Fixing the single-quote issue would reduce this number. I think it would be very safe to remove all sentences containing these words (even the words with 2, 3, 4 or 5 repetitions are complex or odd, judging from the samples I’ve seen).
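A rough sketch of what the removal step could look like, assuming words are obtained by lowercasing and splitting on whitespace (the real tokenization in word_usage.py may differ). The function names and the `max_rare_count` parameter are hypothetical, chosen only for illustration:

```python
from collections import Counter


def count_words(sentences):
    """Count how often each lowercase, whitespace-separated token occurs."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def drop_rare_sentences(sentences, max_rare_count=5):
    """Keep only sentences where every word occurs more than max_rare_count times."""
    counts = count_words(sentences)
    return [
        s for s in sentences
        if all(counts[w] > max_rare_count for w in s.lower().split())
    ]


if __name__ == "__main__":
    # Toy corpus: "duerme", "perro" and "zyx" each occur only once.
    corpus = ["el gato come", "el gato duerme", "el perro zyx come"]
    print(drop_rare_sentences(corpus, max_rare_count=1))
```

With `max_rare_count=1` only sentences containing single-occurrence words are dropped; raising it to 5 would match the more aggressive cut discussed here.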

Additional math:

Repetitions   No. of words   Sentences affected
1             212388         212388
2             55068          110136
3             26208          78624
4             15560          62240
5             10523          52615
Total         319747         516003

Removing all these sentences would leave us with 970574 of the 1486577 extracted. Bearing in mind that each sentence takes about 5 s to record on average, this would give us 1348 hours (versus 2064 hours using all of them).
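For anyone who wants to re-check the numbers, the arithmetic can be reproduced with a few lines of Python. Note that "sentences affected" is computed as repetitions × words, which assumes every occurrence sits in a different sentence, so it is an upper bound on what would be removed. The variable and function names are just illustrative:

```python
# Words occurring exactly n times, taken from the table in this post.
words_by_repetitions = {1: 212388, 2: 55068, 3: 26208, 4: 15560, 5: 10523}

TOTAL_SENTENCES = 1486577
SECONDS_PER_SENTENCE = 5  # average recording time stated in this post


def affected_sentences(words_by_reps):
    """Upper bound: each occurrence is assumed to be in a different sentence."""
    return sum(reps * words for reps, words in words_by_reps.items())


def hours(sentence_count, seconds_each=SECONDS_PER_SENTENCE):
    """Recording time in hours for the given number of sentences."""
    return sentence_count * seconds_each / 3600


if __name__ == "__main__":
    affected = affected_sentences(words_by_repetitions)
    remaining = TOTAL_SENTENCES - affected
    print(affected, remaining)  # 516003 970574
    print(int(hours(remaining)), int(hours(TOTAL_SENTENCES)))  # 1348 2064
```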