Cleaning up sentence corpus

nukeador · July 3, 2019, 9:14pm

Moving to a new topic to avoid noise in the other conversation, since this is more practical.

I ran:

python3 word_usage.py -i wiki.es.txt >> usage.es.txt
awk '$2 ~ /^1$/' usage.es.txt >> uniques.es.txt

According to this, there are 212388 words that appear only once; most of them are weird or non-native terms. Fixing the single quote issue would reduce this number. I think it would be very safe to remove all sentences containing these words (even the words with 2, 3, 4 or even 5 repetitions are complex or weird, from the samples I’ve seen).
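For anyone who wants to try this kind of filtering, here is a minimal sketch in Python. It does not use word_usage.py; it counts words itself, and it assumes one sentence per item with naive whitespace tokenization. The names `tokenize` and `filter_rare` are hypothetical, not part of any existing tool.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase, split on whitespace, strip surrounding punctuation.
    words = (w.strip(".,;:!?¡¿\"'()«»").lower() for w in sentence.split())
    return [w for w in words if w]


def filter_rare(sentences, max_repetitions=5):
    """Keep only sentences whose words all occur more than max_repetitions times."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    kept = [
        s for s in sentences
        if all(counts[w] > max_repetitions for w in tokenize(s))
    ]
    return kept, counts


# Tiny made-up sample, only to show the behaviour.
sample = [
    "el gato come",
    "el gato duerme",
    "el perro come",
    "el perro duerme",
    "el gato ve un ornitorrinco",
]
kept, counts = filter_rare(sample, max_repetitions=1)
print(kept)
```

With `max_repetitions=1` the last sample sentence is dropped, because it contains words that occur only once in the sample. On the real corpus the threshold would be whatever we agree on (1 to 5 in the numbers discussed here).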

Additional math:

Repetitions	No. of words	Sentences affected
1	212388	212388
2	55068	110136
3	26208	78624
4	15560	62240
5	10523	52615
Total	319747	516003

Removing all these sentences would leave us with 970574 of the 1486577 sentences extracted. Bearing in mind that each one takes about 5 s to record on average, that would give us about 1348 hours (versus 2064 hours if we used all of them).
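The arithmetic above can be reproduced with a few lines of Python. This is only a sanity check on the figures in the table. It assumes, as the table does, that "sentences affected" is repetitions times number of words, which is an upper bound: it treats every occurrence as being in a different sentence and no sentence as containing two rare words. The variable names are made up for illustration.

```python
SECONDS_PER_SENTENCE = 5      # average recording time quoted in the post
TOTAL_SENTENCES = 1486577     # sentences extracted in total

# repetitions -> number of words with exactly that many occurrences
words_by_repetitions = {1: 212388, 2: 55068, 3: 26208, 4: 15560, 5: 10523}

# Upper bound: one distinct sentence per occurrence of each rare word.
affected = {r: r * n for r, n in words_by_repetitions.items()}

total_words = sum(words_by_repetitions.values())
total_affected = sum(affected.values())
remaining = TOTAL_SENTENCES - total_affected


def hours(sentences):
    # Convert a sentence count to recording hours.
    return sentences * SECONDS_PER_SENTENCE / 3600


print(total_words, total_affected, remaining)
print(int(hours(remaining)), int(hours(TOTAL_SENTENCES)))
```

The hour figures are truncated to whole hours, which matches the 1348 and 2064 quoted above.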
