Cleaning up sentence corpus

I’m short on time this week. To generate the list with the spaCy vocabulary column, you need to use my fork of dabinat’s tool: https://github.com/carlfm01/cvtools/tree/spacy

To use it for other languages, you need to install the target language model (please use the md versions): https://spacy.io/usage/models
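
For example, installing the medium Spanish model looks like this (the model name is just an example; pick your language’s md model from the page above):

```python
# Download the medium ("md") model for the target language.
# "es_core_news_md" is the Spanish example; other languages have their own names.
from spacy.cli import download

download("es_core_news_md")  # equivalent to: python -m spacy download es_core_news_md
```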

Then you need to change the spaCy model to the one you want to use: https://github.com/carlfm01/cvtools/blob/9533a318cd63cd7967fa18dab8ac215fdc9c7da9/word_usage.py#L104
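
I won’t quote the exact line here, but the change boils down to loading a different model, something like this (the variable name is illustrative; check word_usage.py for the real one):

```python
import spacy

# Load your target language's model instead of the English one.
nlp = spacy.load("es_core_news_md")  # e.g. instead of "en_core_web_md"
```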

Finally, the generated file contains three columns: word, frequency, outOfSpacyVocab. Reading and filtering this file is up to you at the moment.
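
A minimal sketch for reading it, assuming a tab-separated layout and a hypothetical filename (adjust both to whatever the tool actually writes):

```python
import csv

# Read (word, frequency, outOfSpacyVocab) rows from the generated file.
rows = []
with open("word_usage.txt", encoding="utf-8") as f:
    for word, freq, oov in csv.reader(f, delimiter="\t"):
        rows.append((word, int(freq), oov.strip().lower() == "true"))

# Example filter: keep only words that spaCy's vocabulary knows.
in_vocab = [(word, count) for word, count, out in rows if not out]
```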

Interesting fact: I found a word frequency file for the whole English Wikipedia here.

From a superficial analysis, any word with fewer than 900 repetitions is complex, weird, or non-native. @gregor and I are currently trying to compute this word usage for all sentences in the German and Spanish Wikipedia, and the repetition threshold to use will probably vary per language.
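
In code, the check is just a per-language threshold over those rows, something like:

```python
# Flag words below a repetition threshold as likely complex/weird/non-native.
# 900 matched the English Wikipedia file; German and Spanish may need other values.
def suspect_words(rows, threshold=900):
    """rows: (word, frequency, out_of_vocab) tuples as read above."""
    return {word for word, freq, oov in rows if freq < threshold or oov}
```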

@nukeador How does the new Report button affect cleaning of the English wiki sentences? Should they still be cleaned through scripts or are we now relying on users to flag them?

I think both options are valid. The more people reporting, the better. I think the plan is to run regular reports with the flagged sentences, and maybe we can automate their removal somehow.

/cc @mbranson @gregor

I’m facing more complexity when analyzing word frequency.

There are some words that have a high number of repetitions just because they are repeated hundreds of times in 1–2 articles, and we don’t really have a way to track that.
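
If we could re-run the counting per article, tracking it would look roughly like this (a sketch, assuming we can iterate over article texts):

```python
from collections import Counter

# Count in how many distinct articles each word occurs. A word with a high
# raw count but a document frequency of 1-2 is repeated inside a couple of
# articles rather than being genuinely common.
def document_frequency(articles, tokenize=str.split):
    df = Counter()
    for text in articles:               # `articles` yields one article's text
        df.update(set(tokenize(text)))  # set(): count each article at most once
    return df
```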

I’m doing some tests for Spanish, excluding words with 80 repetitions or fewer. I’m getting a very low number of sentences, and some are still invalid.
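
For reference, the filter I’m testing is essentially this (a sketch; `freq` stands for the word-to-count mapping from the usage file, and the naive whitespace tokenizer is a simplification):

```python
# Keep a sentence only if every word clears the repetition threshold.
# With threshold=80 for Spanish this rejects most sentences, and a few
# invalid ones still get through.
def keep_sentence(sentence, freq, threshold=80, tokenize=str.split):
    return all(freq.get(word.lower(), 0) > threshold for word in tokenize(sentence))
```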