I’m limited this week. To generate the list with the spaCy vocabulary column, you need to use my fork of dabinat’s tool: GitHub - carlfm01/cvtools at spacy
Finally, the generated file contains three columns: word, frequency and outOfSpacyVocab. Reading and filtering this file is up to you for now.
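For anyone who wants a starting point for the reading and filtering step, here is a minimal sketch. It assumes the three columns are separated by whitespace and that the outOfSpacyVocab column holds something like `True`/`False` or `1`/`0`; check the real file and adjust the parsing if it differs. The function names and the sample rows are made up for illustration.

```python
import io


def parse_rows(lines):
    """Yield (word, frequency, out_of_vocab) tuples from the generated file.

    Assumption: columns are whitespace-separated and the third column is a
    boolean-like flag ("True"/"False" or "1"/"0").
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, freq, flag = parts
        if not freq.isdigit():
            continue  # skip a header row, if there is one
        yield word, int(freq), flag.lower() in ("true", "1")


def out_of_vocab_words(lines, min_frequency=1):
    """Return words flagged as outside the spaCy vocabulary.

    Only words seen at least `min_frequency` times are kept.
    """
    return [
        word
        for word, freq, out_of_vocab in parse_rows(lines)
        if out_of_vocab and freq >= min_frequency
    ]


# Made-up sample rows standing in for the real file.
sample = io.StringIO("the 1000 False\nxyzzy 3 True\nfoobarish 12 True\n")
print(out_of_vocab_words(sample, min_frequency=5))
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample.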
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
22
Interesting fact: I found a word frequency file for the whole English Wikipedia here
From a superficial analysis, any word with fewer than 900 repetitions is complex, weird or non-native. @gregor and I are currently trying to get the same word usage data for all sentences in the German and Spanish Wikipedia, and the number of repetitions to use as a cutoff will probably vary by language.
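As a rough sketch of how such a cutoff could be applied to sentences: keep a sentence only if every word in it reaches the minimum number of repetitions. The tokenizer, the function names and the sample frequencies below are assumptions for illustration, not what our scripts actually do.

```python
import re

MIN_REPETITIONS = 900  # cutoff from the superficial English analysis


def tokenize(sentence):
    """Lowercase the sentence and return its words (letters plus inner apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", sentence.lower())


def uses_common_words(sentence, frequencies, min_repetitions=MIN_REPETITIONS):
    """True if the sentence has words and all of them reach the cutoff.

    Words missing from `frequencies` count as zero repetitions.
    """
    words = tokenize(sentence)
    return bool(words) and all(
        frequencies.get(word, 0) >= min_repetitions for word in words
    )


# Made-up frequencies, only to show the behaviour.
frequencies = {"the": 5000, "cat": 1200, "sat": 950, "quokka": 40}
print(uses_common_words("The cat sat", frequencies))
print(uses_common_words("The quokka sat", frequencies))
```

The cutoff is a parameter so it can be tuned per language.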
@nukeador How does the new Report button affect cleaning of the English wiki sentences? Should they still be cleaned through scripts or are we now relying on users to flag them?
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
24
I think both options are valid. The more people reporting, the better. I think the plan is to run regular reports on the flagged sentences, and maybe we can automate their removal somehow.
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
25
I’m running into more complexity when analyzing word frequency.
Some words have a high number of repetitions only because they are repeated hundreds of times in 1-2 articles, and we don’t really have a way to track that.
I’m doing some tests for Spanish: when I avoid words with 80 repetitions or fewer, I get a very low number of sentences, and some of them are still invalid.
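One possible way to track the "repeated in 1-2 articles" problem, if the text were available per article, would be to count in how many articles each word appears in addition to its total repetitions. This is only a sketch under that assumption; the function names, the thresholds and the sample articles are hypothetical.

```python
import re
from collections import Counter


def count_words(articles):
    """Return (total repetitions, number of articles containing the word)."""
    total = Counter()
    article_count = Counter()
    for text in articles:
        words = re.findall(r"[^\W\d_]+", text.lower())
        total.update(words)
        article_count.update(set(words))  # each article counts once per word
    return total, article_count


def well_spread_words(articles, min_repetitions, min_articles):
    """Words that are frequent overall and also appear in enough articles."""
    total, article_count = count_words(articles)
    return {
        word
        for word in total
        if total[word] >= min_repetitions and article_count[word] >= min_articles
    }


# Made-up articles: "zorb" is frequent but only inside a single article.
articles = ["zorb zorb zorb zorb house", "house garden", "garden house"]
print(well_spread_words(articles, min_repetitions=3, min_articles=2))
```

A word like "zorb" passes a plain repetition cutoff but is dropped once the article count is taken into account.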