Stats about Common Voice: Kabyle Corpus

I’m sharing some stats about the kab corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK Perceptron Tagger, using a model I had already trained on another corpus.

For graphs and networks I used matplotlib, numpy, networkx and pylab.

I analyzed:
Word length
Sentence length
Grammatical classes (tags)
Punctuation vs. alphabet
Verbs/aspect
Verb occurrence
Word occurrence
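
For anyone who wants to try something similar, here is a minimal sketch of how most of these counts could be computed from tagged sentences. It is not the actual script: the function name `corpus_stats`, the sample tokens and the tag names (`NOUN`, `VERB`, ...) are made up for illustration, and it assumes the tagger output is a list of `(token, tag)` pairs per sentence. Aspect is not covered here.

```python
from collections import Counter, defaultdict
import string


def is_punct(token):
    # Assumption: a token is punctuation if every character is ASCII punctuation.
    return all(ch in string.punctuation for ch in token)


def corpus_stats(tagged_sentences):
    """Count word lengths, sentence lengths, tags, characters and word occurrences."""
    word_lengths = Counter()        # word length -> number of words
    sentence_lengths = Counter()    # words per sentence -> number of sentences
    tag_counts = Counter()          # tag -> number of tokens
    char_classes = Counter()        # "alphabet" / "punctuation" -> characters
    words_by_tag = defaultdict(Counter)  # tag -> word -> occurrences

    for sentence in tagged_sentences:
        words = [tok for tok, _ in sentence if not is_punct(tok)]
        sentence_lengths[len(words)] += 1
        for tok, tag in sentence:
            tag_counts[tag] += 1
            for ch in tok:
                if ch.isalpha():
                    char_classes["alphabet"] += 1
                elif ch in string.punctuation:
                    char_classes["punctuation"] += 1
            if not is_punct(tok):
                word_lengths[len(tok)] += 1
                words_by_tag[tag][tok.lower()] += 1

    # Word occurrence over all tags is the sum of the per-tag counters.
    word_counts = Counter()
    for counter in words_by_tag.values():
        word_counts.update(counter)

    return {
        "word_lengths": word_lengths,
        "sentence_lengths": sentence_lengths,
        "tag_counts": tag_counts,
        "char_classes": char_classes,
        "word_counts": word_counts,
        "words_by_tag": words_by_tag,
    }


# Placeholder data, not taken from the Kabyle corpus.
sample = [
    [("abc", "NOUN"), ("de", "VERB"), (".", "PUNCT")],
    [("abc", "NOUN"), ("fghi", "ADJ"), ("de", "VERB"), ("!", "PUNCT")],
]
stats = corpus_stats(sample)
print(stats["word_counts"].most_common(3))
print(stats["words_by_tag"]["VERB"].most_common(3))  # verb occurrence
```

The counters can be passed straight to matplotlib (for example a bar chart of `stats["word_lengths"]`) to get plots like the ones in this post.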


We use these stats to avoid repetitive words and syntactic forms.
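
One possible way to spot repetitive syntactic forms is to count sequences of tags (tag n-grams) and look at the ones that take up a large share of the corpus. This is only a sketch under assumptions: the helper names `tag_ngrams` and `frequent_patterns`, the 25% threshold and the sample tags are all hypothetical, not what our scripts actually do.

```python
from collections import Counter


def tag_ngrams(tagged_sentences, n=2):
    """Count every run of n consecutive tags inside each sentence."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def frequent_patterns(counts, min_share=0.25):
    """Return (pattern, share) pairs whose share of all n-grams is at least min_share."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (pattern, count / total)
        for pattern, count in counts.most_common()
        if count / total >= min_share
    ]


# Placeholder data, not taken from the Kabyle corpus.
sample = [
    [("abc", "NOUN"), ("de", "VERB"), (".", "PUNCT")],
    [("abc", "NOUN"), ("fghi", "ADJ"), ("de", "VERB"), ("!", "PUNCT")],
    [("jk", "NOUN"), ("lm", "VERB"), (".", "PUNCT")],
]
bigrams = tag_ngrams(sample, n=2)
overused = frequent_patterns(bigrams, min_share=0.25)
for pattern, share in overused:
    print(" ".join(pattern), f"{share:.0%}")
```

Patterns that come out on top are candidates to vary when writing new sentences for the corpus.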

[Image: 2019-02-07 12_12_12-Figure 1]

[Image: 2019-02-07 12_05_52-Figure 1]

Cool! Is this something that can be re-used for other languages? Are the tools to generate this openly available?

Cheers.

Yes, but the scripts only handle the Kabyle language (tokenization, POS tagging…). They have been freely available on GitHub/GitLab for months, if not years :slight_smile: I’ll check whether I uploaded the latest updates (mozillakab on GitHub; I mostly use French to explain/describe things).