Stats about Common Voice: Kabyle Corpus

(Muḥend Belqasem) #1

I’m sharing some stats about the kab corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK Perceptron Tagger, using a model I had already trained on another corpus.

For graphs and networks I used matplotlib, numpy, networkx and pylab.

I analyzed:
Word length
Sentence length
Grammatical classes (tags)
Punctuation vs. alphabet
Verb occurrence
Word occurrence

We use these stats to avoid repetitive words and syntactic forms.
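
For anyone who wants to try something similar on another language, here is a minimal sketch of a few of the counts listed above (word length, sentence length, punctuation vs. alphabet, word occurrence). It is not the actual Kabyle scripts: the regex tokenizer, the `corpus_stats` function and the placeholder sentences are assumptions made for illustration, and a real tokenizer would need language-specific rules (hyphenated forms, for example). It uses only the Python standard library.

```python
import re
from collections import Counter

# Placeholder sentences (hypothetical); replace with real corpus lines.
sentences = [
    "aaa bb, cc!",
    "bb ddd aaa.",
]

# Naive tokenizer (an assumption): runs of word characters, or single
# non-space symbols. Language-specific rules are not handled here.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
WORD_RE = re.compile(r"\w+")


def corpus_stats(sentences):
    """Return simple frequency counts for a list of sentences."""
    word_lengths = Counter()      # word length -> number of words
    sentence_lengths = Counter()  # words per sentence -> number of sentences
    word_counts = Counter()       # word -> occurrences
    char_classes = Counter()      # "alphabet" / "punctuation" -> characters

    for sentence in sentences:
        tokens = TOKEN_RE.findall(sentence.lower())
        words = [t for t in tokens if WORD_RE.fullmatch(t)]

        sentence_lengths[len(words)] += 1
        word_counts.update(words)
        word_lengths.update(len(w) for w in words)

        for ch in sentence:
            if ch.isalpha():
                char_classes["alphabet"] += 1
            elif not ch.isspace() and not ch.isalnum():
                char_classes["punctuation"] += 1

    return {
        "word_lengths": word_lengths,
        "sentence_lengths": sentence_lengths,
        "word_counts": word_counts,
        "char_classes": char_classes,
    }


stats = corpus_stats(sentences)
# The most frequent words are the first candidates to check for repetition.
print(stats["word_counts"].most_common(10))
```

The resulting `Counter` objects can be passed to a plotting library to draw histograms. Tag and verb counts would work the same way once each token has a POS tag attached.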



(Rubén Martín) #2

Cool! Is this something that can be re-used for other languages? Are the tools to generate this openly available?


(Muḥend Belqasem) #3

Yes, but the scripts only handle the Kabyle language (tokenization, POS tagging…). They have been freely available on GitHub/GitLab for months or years :slight_smile: I’ll check whether I uploaded the latest updates (mozillakab on GitHub; I mostly use French to explain/describe things).