Stats about Common Voice: Kabyle Corpus

dataset
(Muḥend Belqasem) #1

I’m sharing some stats about the kab corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK Perceptron Tagger, using a model I had already built from another corpus.

For the graphs and networks I used matplotlib, numpy, networkx and pylab.

I analyzed:
Word length
Sentence length
Grammatical classes (tags)
Punctuation vs. alphabet
Verbs/aspect
Verb occurrence
Word occurrence
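
As a rough illustration of the simpler counts in the list above (word length, sentence length, punctuation vs. alphabet, word occurrence), here is a minimal sketch using only the Python standard library. It is not the script used for the Kabyle corpus: the function name `corpus_stats`, the whitespace tokenization and the sample sentences are all assumptions made for the example, and the real analysis relies on a Kabyle-specific tokenizer and POS tagger.

```python
from collections import Counter
import string


def corpus_stats(sentences):
    """Return simple length and frequency counts for a list of sentences.

    Hypothetical helper for illustration; tokenization is a naive
    whitespace split, not a Kabyle-aware tokenizer.
    """
    word_lengths = Counter()      # word length -> number of words
    sentence_lengths = Counter()  # words per sentence -> number of sentences
    word_counts = Counter()       # lowercased word -> occurrences
    char_classes = Counter()      # "alphabetic" / "punctuation" -> characters

    for sentence in sentences:
        # Split on whitespace, then trim punctuation from the token edges
        # so that inner hyphens (as in "fell-awen") are kept.
        tokens = [t.strip(string.punctuation) for t in sentence.split()]
        tokens = [t.lower() for t in tokens if t]

        sentence_lengths[len(tokens)] += 1
        for token in tokens:
            word_lengths[len(token)] += 1
            word_counts[token] += 1

        # Character-level comparison of letters and ASCII punctuation.
        for ch in sentence:
            if ch.isalpha():
                char_classes["alphabetic"] += 1
            elif ch in string.punctuation:
                char_classes["punctuation"] += 1

    return {
        "word_lengths": word_lengths,
        "sentence_lengths": sentence_lengths,
        "word_counts": word_counts,
        "char_classes": char_classes,
    }


if __name__ == "__main__":
    # Two made-up sample sentences, not taken from the corpus.
    stats = corpus_stats(["Azul fell-awen!", "Azul, amek?"])
    print(stats["word_counts"].most_common(3))
    print(sorted(stats["sentence_lengths"].items()))
```

The resulting `Counter` objects can be passed to a plotting library to draw histograms; tag-based counts (grammatical classes, verbs/aspect) would need the tagged corpus as input instead of raw sentences.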


We use these stats to avoid repetitive words and syntactic forms.

[Image: 2019-02-07 12_12_12-Figure 1]

[Image: 2019-02-07 12_05_52-Figure 1]

(Rubén Martín [away until April 24th]) #2

Cool! Is this something that can be re-used for other languages? Are the tools to generate this openly available?

Cheers.

(Muḥend Belqasem) #3

Yes, but the scripts only handle the Kabyle language (tokenization, POS tagging…). They have been freely available on GitHub/GitLab for months or years :slight_smile: I’ll check whether I uploaded the latest updates (the account is mozillakab on GitHub, and I mostly use French to explain and describe things).
