I’m sharing some stats about the Kabyle (kab) corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK perceptron tagger. For the tagging, I used a model I had already trained on another corpus.
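As a rough illustration of that pipeline, here is a minimal sketch that trains an NLTK `PerceptronTagger` on hand-tagged sentences and then tags new text. The training sentences, the tag names, the helper names `train_tagger` and `tag_text`, and the choice of `wordpunct_tokenize` are all assumptions made for the example. They are placeholders, not the actual corpus, tagset, or tokenizer used for these stats.

```python
from nltk.tag.perceptron import PerceptronTagger
from nltk.tokenize import wordpunct_tokenize

# Placeholder training data standing in for the "other corpus".
# Each sentence is a list of (word, tag) pairs; the tagset is illustrative only.
TRAIN_SENTENCES = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
    [("a", "DET"), ("dog", "NOUN"), ("runs", "VERB"), (".", "PUNCT")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
]


def train_tagger(sentences, iterations=5):
    """Train a fresh perceptron tagger (no pretrained model is loaded)."""
    tagger = PerceptronTagger(load=False)
    tagger.train(sentences, nr_iter=iterations)
    return tagger


def tag_text(tagger, text):
    """Split text into word and punctuation tokens, then tag each token."""
    tokens = wordpunct_tokenize(text)  # regex-based, needs no extra NLTK data
    return tagger.tag(tokens)


tagger = train_tagger(TRAIN_SENTENCES)
tagged = tag_text(tagger, "the cat runs.")
print(tagged)  # a list of (token, tag) pairs
```

With a real corpus the training set would be far larger, and the trained model would normally be saved once and reloaded instead of being retrained on every run.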
For the graphs and networks I used matplotlib, numpy, networkx and pylab.
Grammatical classes (tags)
Punctuation VS Alphabet
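The two counts above can be sketched with the standard library alone. This is one plausible reading of the stats, not necessarily how they were computed: the tag counts come from (token, tag) pairs, and "punctuation vs alphabet" is taken here as a character-level comparison. The function names and the sample data are hypothetical.

```python
import unicodedata
from collections import Counter


def tag_frequencies(tagged_tokens):
    """Count how often each grammatical class (tag) occurs."""
    return Counter(tag for _, tag in tagged_tokens)


def char_class_counts(text):
    """Count letters versus punctuation marks, character by character."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts["letter"] += 1
        elif unicodedata.category(ch).startswith("P"):
            # Unicode general categories starting with "P" are punctuation.
            counts["punctuation"] += 1
    return counts


# Placeholder data, not taken from the corpus.
sample_tagged = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
                 (".", "PUNCT"), ("the", "DET")]
tag_counts = tag_frequencies(sample_tagged)
char_counts = char_class_counts("the cat sleeps.")
print(tag_counts.most_common())
print(char_counts)
```

Either `Counter` can be passed to a matplotlib bar chart by plotting its keys against its values.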
We use these stats to avoid repetitive words and syntactic forms.
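One way to spot such repetition, sketched here under assumptions, is to count word frequencies and tag n-grams across the tagged sentences. A tag sequence that dominates the counts points to an overused syntactic form. The function names, the choice of trigrams, and the sample sentences are illustrative only.

```python
from collections import Counter


def word_frequencies(tagged_sentences):
    """Count lowercase word forms across all sentences."""
    return Counter(word.lower() for sent in tagged_sentences for word, _ in sent)


def tag_ngram_frequencies(tagged_sentences, n=3):
    """Count sequences of n consecutive tags within each sentence."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


# Placeholder sentences, not taken from the corpus.
sample_sentences = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
word_counts = word_frequencies(sample_sentences)
pattern_counts = tag_ngram_frequencies(sample_sentences, n=3)
print(word_counts.most_common(3))
print(pattern_counts.most_common(1))
```

Contributors could then be pointed at the most frequent words and patterns and asked for sentences that vary them.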