Stats about Common Voice: Kabyle Corpus

belkacem77 · February 7, 2019, 11:25am

I’m sharing some stats about the kab corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK Perceptron Tagger, using a model I had already generated from another corpus.

For graphs and networks I used: matplotlib, numpy, networkx and pylab

I analyzed:
Word length
Sentence length
Grammatical classes (tags)
Punctuation vs. alphabet
Verbs/Aspect
Verb occurrence
Word occurrence

We use these stats to avoid repetitive words and syntactic forms.
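
For anyone who wants to try something similar on another language, here is a minimal sketch of how a few of the counts above (word length, sentence length, word occurrence, punctuation vs. alphabet) could be computed with the Python standard library alone. It is not the actual Kabyle scripts: the sample sentences, the regex tokenizer and the `corpus_stats` helper are assumptions made for illustration, and a real run would use a language-specific tokenizer and the full corpus.

```python
import re
from collections import Counter

# Illustrative sample only; the real input would be the corpus sentences.
SENTENCES = [
    "Azul fell-awen!",
    "Tanemmirt, azul!",
]

# Naive tokenizer (an assumption): runs of word characters, keeping
# hyphens and apostrophes that sit inside a word.
WORD_RE = re.compile(r"\w+(?:[-'’]\w+)*")


def tokenize(sentence):
    """Return the lowercased word tokens of one sentence."""
    return [w.lower() for w in WORD_RE.findall(sentence)]


def corpus_stats(sentences):
    """Count word lengths, sentence lengths, word occurrences,
    and letters vs. punctuation characters."""
    word_lengths = Counter()      # word length in characters -> count
    sentence_lengths = Counter()  # sentence length in words -> count
    word_counts = Counter()       # word -> number of occurrences
    letters = 0
    punctuation = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        sentence_lengths[len(tokens)] += 1
        for token in tokens:
            word_lengths[len(token)] += 1
            word_counts[token] += 1
        for ch in sentence:
            if ch.isalpha():
                letters += 1
            elif not ch.isspace() and not ch.isalnum():
                punctuation += 1
    return {
        "word_lengths": word_lengths,
        "sentence_lengths": sentence_lengths,
        "word_counts": word_counts,
        "letters": letters,
        "punctuation": punctuation,
    }


stats = corpus_stats(SENTENCES)
print(stats["word_counts"].most_common(3))
print(stats["letters"], stats["punctuation"])
```

Words at the top of `word_counts.most_common()` are the ones most likely to be over-represented, which is the kind of repetition the stats are meant to catch.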

nukeador · February 7, 2019, 12:25pm

Cool! Is this something that can be re-used for other languages? Are the tools to generate this openly available?

Cheers.

belkacem77 · February 7, 2019, 12:30pm

Yes, but the scripts deal only with the Kabyle language (tokenization, POS tagging…). They have been freely available on GitHub/GitLab for months or years :smile: I’ll check whether I uploaded the latest updates (mozillakab on GitHub; I mostly use French to explain/describe things).
