The main goal of a language community is to improve the language data (text and voice corpora) on CV, so that better voice AI can be built from it.
It is like steering: you turn left; if that was too much, you turn a bit right; if you are going too slowly, you hit the gas.
Currently, CV releases are 3 months apart. I think this is an ideal interval for the following workflow:
- A version comes out and you analyze the data (also taking the previous releases into account).
- You find what you are lacking, or how much you improved compared to the previous version.
- You plan for the next release (campaigns etc.).
- Go back to the start.
For the analysis part, I prepared two webapps. I think they will help with this crucial step for all languages.
For example, the results of v12.0 for my language, Turkish, are here.
From the Text Corpus tab I can see the following:
Total tokens are ~39k.
A Turkish dictionary has ~90k entries, but many of them are rather old (with Farsi or Arabic roots) and are not used much in speech today.
On the other hand, Turkish is an agglutinative language (words are extended with suffixes), so the 39k does not count only root words; it also includes plurals and other inflected forms.
So I need to work on vocabulary and extend the text corpus.
If we look at the character or word/sentence distribution:
I see that many sentences are short. This is because we added many short conversational sentences.
So we need to work on adding longer sentences.
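To check a distribution like this on your own sentence list before a release, a word-count histogram is enough. The following is only a minimal sketch, not the code behind the webapps; the function name `length_histogram`, the `bin_size` parameter, and the sample sentences are made up for illustration.

```python
from collections import Counter

def length_histogram(sentences, bin_size=2):
    """Count sentences per word-count bin.

    With bin_size=2 the bins are 0-1 words (key 0), 2-3 words (key 2), etc.
    Words are split on whitespace, which is a simplification.
    """
    hist = Counter()
    for sentence in sentences:
        n_words = len(sentence.split())
        hist[(n_words // bin_size) * bin_size] += 1
    return dict(sorted(hist.items()))

# Hypothetical sample, only to show the shape of the output.
sample = [
    "Merhaba.",
    "Nasılsın bugün?",
    "Bu akşam eve geç geleceğim çünkü işim var.",
]
histogram = length_histogram(sample)
print(histogram)
```

If most of the mass sits in the lowest bins, the corpus is dominated by short sentences, as in the Turkish example above.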
I added all the books of a famous writer, which became public domain, so a lot of rather old vocabulary is also included.
So I need to find CC0 sources for the text corpus that have longer sentences and include new vocabulary.
Whenever I find some resources, I analyze them against the current corpora with offline scripts and calculate the amount of new vocabulary. I also apply similarity filters to remove very similar sentences.
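My actual scripts are not shown here, but the idea can be sketched with the Python standard library. Everything below is an assumption for illustration: the function names, the 0.9 threshold, the simplified Turkish lowercasing, and the use of `difflib.SequenceMatcher` as the similarity measure. The pairwise comparison is quadratic, so a large corpus would need a faster method.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase and split into word tokens.

    Simplified Turkish-aware lowercasing: map the dotted and dotless
    capital I before calling lower(). Other languages need other rules.
    """
    text = text.replace("İ", "i").replace("I", "ı").lower()
    return re.findall(r"\w+", text)

def new_vocabulary(candidates, corpus):
    """Return tokens that occur in the candidates but not in the corpus."""
    known = {tok for sentence in corpus for tok in tokenize(sentence)}
    found = {tok for sentence in candidates for tok in tokenize(sentence)}
    return found - known

def drop_near_duplicates(candidates, corpus, threshold=0.9):
    """Keep a candidate only if it is not too similar to the corpus
    or to a candidate that was already kept."""
    pool = [" ".join(tokenize(s)) for s in corpus]
    kept = []
    for cand in candidates:
        norm = " ".join(tokenize(cand))
        if all(SequenceMatcher(None, norm, ref).ratio() < threshold for ref in pool):
            kept.append(cand)
            pool.append(norm)
    return kept

# Hypothetical data, only to show usage.
corpus = ["Bugün hava çok güzel."]
candidates = ["Bugün hava çok güzel!", "Yarın denize gideceğiz."]
fresh_words = new_vocabulary(candidates, corpus)
filtered = drop_near_duplicates(candidates, corpus)
print(len(fresh_words), filtered)
```

The first candidate differs from the corpus sentence only in punctuation, so it is filtered out; the second one is kept and contributes all of the new words.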
After collecting a couple of these, I merge them and shuffle them before posting them to Sentence Collector for validation, which is done by the community. From each book, I can get some 3-5k sentences and a few hundred new vocabulary words.
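The merge-and-shuffle step can be sketched like this. It is an illustration only: `merge_and_shuffle` is a made-up name, and dropping exact duplicates during the merge is my assumption here, not something Sentence Collector requires. Shuffling mixes the sources, so validators do not get one book's sentences in a row.

```python
import random

def merge_and_shuffle(sources, seed=None):
    """Merge several sentence lists, drop exact duplicates and empty
    lines, then shuffle. Pass a seed to make the order reproducible."""
    seen = set()
    merged = []
    for sentences in sources:
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                merged.append(sentence)
    random.Random(seed).shuffle(merged)
    return merged

# Hypothetical inputs standing in for sentences taken from two books.
book_a = ["Birinci cümle.", "İkinci cümle.", ""]
book_b = ["Üçüncü cümle.", "Birinci cümle."]
batch = merge_and_shuffle([book_a, book_b], seed=42)
print(batch)
```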
I hope this example helps with your language.