Since the original Turkish corpus is limited, we need to add more sentences. The process is well defined on the related micro-site:
- Finding CC0 sources (public domain material)
- Extracting / reviewing / filtering sentences on your computer
- Posting them to the sentence-collector
- Having two other people review them and accept or reject each one
- Accepted new data will be compiled and added to the main corpus (bi-weekly) automatically.
After that, we will see them while recording our voices.
After reviewing 1K+ recordings, I noticed the following pitfalls and suggest some remedies:
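The "extracting / reviewing / filtering" step on your own computer could be sketched roughly like this. It is only a minimal sketch: the word limit, the allowed character set, and the function names are my assumptions, not the official sentence-collector rules, so check the guidelines before relying on them.

```python
import re

# Assumed word limit; verify against the sentence-collector guidelines.
MAX_WORDS = 14

# Latin letters plus Turkish-specific ones and basic punctuation.
# Anything else (digits, symbols, other alphabets) rejects the sentence.
ALLOWED = re.compile(r"^[a-zA-ZçÇğĞıİöÖşŞüÜâîûÂÎÛ\s.,;:!?'\-]+$")

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_candidate(sentence):
    """Return True if the sentence is short enough and uses only allowed characters."""
    words = sentence.split()
    if not 1 <= len(words) <= MAX_WORDS:
        return False
    return ALLOWED.match(sentence) is not None

def filter_text(text):
    """Split raw text and keep unique candidate sentences, in original order."""
    seen = set()
    kept = []
    for sentence in split_sentences(text):
        if is_candidate(sentence) and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
    return kept
```

A script like this only narrows things down. Every kept sentence still needs a human read before it goes to the sentence-collector.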
- The base data is from a news site, so the language feels synthetic rather than conversational. => More everyday conversational material is needed.
- The news is mostly from the Balkans, so there are many Serbian / Macedonian person and city names, most of which are hard to pronounce. => Sentences related to Turkey (Turkish names, cities etc.) and life in Turkey are needed.
- Turkish has many (really many!) words that come from Ottoman / Arabic / Farsi, and we use them extensively in real life. The current data does not include them. => Speeches from older people, older books etc. will remedy this. Of course “dead words” should be eliminated (or replaced by current ones) during the process.
I looked into some possible resources like TBMM meeting minutes, laws etc., but they are mostly of no use (very long sentences, politically biased or incomplete sentences etc.). Still, we may find some usable material in this domain.
The main problem is the CC0 restriction. As “Creative Commons” is not very well known in Turkey, it is rarely used or specified in the sources. International and local laws & legislation dominate here. But like every other country, we also have the “public domain” concept.
A writer’s work enters the public domain 70 years after his/her death and can then be used. Unfortunately the language in these books tends to be rather old and we cannot use them as they are. Fortunately there are some writers who wrote in modern Turkish and whose works have recently entered the public domain.
To try the process first-hand, I worked on Sabahattin Ali’s “Kürk Mantolu Madonna” and added selected sentences to the sentence-collector (around 2400-2500). These need to be verified by other contributors, of course. That particular book is very good because it contains many natural conversations. I did have to replace some words with their newer equivalents (such as garp => batı, cenup => güney).
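For anyone doing the same with another book, the word replacement can be partly scripted. Below is a minimal sketch, not what I actually ran: the mapping holds only the two pairs mentioned above, and the names `REPLACEMENTS` and `modernize` are made up for illustration. It deliberately replaces whole words only, so inflected forms (garptan, cenuba etc.) stay untouched and have to be fixed by hand.

```python
import re

# Illustrative mapping: only the two pairs mentioned in the post.
# Extend it with your own old => new pairs.
REPLACEMENTS = {"garp": "batı", "cenup": "güney"}

# Match whole words only, so suffixed forms such as "garptan" are skipped.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)

def modernize(sentence):
    """Replace old words with modern ones, keeping a leading capital letter."""
    def swap(match):
        word = match.group(0)
        new = REPLACEMENTS[word.lower()]
        # Simple capitalization; note that Python's upper() does not follow
        # Turkish i/İ rules, so check replacements that start with "i".
        if word[0].isupper():
            new = new[0].upper() + new[1:]
        return new
    return PATTERN.sub(swap, sentence)
```

For example, `modernize("Rüzgar cenup tarafından esiyor.")` gives "Rüzgar güney tarafından esiyor.". The result still needs a read-through, since a replaced word can change how the rest of the sentence should be suffixed.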
Please use this thread to suggest new sources, discuss how to improve the Turkish dataset etc. What we need now:
- Add new sentences (https://commonvoice.mozilla.org/sentence-collector/#/add)
- Verify them (https://commonvoice.mozilla.org/sentence-collector/#/review)
- Recruit new volunteers (More women! More older people!) to record them
- More work on listening
Any idea/suggestion is welcome. I’ll update this post to include more information as needed.
Happy volunteering & AI’ing…
Comp. Eng. MSc / Museologist