Turkish corpus - Main thread / Türkçe külliyat (cümleler) - Ana ileti

Dear contributors,

Given that the original Turkish corpus is limited, we need to add more sentences. The process is well defined on the related micro-site. In summary:

  1. Find CC0 sources (public domain material)
  2. Extract, review and filter sentences on your computer
  3. Post them to the sentence-collector
  4. Two other people review them and accept or reject each one
  5. Accepted sentences are compiled and added to the main corpus automatically (bi-weekly)

After that, the new sentences will show up when we record our voices.

After reviewing 1K+ recordings I recognized the following pitfalls and suggest some remedies:

  1. The base data is from a news site, so the language is rather formal and artificial. => More everyday conversational material is needed.
  2. The news is mostly from the Balkans, so there are many Serbian / Macedonian personal and city names, most of which are hard to pronounce. => Sentences related to Turkey (Turkish names, cities etc.) and life in Turkey are needed.
  3. Turkish has many (really many!) words coming from Ottoman / Arabic / Farsi, and we use them extensively in real life. The current data does not include them. => Speech from older people, older books etc. will remedy this. Of course “dead words” should be eliminated (or replaced by current ones) during the process.

I researched some possible resources like TBMM meeting minutes, laws etc., but they are mostly of no use (very long sentences, politically biased or incomplete sentences etc.). Still, we may find some usable material in this domain.

The main problem is the CC0 requirement. As “Creative Commons” is not very well known in Turkey, it is not used or specified in the sources. International and local laws and legislation dominate here. But like every other country, we also have the concept of “public domain”.

A writer’s work enters the public domain 70 years after his/her death and can then be used. Unfortunately the language in these books tends to be rather old, so we cannot use them as they are. Fortunately there are some writers who wrote in modern Turkish and whose works have recently entered the public domain.

To try the process myself, I worked on Sabahattin Ali’s “Kürk Mantolu Madonna” and added selected sentences to the sentence-collector (around 2400-2500). These need to be verified by other contributors, of course. That particular book is very good because it contains many natural conversations. I had to replace some words with their newer equivalents (such as garp => batı, cenup => güney), of course.

Please use this thread to suggest new sources, discuss how to improve the Turkish dataset etc. What we need now:

  1. Add new sentences (https://commonvoice.mozilla.org/sentence-collector/#/add)
  2. Verify them (https://commonvoice.mozilla.org/sentence-collector/#/review)
  3. Recruit new volunteers (More women! More older people!) to record them
  4. More work on listening

Any idea/suggestion is welcome. I’ll update this post to include more information as needed.

Happy volunteering & AI’ing…
Bülent Özden
Comp. Eng. MSc / Museologist


Thanks for this post! I would like to second the use of Sabahattin Ali: his work is both good literature and public domain. Essentially his entire bibliography could be included. I am happy to help out with text processing (sentence splitting etc.) if you can find the plain text.

We are out…


@ftyers Are you using this script? Did you try it on Turkish? Does it work well?

There are problems with this. Most of these are OCR’d books and OCR is not 100% perfect. You may not mind this during casual reading, but an AI would. Common issues are:

  • Wrong/missing punctuation (scripts fail in this case)
  • Misrecognized characters, such as “ın” becoming “m”
  • OCR => PDF => text conversion makes each line end with CR/LF
  • etc.

Other non-OCR issues include:

  • Long sentences with sub-sentences, like “Birden kafasını kaldırarak ‘Sen de mi gelemiyorsun?’ diye sordu” (think of even longer sentences). In this case I extract only the ‘Sen de mi gelemiyorsun?’ part, which is more conversational.
  • Old/dead words should be “translated” to current language
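
For the quoted-speech case, a minimal Python sketch follows. It assumes the text uses typographic quotes (‘ ’ or “ ”) and that the quoted speech ends with . ! or ?; the names `QUOTED_SPEECH` and `extract_quotes` are only illustrative, and the output still needs a human read-through.

```python
import re

# Assumption: speech is wrapped in curly quotes and ends with . ! or ?
# Requiring the punctuation right before the closing quote keeps an
# apostrophe such as the one in "Ali’nin" from ending the match early.
QUOTED_SPEECH = re.compile(r"[‘“](.+?[.!?])[’”]")

def extract_quotes(sentence):
    """Return the quoted speech fragments found in one sentence."""
    return [m.strip() for m in QUOTED_SPEECH.findall(sentence)]
```

Sentences without any quoted part simply yield an empty list, so they can be left in the normal review queue.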

My methodology is:

  • Find PDF, save as text
  • Using a text editor (I use Notepad++), remove CR/LF, consecutive spaces etc. This will give you one huge paragraph.
  • Then I add CR/LF after each sentence-ending punctuation mark ( . ! ? ) and scan the document as a whole
  • I import it into an Excel column
  • Add columns/formulae for text length (I selected 100 chars max) and word count (forced 14 words max). Another column flags whether the row is a “prospect” (0/1). There are also “Decision” and “Batch number” columns. Then I fix the cell values.
  • Then read all sentences/correct/divide/re-translate/reject them if necessary and fill the “decision” column.
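
The mechanical part of these steps (not the reading and correcting) could be sketched in Python roughly as below. The 100-character and 14-word limits are the ones mentioned above; the function names are only illustrative, and the splitting rule is a simplification that will stumble over abbreviations and OCR punctuation errors.

```python
import re

MAX_CHARS = 100  # text length limit from the list above
MAX_WORDS = 14   # word count limit from the list above

def split_sentences(raw_text):
    """Flatten the text, then break it after . ! or ? followed by whitespace."""
    # Collapse CR/LF and repeated spaces into single spaces (one huge paragraph).
    flat = re.sub(r"\s+", " ", raw_text).strip()
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]

def is_prospect(sentence):
    """Plays the role of the 0/1 'prospect' column."""
    return len(sentence) <= MAX_CHARS and len(sentence.split()) <= MAX_WORDS

def prospects(raw_text):
    """Return only the sentences that pass both limits."""
    return [s for s in split_sentences(raw_text) if is_prospect(s)]
```

Each returned sentence would still go through the manual read/correct/reject pass before upload.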

This takes some time, but I could not find a faster method. “Kürk Mantolu Madonna” took 5-6 hours of speed reading and resulted in 2400-2500 sentences, but my classmates still reported around 25-30 leftover mistakes. I had to correct and re-enter them… The most time-consuming part is the corrections, of course.

I found OCR’d scans of all books from Sabahattin Ali and Orhan Veli Kanık today. I’ll work on them after-hours…

I’m open to suggestions for improving the process, but as far as I can see it cannot be fully automatic.

Francis, how can I send you the files?

It should be possible to improve at least part of that process using Unix scripting tools. Can you send the original PDF files and the text as a zip file to my email address so I can have a look? You can find my email address on my personal page (I don't post it here, to avoid spam harvesting).

"Bütün Öyküleri - 1" by Sabahattin Ali.
I harvested the text, proofread it (~225 pages), translated dead words and started to upload. There are around 3800 prospective sentences. It may take a while as I re-check them and upload in batches.

Please visit the sentence collector and review them. If there is something I missed (typos, incomplete sentences etc.), just reject them (hit “thumbs down”). They will be reported to me so that I can correct and re-enter them.

But beware: the text (like much other literature) contains written forms of spoken language, such as “candarma” instead of “jandarma” or “gidivedim” instead of “gidiverdim”. These should be fed to the AI so that it can also learn local dialects. Please do not reject them.

Please hit here to review:

EDIT: Finished. Added about 3930 sentences…

@ftyers was kind enough to supply me with short phrases extracted from subtitles, with frequency >100 - some 63,689 lines :upside_down_face:

As a first batch I took those with f >= 1000 and cleaned them in two scans. Mostly names got removed, including Gandalf and Frodo :slight_smile:

The set contains some slang and everyday swear words (which are perfectly fine by the guidelines), sorry for that… There are also alternative spellings of the same phrases, which are needed as well, such as “Aman tanrım!”, “Aman Tanrım…”. Different writing styles and emphases are good for the dataset…

I added 5108 phrases/sentences (duplicates removed) to the Sentence Collector. I think it is safe to just hit the thumbs-up button on these.
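
The frequency cut and duplicate removal could be sketched in Python roughly like this. The file layout is an assumption (one phrase per line, followed by a tab and its count), and `select_frequent` is a hypothetical name. Only exact duplicates are dropped, so variants like “Aman tanrım!” and “Aman Tanrım…” both survive; removing names remains a manual step.

```python
def select_frequent(lines, min_freq=1000):
    """Keep phrases with frequency >= min_freq, dropping exact duplicates."""
    seen = set()
    kept = []
    for line in lines:
        # Assumed layout: "<phrase>\t<count>"; split at the last tab.
        phrase, sep, count = line.rstrip("\r\n").rpartition("\t")
        phrase = phrase.strip()
        if not sep or not phrase or not count.strip().isdigit():
            continue  # skip lines that do not match the assumed layout
        if int(count) >= min_freq and phrase not in seen:
            seen.add(phrase)
            kept.append(phrase)
    return kept
```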

Ahaha, I’m going through right now, this is excellent, it’s like every Turkish conversation I’ve ever heard. “öyle mi?”, “hadi gidelim”, “nasil yani?” :smiley: And if I had a lira for every time I heard “tamam mı?” :smiley:

