Hello Konkanis!
I am trying to work out which Konkani script to use for the Mozilla Common Voice (MCV) website. The Konkani dataset itself will support all the other scripts too (multi-orthography), so people will still be able to speak and write in their script of choice. The MCV website, however, cannot be in multiple scripts.
The scripts currently used to write Konkani are: Devanagari, Romi, Kannada, Malayalam, and Perso-Arabic.
I invite all Konkani speakers to make their case for the script they would prefer for the website (please also mention where you are from and which dialect you speak).
I have made my own points below in an attempt to answer the following questions.
Main Questions:
Would Konkani in देवनागरी (Devanagari) script for the MCV website be easy to understand for Konkani literates in Karnataka (KA), Goa (GA), and Maharashtra (MAH)?
Given that most jobs require us to read and write in English, would Roman script be easier to understand for most Konkani speakers (including MAH, KA, GA)?
Can Kannada script be understood by GA and MAH speakers? (It's an obvious no, given that they use either Devanagari or Roman script to read and write.)
Can Konkani speakers from MAH and KA understand the words/vocabulary spoken in Goa?
Important note
Other scripts will still be supported in one of the databases. This is only a discussion about the website language.
Brief notes about the Konkani language
Konkani is mainly spoken in Goa, but Goans are not the only ones who speak it. On the map, there is a Konkan region: the northern part is in Maharashtra, the central part in Goa, and the southern part in Karnataka.
Konkani is officially recognised as an individual language as per the Language Census 1977 by the Govt. of India. It is not a dialect of Marathi.
Goan Konkani, Maharashtrian Konkani, and Karnataka Konkani are broad categories of the Konkani variants/dialects. But even within these states, there are differences in speaking and writing.
Within Goa, the Antruz variant is pushed as the standard in schools and colleges. But there are more variants/dialects classified under "Goan Konkani", such as Bardesi and Saxtti.
In Karnataka, the Konkani dialects are GSB (Gaud Saraswat Brahmin) and GSC (Gaud Saraswat Christian Brahmin).
Any effort to thrust one dialect as the standard, such as what is happening in Goa, will lead to the disintegration of the language, slowly but surely!
Meanwhile, Wikipedia, under "current status and issues", reads:
Konkani language has been in danger of dying out over the years, one of the reasons being the fragmentation of Konkani into various, sometimes mutually unintelligible, dialects.
While a website in Roman script would be easier to read for most young people, they might not know some conventions, such as "m" and "n" being used to mark nasalized vowels. Forcing Roman script on all Konkani speakers would deter contributors because they might not understand the words. Apart from nasalization, there is also vowel substitution in most words (a vowel written in Devanagari script is often changed to a different vowel in Roman script).
A Devanagari website would be easier to translate into other languages, but difficult to read due to the modifiers above and below the letters. Sometimes 2-3 letters are combined, which looks very complicated on screen and makes it really difficult to read. But then, Devanagari users can change their font size.
Konkani in Karnataka (Canara, Canarese)
The question papers (search Google for "konkani language karnataka question papers") for recent (2017-2024) final-year high school (10th standard) exams in the Konkani subject in the State of Karnataka are in both Devanagari and Kannada scripts.
However, as the Konkani syllabus/curriculum is prepared mainly in the Kannada script, this indicates that KA students are reading and writing mainly in that script.
Class 10 Hindi subject exams are conducted in Devanagari script (search Google), but Hindi is kept as the 3rd language, meaning students can choose to study some other language in its place. So some of them might not get their dose of Devanagari script learning.
The Kannada language (in its own script) is taught as either the 1st or 2nd language in schools up to 10th std. English is also either the 1st or 2nd language.
Conclusion: Speakers of this Konkani variant may not find it easy to understand websites in Devanagari.
Konkani in Maharashtra
There is no mention of "Konkani" as a subject for SSC and HSC schools on the Maharashtra board of education. Hence I haven't been able to retrieve exam question papers from previous years.
There is mention of a "Konkan Divisional Board" under the Maharashtra Education Board, which may be teaching Konkani in Ratnagiri and Sindhudurg districts. But I have not found a source for a single question paper. I tried contacting the Konkan Divisional Board by email, but "the address was not reachable".
They will presumably use Devanagari script to write Konkani, as Marathi is also written in Devanagari.
(Based on news reports) Speaking and studying Marathi is "mandatory by law without exception" in Maharashtrian schools.
Can speakers of this variant understand websites translated into "standard" Goan Konkani?
Maharashtrian Konkani sits in a gray area between being a dialect of Marathi and a dialect of Konkani.
Konkani in Goa
The question papers for Konkani subject exams in the 10th, 11th, and 12th standards (schools and pre-university) in the State of Goa are in Devanagari script ONLY (search Google for "konkani language goa board question papers"). (Only the 10th class Q.P.s for 2018 and 2019 are uploaded. I am not aware of any changes made to the Konkani subject after NEP 2020 was implemented.)
Konkani is the local language of the people of Goa.
Devanagari is given more attention when new words are formed. It is the official script for Konkani used in schools and govt. offices in the State of Goa.
Roman script is currently used mainly by Christians in bibles, select weeklies, magazines, and theatre-drama (tiatr).
Naturally, there is a large number of people in Goa who have studied Konkani in Devanagari script.
For example:
I vote in favor of Devanagari script for the Konkani website because:
Google can currently translate only Devanagari-script Konkani (देवनागरी लिपी कोंकणी) into other languages (English, Hindi, Kannada, Malayalam, etc.).
Most Konkani literates will understand at least one additional language other than Konkani.
We can continue reading the sentences in the Konkani script we are working with (any of the 5 scripts), even when translation is on. In my tests on Chrome and Firefox (Firefox with the "TWP Translate Web Pages" extension), everything is translated except the "sentence cards". This maintains the core functionality of the Common Voice website even when the user depends on translation software.
I just talked with a Konkani professor from Karnataka. They might really need the website to be in Kannada script.
That is because, for them, the Kannada language is taught from Standard/Grade 1, while Hindi (Devanagari script) is taught from Grade 5. In Karnataka, more preference is given to the Kannada script.
The other reason is the vocabulary. There are many Konkani words in Karnataka that are different from Goan Konkani.
This is an unfair choice. Konkani is probably the only language in the world to be currently written in five scripts. This diversity has to be taken into account; technology has to be adjustable.
I initiated the Konkani Wikipedia in Incubation sometime around 2006-2007. There we took a conscious decision to permit ALL scripts to function simultaneously. We managed (though still struggling) with three scripts (http://gom.wikipedia.org). We would have loved to work with the two smaller communities too if possible (Malayalam script and Perso-Arabic).
Please consider how you could do your best to accommodate all. That would really help.
FN/Frederick Noronha
+91-9822122436
@Frederick_Noronha
Unfortunately, like most websites, Mozilla's localisation system does not allow one "language code" to have website localisations (website language) in multiple scripts.
Website localisation is done as per the language code (ISO-639 spec). Most major languages have 1 language code. Konkani has 3: kok, gom, knn.
I think we should use these 3 to create separate locales for Romi, Kannada, and Devanagari scripts. Separating the localisation by script rather than by region (Maharashtra/Goa/Karnataka/Kerala) would be better, as Romi is largely based on the Bardezi dialect, Kannada on Mangluri, and Devanagari on Antruzi. We can always make room for mixing in vocabulary from other dialects to remain inclusive.
In the case of the other two scripts (Malayalam & Perso-Arabic), would it be better to establish new ISO-639 language codes for them, as Konkani currently has only 3?
IMHO, CV should adopt BCP-47 language codes instead of ISO-639-3 codes (actually it currently uses a mix of ISO-639-1, ISO-639-3, and BCP-47 codes, see this code).
The reason is that BCP-47 allows distinguishing a spoken language from its orthographic variations. For example, Azeri (az) can be written in either Cyrillic or Latin, and I think in Arabic script too (not sure).
The BCP-47 codes defined for Azeri are:
az - Azeri - irrespective of orthography or geographic variant
az-Cyrl - Azeri as written in Cyrillic, irrespective of geographic variant
az-Latn - Azeri as written in Latin, irrespective of geographic variant
az-Latn-AZ - Azeri as written in Latin, as spoken in Azerbaijan
That is, BCP-47 allows finer-grained representation of written and spoken language - including the ability to distinguish between multiple orthographies of the same spoken language.
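To make the structure of these tags concrete, here is a minimal Python sketch that splits a tag into language, script, and region. It only handles the simple `language[-Script][-REGION]` subset used in the list above; it is not a full BCP-47 parser, and `parse_tag` is a hypothetical helper name, not something from the Common Voice or Pontoon code.

```python
import re

# Simplified pattern: language, optional 4-letter script, optional region.
# This covers only a small subset of BCP-47 (no variants or extensions).
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?$"
)


def parse_tag(tag):
    """Split a simple language tag into its subtags (hypothetical helper)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


for tag in ["az", "az-Cyrl", "az-Latn", "az-Latn-AZ"]:
    print(tag, parse_tag(tag))
```

By the same pattern, the Konkani scripts could in principle be written as kok-Deva, kok-Latn, kok-Knda, kok-Mlym, and kok-Arab (these are the ISO 15924 script codes). Whether CV or Pontoon would accept such tags is a separate question.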
@kathyreid, the problem is that a single language with multiple scripts can only have a single frontend language, which is defined in Pontoon.
So, you can define az in Pontoon and as the dataset language in CV (they should be in parallel), and you can now have sentence & speech variants, like the ones you listed above (BCP-47). But you have to choose one (Cyrillic or Latin in this case) for the frontend.
@kathyreid, unfortunately what you say is not possible.
It is "technically" possible if you totally divide the dataset (e.g. into az-cyrl, az-latn), but AFAIK that is not desired, as they are variants, not languages. You can have the same "sounds" in both datasets due to transliteration, for example, and you should join them. The Konkani dataset would be divided into 5.
I'm currently helping the Circassian languages (ady, kbd), where most of the diaspora communities are in Turkey. I had to create a transliteration variant (e.g. ady-Latn-TR-t-ady-cyrl, Latin-Turkish alphabet) because very few can read Cyrillic here. But the frontend has to be Cyrillic, so they will click "blindly". We need to teach them with online courses (press this, then that, etc.), or, better, we start with the Turkish interface, and they can switch.
I have noticed that on Pontoon, the Romansh language (rm) has been set up with BCP-47 language tags for its standard variants: rm-sursilv and rm-vallader. Can't this be done for Konkani? In Karnataka, the Mangalorean variety of Konkani is popular. And since it is Karnataka, almost all of the Konkanis there write in the Kannada script.
But yes, @bozden, you're right, that doesn't make it a separate language.
Still, a significant number of Konkani speakers come from Karnataka, almost as many as from Goa. And they deserve the Konkani website to be in Kannada script as well.
Is there a one to one relationship between a language in Common Voice and a language in Pontoon? Such that to have Konkani in two scripts in Pontoon, it would need to be two separate languages in Common Voice?
Actually, now that there are 100+ languages, the newer languages are mainly ones learned as a second language (L2), where a national/native language is the L1.
Except for promoting the use of the language, there is no reason to have the UI in the dataset language. E.g. one could have an English UI and write/record/validate in any other language.
It is already like this in Spontaneous Speech, and I think separating them in the classic MCV will solve all these problems.
As @Frederick_Noronha has said, "technology has to be adjustable", I agree.
I have gone through https://gom.wikipedia.org and https://gom.wiktionary.org for Goan Konkani (gom). Their website translation (the frontend, not the content) is in either Romi (Latin) or Devanagari, or where possible, both; that is, some strings are translated in Romi and some in Devanagari. We could do the same for Common Voice by combining Latin and Devanagari scripts in the "gom" locale and freeing up the "knn" locale for Kannada script. From the user's point of view this is a workable solution, as none of the Goans understand Kannada script but can read Latin and Devanagari easily. The sentence cards (content to be read/written) can be given a script tag to identify them. Users can likewise be given an option to see only the script they prefer to record and validate.
This might, however, result in two separate databases for Konkani, which is not desirable for users of the resulting database. But I think combining all the scripts into a single database would be trivial.
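As a rough illustration of the script-tag idea, here is a minimal Python sketch that guesses the script of a sentence from Unicode character names and filters a single mixed list by the user's preferred script. The function names (`detect_script`, `cards_for`) and the sample strings are my own assumptions for the example; nothing here reflects how Common Voice actually stores or tags sentences.

```python
import unicodedata
from collections import Counter

# Unicode character-name prefixes mapped to ISO 15924 script codes.
SCRIPT_PREFIXES = {
    "DEVANAGARI": "Deva",
    "LATIN": "Latn",
    "KANNADA": "Knda",
    "MALAYALAM": "Mlym",
    "ARABIC": "Arab",
}


def detect_script(sentence):
    """Return the most frequent known script in the sentence, or None."""
    counts = Counter()
    for char in sentence:
        name = unicodedata.name(char, "")
        for prefix, code in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[code] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def cards_for(sentences, preferred_script):
    """Keep only the sentence cards written in the preferred script."""
    return [s for s in sentences if detect_script(s) == preferred_script]


# Illustrative sample strings only (the word "Konkani" in three scripts).
cards = ["कोंकणी", "Konknni", "ಕೊಂಕಣಿ"]
print([(card, detect_script(card)) for card in cards])
print(cards_for(cards, "Knda"))
```

With a tag like this stored next to each sentence, all scripts could sit in one database while each contributor only sees the cards they can read.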
What are your thoughts?
Is a change required in https://pontoon.mozilla.org (the tool used to translate Common Voice and other Mozilla products) so that each script has separate translations throughout? Please guide!
@Frederick_Noronha Would you like to discuss here further so that a solution can be reached?
@chasingdragonflies, I think the correct method would be how it is implemented in Spontaneous Speech: have the dataset and UI languages separate. I also re-voiced the idea in last month's AMA meeting, for classic MCV.
I don't think having one language with multiple scripts for the UI will be possible in the near future, due to the tight connection to Pontoon.
If one can separate the UI from the dataset language, then we can talk about having a UI in a script without having a separate dataset for that script (as it will exist as a script/sentence variant).
And for now, you should just pick the most common / meaningful script for UI.
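To show what separating the UI from the dataset language could look like, here is a small Python sketch. It is purely a thought model under my own assumptions: the class `ContributionSession` and its fields are hypothetical and do not correspond to any actual Common Voice or Pontoon structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContributionSession:
    """Hypothetical model: the UI locale is independent of the dataset."""

    ui_locale: str                          # language/script of buttons and menus
    dataset_language: str                   # language being recorded or validated
    sentence_script: Optional[str] = None   # ISO 15924 code of the preferred variant

    def sentence_variant(self) -> str:
        """Return a BCP-47-style tag for the sentences this user wants to see."""
        if self.sentence_script is None:
            return self.dataset_language
        return f"{self.dataset_language}-{self.sentence_script}"


# An English UI, contributing to Konkani, reading Kannada-script sentences.
session = ContributionSession(ui_locale="en", dataset_language="kok", sentence_script="Knda")
print(session.ui_locale, session.sentence_variant())  # en kok-Knda
```

The point of the sketch is only that the UI locale and the sentence script are separate fields, so one dataset can serve several scripts without needing one UI translation per script.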