Konkani and its Variants: Which script to pick for MCV website interface?

chasingdragonflies · September 10, 2024, 2:36am

Hello konkanis!
I am trying to reason out which konkani script to use for the MCV website. The Konkani dataset however will support all other scripts too (multi-orthography). People will still be able to speak/write in their script of choice. But the Mozilla Common Voice (MCV) website cannot be in multiple scripts.

The scripts being used to currently write Konkani are: Devanagari, Romi, Kannadi, Malayalam, and Perso-Arabic.

I invite all konkani speakers to make their points for which script they would prefer for using the website (also please mention where you are from and which dialect you speak).

I have made my own points below in an attempt to answer the following questions.

Main Questions:

Would Konkani in देवनागरी script for MCV website be easy to understand by konkani literates in Karnataka(KA), Goa (GA), Maharashtra (MAH)?
Given that most of the jobs require us to write and read in english, would Roman script be easier to understand by most speakers of konkani (including MAH, KA, GA)?
Can Kannadi script be understood by GA and MAH speakers? (It's an obvious no, given that they use either Devanagari or Roman script to read and write)
Can konkani speakers from MAH and KA understand the words/vocabulary spoken in Goa?

Important note

Other scripts will still be supported in one of the databases. This is only a discussion for website language.

Brief notes about the konkani language

Konkani is mainly spoken in Goa. But Goans are not the only ones who speak konkani. On the map, there is a konkan region. The northern part of konkan region is in Maharashtra, the central part in Goa, and the southern part in karnataka.
Konkani is officially recognised as an individual language as per the Language Census 1977 by the Govt. Of India. It is not a dialect of Marathi.
Goan Konkani, Maharashtrian Konkani and Karnataka Konkani are broad categories of the Konkani variants/dialects. But even inside these states, there are differences in speaking and writing.
Within Goa, the Antruz variant is pushed as the standard in schools and colleges. But there are more variants/dialects classified under “Goan Konkani” such as bardesi and saxtti.
In Karnataka, the konkani dialects are GSB (Gaude Saraswat Brahmin) and GSC (Gaude Saraswat Christian Brahmin).
Lobab on https://extraetc.wordpress.com, says:

Any effort of thrusting one dialect as the standard – such as what is happening in Goa, will lead into the disintegration of the language, slowly but surely!

While on wikipedia under “current status and issues”, it reads:

Konkani language has been in danger of dying out over the years, one of the reasons being the fragmentation of Konkani into various, sometimes mutually unintelligible, dialects.

While the website in roman script would be easier to read for most young people, they might not know some rules such as ‘m’ and ‘n’ being used for nasalized vowels. Forcing roman script on all konkani speakers would deter contributors because they might not understand the words. Apart from nasalization, there is also vowel substitution for most words (अ in devanagari script is changed to ऑ in roman script).
A Devanagari website would be easier to translate into other languages, but harder to read because of the modifiers above and below the letters. Sometimes 2-3 letters are combined into one cluster that looks very complicated on screen, making it really difficult to read. But then Devanagari users can change their font size.

Konkani in Karnataka (Canara, canarese)

The question papers for recent (2017-2024) final-year high school (10th standard) exams in the Konkani subject are in both Devanagari and Kannada scripts in the State of Karnataka (search google for konkani language karnataka question papers).
However, as the konkani syllabus/curriculum is prepared mainly in the kannada script, it indicates that KA students are reading & writing mainly in that script.
Class 10 Hindi Subject exams are conducted in devanagari script (search google), but it is kept as 3rd language. Meaning they can choose to study some other language in place of hindi. Which means some of them might not get their dose of devanagari script learning.
Kannada language (in same script) is taught as either the 1st or 2nd language for schools upto 10th std. English is also either 1st or 2nd language.
Conclusion: Speakers of this konkani variant may not find it easy to understand websites in devanagari.
KARNATAKA SCHOOL EXAMINATION AND ASSESSMENT BOARD website

Konkani in Maharashtra

There is no mention of “Konkani” as a subject for schools of SSC and HSC on maharashtra board of education. Hence I haven’t been able to retrieve exam question papers of previous years.
There is mention of a “Konkan Divisional Board” under the Maharashtra Education Board, which may be teaching Konkani in Ratnagiri and Sindhudurg districts. But I have not found a source for a single question paper. I tried contacting the Konkan Divisional Board by email, but “the address was not reachable”.
They obviously will use Devanagari script to write konkani, as marathi is also written in devanagari.
(Based on News) Speaking and Studying Marathi is “mandatory by law without exception” in Maharashtrian schools.
Can speakers of this variant understand websites translated in “standard” Goan Konkani?
Maharashtrian Konkani is in a gray area between being a dialect of Marathi and a dialect of Konkani.

Konkani in Goa

The Question Papers for konkani subject exams in 10th, 11th, 12th standard (schools and pre-university) are in devanagari script ONLY in the State of Goa (search google konkani language goa board question papers). (Only Q.P. of 10th class in years 2018 and 2019 are uploaded. I am not aware of any changes made to the konkani subject after NEP 2020 was implemented.)
10th std. Konkani Assessment Scheme prepared by Goa Board of Secondary Education is also written in devanagari script.
Konkani is the local language of the people of Goa.
Devanagari is given more attention during the formation of new words. It is the official writing script for konkani used in schools and govt. offices in the State of Goa.
Roman script is currently used mainly by christians in bibles, select weeklies, magazines and theatre-drama (tiatr).
Naturally, there is a large number of people in Goa who have studied konkani in devanagari script.
(Politics) No proposal to include Konkani written in Roman script in Goa’s official language Act: CM Sawant

Kerala Konkani

Survey on konkani in State of Kerala done in 1971
Kochi has konkani speakers? Which script do they use? Are they following a standard?

chasingdragonflies · September 13, 2024, 1:14pm

For example:
I vote in favor of devnagri script to be used for konkani website because:

The fact that google can currently only translate devanagari script konkani (देवनागरी लीपी कोंकणी) into other languages (english, hindi, kannada, malayalam, etc)
The fact that most konkani literates will understand at least 1 additional language other than konkani.
The fact that we can continue reading the sentences in the konkani script which we are working with (any of the 5 scripts), even when translation is on. After testing on chrome and firefox (firefox with “TWP Translate Web Pages” extension), everything is translated except the “sentence cards”. This maintains the core functionality of common voice website even when the user is dependent on translation software.

chasingdragonflies · September 26, 2024, 6:11pm

I talked just now with a konkani professor from karnataka. They might really need the website to be in kannada script.

One reason is that for them the Kannada language is taught from Standard/Grade 1, while Hindi (Devanagari script) is taught from grade 5. They give more preference to Kannada script in Karnataka.
The other reason is the vocabulary. There are many Konkani words in karnataka that are different from Goan konkani.

Frederick_Noronha · October 28, 2024, 10:20pm

This is an unfair choice. Konkani is probably the only language in the world to be currently written in five scripts. This diversity has to be taken into account; technology has to be adjustable.
I initiated the Konkani Wikipedia in Incubation sometime around 2006, 2007. We there took a conscious decision to permit ALL scripts to function simultaneously. We managed (though still struggling) with three scripts at http://gom.wikipedia.org and would have loved to work with the two smaller communities too if possible (Malayalam script and Perso-Arabic).
Please consider how you could do your best to accommodate all. That would really help.
FN/Frederick Noronha
+91-9822122436

chasingdragonflies · October 31, 2024, 6:35pm

@Frederick_Noronha
Unfortunately, like most websites, Mozilla’s localisation system does not support 1 ‘language code’ to have multiple-script website localisation (website language).

Website localisation is done as per the language code (ISO-639 spec). Most major languages have 1 language code. Konkani has 3: kok, gom, knn.

I think we should utilise these 3 to create separate locales for Romi, Kannada and Devanagari scripts. Separating the localisation based on script rather than on region (Maharashtra/Goa/Karnataka/Kerala) would be better, as Romi is largely based on the Bardezi dialect, Kannada on Mangluri and Devanagari on Antruzi. We can always make room for mixing in vocabulary from other dialects to remain inclusive.

In the case of the other two scripts (Malayalam & Perso-arabic), would it be better to establish new ISO-639 language codes for them as konkani currently has only 3?

kathyreid · November 1, 2024, 4:28am

Respectfully disagree.

IMHO, CV should adopt BCP-47 language codes instead of ISO-639-3 codes (actually it uses a mix of ISO-639-1, ISO-639-3 and BCP-47 codes currently, see this code).

The reason for this is that BCP-47 allows distinguishing a spoken language from its orthographic variations. For example, Azeri az can be written in either Cyrillic or Latin, and I think in Arabic script too (not sure).

The BCP-47 codes defined for Azeri are:

az - Azeri - irrespective of orthography or geographic variant
az-Cyrl - Azeri as written in Cyrillic, irrespective of geographic variant
az-Latn - Azeri as written in Latin, irrespective of geographic variant
az-Latn-AZ - Azeri as written in Latin, as spoken in Azerbaijan

That is, BCP-47 allows finer-grained representation of written and spoken language - including the ability to distinguish between multiple orthographies of the same spoken language.
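
To make the tag structure above concrete, here is a minimal sketch of splitting a tag into its language, script, region and variant parts. It is a simplified illustration, not a full BCP-47 parser: the names `LangTag` and `parse_tag` are made up for this example, and extlang subtags, extensions (anything after a single-letter subtag such as `t`) and private-use subtags are deliberately ignored.

```python
from typing import NamedTuple, Optional, Tuple


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


def parse_tag(tag: str) -> LangTag:
    """Split a tag like 'az-Latn-AZ' into its main subtags (simplified)."""
    parts = tag.split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    region: Optional[str] = None
    variants = []
    for part in parts[1:]:
        if len(part) == 1:
            # A single letter starts an extension (e.g. "t"); not handled here.
            break
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None and not variants:
            script = part.title()  # scripts are conventionally written "Latn"
        elif is_region and region is None and not variants:
            region = part.upper()  # regions are conventionally written "AZ"
        else:
            variants.append(part.lower())
    return LangTag(language, script, region, tuple(variants))


for example in ("az", "az-Cyrl", "az-Latn-AZ", "rm-sursilv"):
    print(example, "->", parse_tag(example))
```

The point of the sketch is only that the spoken language (`az`) stays the same first subtag while the script and region vary after it.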

bozden · November 1, 2024, 4:50am

@kathyreid, the problem is a single language with multiple scripts can only have a single frontend language, which is defined in Pontoon.

So, you can define the az in Pontoon & as dataset language in CV (they should be in parallel), and you can now have sentence & speech variants, like the others you listed above (bcp-47). But you have to choose one (Cyrillic or Latin in this case) for the frontend.

kathyreid · November 1, 2024, 4:52am

Ah, now I understand. Is it possible to have multiple front-ends, one for each orthography, that are connected to the same CV dataset?

bozden · November 1, 2024, 4:30pm

@kathyreid, unfortunately what you say is not possible.

It is “technically” possible if you totally divide the dataset (e.g. into az-cyrl, az-latn), but AFAIK it is not desired as they are variants, not languages. You can have the same “sounds” in both datasets due to transliteration for example, and you should join them. The Konkani dataset would be divided into 5.

I’m currently helping the Circassian languages (ady, kbd), where most diaspora communities are in Turkey. I had to create a transliteration variant (e.g. ady-Latn-TR-t-ady-cyrl - Latin-Turkish alphabet) because very few can read Cyrillic here. But the frontend should be Cyrillic. They will click “blindly”. We need to teach them with online courses - press that, then that etc, or better, we started with the Turkish interface - they can switch.

Sorry to hijack the thread @chasingdragonflies …

chasingdragonflies · November 5, 2024, 7:41pm

I have noticed that on Pontoon, the Romansh language (rm) has been set up with BCP-47 language tags of their standard variants: rm-sursilv and rm-vallader. Can’t this be done for Konkani? In Karnataka, the manglorean variety of konkani is popular. And since it is karnataka, almost all of the konkanis there write in the kannada script.

But yes, @bozden you’re right, that doesn’t make it a separate language.

Still, a significant number of Konkani speakers come from Karnataka, almost as many as from Goa. And they deserve the Konkani website to be in Kannada script as well.

kathyreid · December 4, 2024, 3:03am

Is there a one to one relationship between a language in Common Voice and a language in Pontoon? Such that to have Konkani in two scripts in Pontoon, it would need to be two separate languages in Common Voice?

bozden · December 4, 2024, 2:01pm

Yes it would.

Actually, now that there are 100+ languages, the newer languages are mainly learned ones (L2), where a national/native language becomes L1.

Except promoting the use of the language, there is no reason to have the UI in the dataset language. E.g. one could be able to have English UI and write/record/validate in any other language.

It is already like this in Spontaneous Speech, and I think separating them in the classic MCV will solve all these problems.
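
A rough sketch of what that separation could look like as a data model, purely as an illustration: `ContributionSession` and `pick_ui_locale` are hypothetical names, not Common Voice or Pontoon APIs, and the list of available UI locales is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ContributionSession:
    ui_locale: str                          # language of buttons and menus
    dataset_locale: str                     # language being recorded/validated
    sentence_variant: Optional[str] = None  # e.g. a script variant of the dataset


def pick_ui_locale(
    preferred: Sequence[str], available: Sequence[str], default: str = "en"
) -> str:
    """Return the first preferred UI locale that exists, else the default."""
    for code in preferred:
        if code in available:
            return code
    return default


# Hypothetical: a Konkani speaker from Karnataka who reads Kannada and English.
available_ui = ["en", "hi", "kn", "mr"]  # assumed list, for illustration only
session = ContributionSession(
    ui_locale=pick_ui_locale(["kn", "en"], available_ui),
    dataset_locale="gom",
    sentence_variant="gom-Knda",
)
print(session)
```

With the two fields decoupled, the choice of one UI script no longer constrains which script a contributor reads or records in.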

chasingdragonflies · February 4, 2025, 9:58am

@bozden @kathyreid

As @Frederick_Noronha has said, “technology has to be adjustable”, I agree.

I have gone through https://gom.wikipedia.org and https://gom.wiktionary.org for Goan Konkani (gom). Their interface translation (the frontend, not the content) is in either Romi (Latin) or Devanagari or, where possible, both: some strings are translated in Romi and some in Devanagari. We could do the same for Common Voice by combining Latin and Devanagari scripts in the “gom” locale and freeing up the “knn” locale for Kannada script. From the user’s point of view this is a workable solution, as none of the Goans understand Kannada script but they can read Latin and Devanagari easily. The sentence cards (content to be read/written) can be given a script tag to identify them. Users can then also be given an option to see only the script they prefer when recording and validating.

This might, however, result in two separate databases for Konkani, which is not desirable for the users of the resulting database. But I think combining all the scripts into a single database would be trivial.
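
As a minimal sketch of the script tag idea, the script of a sentence can be guessed from the Unicode names of its characters and then used to filter what a user sees. This is only an illustration using Python's standard `unicodedata` module: the function names are made up, the four-letter labels are ISO 15924 script codes, and a majority vote like this will mislabel sentences that mix scripts.

```python
import unicodedata
from collections import Counter
from typing import List, Optional

# First word of the Unicode character name -> ISO 15924 script code.
SCRIPT_BY_NAME_PREFIX = {
    "DEVANAGARI": "Deva",
    "KANNADA": "Knda",
    "LATIN": "Latn",
    "MALAYALAM": "Mlym",
    "ARABIC": "Arab",
}


def detect_script(sentence: str) -> Optional[str]:
    """Return the most frequent known script in the sentence, or None."""
    counts: Counter = Counter()
    for ch in sentence:
        # Spaces, digits and punctuation have names outside the table above.
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = SCRIPT_BY_NAME_PREFIX.get(prefix)
        if script is not None:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def filter_by_script(sentences: List[str], wanted: str) -> List[str]:
    """Keep only the sentences whose detected script matches `wanted`."""
    return [s for s in sentences if detect_script(s) == wanted]


cards = ["कोंकणी भास", "Konknni bhas", "ಕೊಂಕಣಿ"]
for card in cards:
    print(detect_script(card), card)
```

The same tag would also let a merged database keep all scripts together while still being split by script on export.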

What are your thoughts?

Is a change in https://pontoon.mozilla.org (the tool that translates Common Voice and other mozilla products) required such that each script has separated translations throughout? Please guide!

@Frederick_Noronha Would you like to discuss here further so that a solution can be reached?

bozden · February 4, 2025, 1:28pm

@chasingdragonflies, I think the correct method would be how it is implemented in Spontaneous Speech: Have the dataset and UI languages separate. I re-voiced the idea also in last month’s AMA meeting - for classic MCV.

I don’t think having one language with multiple scripts for UI will be possible in the near future due to tight connection to Pontoon.

If one can separate UI from dataset language, then we can talk about having a UI in a script but not having a dataset for that script (as it will be as a script/sentence-variant).

And for now, you should just pick the most common / meaningful script for UI.

chasingdragonflies · February 16, 2025, 12:33pm

Yes. That would be great. Also I checked with pontoon’s devs; they say it won’t cause any problem if the language codes are gom-knda for kannada script, gom-devn for devanagari, and so on as there are en-GB and en-US regional codes.
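
For reference, a small sketch of how such script-based codes could be generated. One caveat: in BCP-47 the script subtag comes from ISO 15924, where the codes are `Knda` for Kannada and `Deva` (not `devn`) for Devanagari, and `en-GB`/`en-US` use region subtags rather than script subtags. Using `gom` as the base for all five scripts is an assumption taken from the post above, and `script_locale` is a made-up helper, not a Pontoon function.

```python
# ISO 15924 script codes for the five scripts used to write Konkani.
KONKANI_SCRIPTS = {
    "Devanagari": "Deva",
    "Kannada": "Knda",
    "Romi (Latin)": "Latn",
    "Malayalam": "Mlym",
    "Perso-Arabic": "Arab",
}


def script_locale(language: str, script_code: str) -> str:
    """Build a 'language-Script' tag with conventional casing, e.g. gom-Knda."""
    return f"{language.lower()}-{script_code.title()}"


# Assumption: 'gom' is used as the base language code for every script.
for script_name, code in KONKANI_SCRIPTS.items():
    print(f"{script_name}: {script_locale('gom', code)}")
```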

The only thing left is konkani community support to maintain the translations of each of the konkani scripts in pontoon and also moderation? of the sentences collected in MCV.

Maybe that’s the reason folks at mozilla are hesitating to activate the language in all of its scripts.
