Requesting the Cantonese language (yue)

We’re the Cantonese Computational Linguistics Infrastructure Development
Workgroup (CanCLID), a team of volunteers from Guangdong, Guangxi, Hong
Kong, Macau, and the United States. We want to add the Cantonese language to Common Voice.

  1. Which language code should be used?
    yue

  2. Which script should be used?
    Both Hant (Han, Traditional) and Hans (Han, Simplified) can be used with Cantonese. However, we recommend adding Hant first because our volunteers are more familiar with it and more proficient in it.

Hans can usually be generated by mechanical transliteration from Hant. If necessary, we can provide manually checked conversions.
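To illustrate what "mechanical transliteration" means here, a per-character lookup is roughly all it takes in the simple case. This is only a sketch: the five-entry table is a hypothetical toy sample, and `to_hans` is a name made up for the example. A real conversion would need a complete table plus handling for the exceptional characters, which is where the manual checking comes in.

```python
# Toy Hant -> Hans table (illustrative sample only, far from complete).
HANT_TO_HANS = {
    "廣": "广",
    "東": "东",
    "話": "话",
    "粵": "粤",
    "語": "语",
}

# str.translate works on code points, so build the table once up front.
_TABLE = str.maketrans(HANT_TO_HANS)


def to_hans(text: str) -> str:
    """Replace each Traditional character found in the table; leave the rest untouched."""
    return text.translate(_TABLE)


print(to_hans("廣東話"))
```

Characters that are not in the table pass through unchanged, so vernacular-only characters such as 喺 or 嘅 stay as they are.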

An issue was also opened here: https://github.com/mozilla/common-voice/issues/2926.


Common Voice has already been collecting Cantonese data for a while under the code zh-HK (https://commonvoice.mozilla.org/zh-HK/speak). There are 60 hours of data available (37 validated) in this past summer’s data release. I believe the prompts I’ve seen are all in Traditional characters.

I’m sure the project would greatly benefit from your team’s contributions to speaking, verifying, contributing sentences, and spreading the word to more Cantonese speakers!

I would like to make the suggestion that all contributors of speech should always ensure that they are logged in, with one voice per account and one account per voice (this is the case for all languages). The website currently seems to allow anonymous contributions, which are not attached to a speaker ID and are much less useful for several applications.

Yes, I’m aware of the zh-HK corpus, but a lot of Cantonese speakers do not reside in Hong Kong, and tagging the language under a specific region is off-putting at best and downright offensive at worst to Cantonese speakers outside Hong Kong. Using zh-HK for Cantonese is also like calling Portuguese “Romance-Brazil”.

Also, Cantonese as used in Hong Kong consists of two strata - a literary one which is based on Mandarin Chinese grammar, and a vernacular one which has its own grammar rules and its own set of particles. The current zh-HK locale doesn’t distinguish between them. It’s the equivalent of mixing Bokmål and Nynorsk into a single language.

We’re asking for a new “yue” language code on Common Voice where Cantonese can be recorded and categorized correctly. This is the code that’s recommended by BCP-47 and understood by the ICU. We can migrate the zh-HK data into this set later if necessary.
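For readers less familiar with BCP-47, the difference between the two tags is easier to see once they are split into subtags. The sketch below is a simplified parser written for this post: it only handles the `language[-script][-region]` shape, ignores variants, extensions, and private-use subtags, and `parse_tag` is a hypothetical helper, not part of ICU or any Mozilla codebase.

```python
import re

# Simplified shape only: language, optional 4-letter script, optional region.
# Full BCP-47 tags can carry more subtags; this sketch deliberately ignores them.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?$"
)


def parse_tag(tag: str) -> dict:
    """Split a simple language tag into language, script, and region subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag shape: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


for tag in ("zh-HK", "yue", "yue-Hant", "yue-Hant-HK"):
    print(tag, parse_tag(tag))
```

In `zh-HK` the language subtag is the macrolanguage `zh` and the only thing pointing towards Cantonese is the region `HK`. In `yue` or `yue-Hant` the language itself is named, and a region can still be added when it matters.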

Note that there are also other languages under the macrolanguage zh. There will likely be requests for adding other languages like nan and hak. Lumping everything under zh is the equivalent of using the same language code for both French and Spanish (both are Romance languages which share the same roots).

I’m aware of the language situation. I don’t know how the decision was made to use the zh-HK designation on Common Voice, especially since it is not an ISO 639-2 code. I agree that the yue code would be more appropriate. However, Mozilla currently seems to have very little bandwidth for Common Voice work, and I don’t know whom to contact about this change.

Cantonese as used in Hong Kong consists of two strata - a literary one which is based on Mandarin Chinese grammar, and a vernacular one which has its own grammar rules and its own set of particles. The current zh-HK locale doesn’t distinguish between them.

Where does the request for the additional yue code fit into this? With two different datasets, zh-HK and yue, there would still be no distinction between literary and vernacular, not to mention regional varieties of Yue like Taishanese. What would be different in the yue dataset, besides the name? If the only difference between the two datasets would be political, then it strikes me as unproductive to confuse the issues.

We currently intend the yue dataset to be vernacular only. How the ‘literary’ version should be tagged is unknown to us, and also not something our team can invest in at this point. There are tons of (unfortunately proprietary) data sets of Cantonese speakers reciting paragraphs written in the ‘literary’ stratum, but the corpus for the vernacular stratum is extremely underdeveloped.

Our team does not have people fluent in Taishanese. Linguistically speaking Taishanese is a dialect of Yue, but culturally speaking Taishanese is often regarded as separate from the Guangzhou Yue, which is deemed as the standard form of Yue.

I think that classification of the Sze Yup dialects would be better reserved for those speakers.

The guide at 📖 Readme: How to see my language on Common Voice pointed me here, and I believe the yue tag is much more in line with the strategy outlined in that document.

The project sounds very interesting to me, but I’m still concerned about the relationship with the existing dataset. Have you looked at the text prompts in the zh-HK dataset? I believe the prompts are primarily vernacular Cantonese, even if some of them have Mandarin-influenced grammar or vocabulary. Here are some samples taken from the released dataset:

我就成日喺到諗嘞
所以我要取消佢哋外國護照嘅重發權
唔該咪成日喺度煲電話粥
你哋快啲啦!二嫂佢哋而家喺康城站等緊你哋
何老師問現正在銅鑼灣銅鑼灣道勸學生早啲返屋企
我小學個老師住喺何文田加多利軒

Is this different from the sort of language you are proposing for the new dataset? I believe the Mozilla policy is that each language corresponds to a 639-2 code, and regional varieties can be added to a list to be selected from by users. I would suggest changing the existing zh-HK code to yue. As I’m not part of the Mozilla team, I won’t bother you about the issue any more if you want to let them take it up. Please just be aware that they will likely be very slow to respond, yet I think your organization’s contributions would be great for the project!

Thanks for your suggestions! Yes, we checked out the existing corpus on GitHub, and we think the constructions are generally not very vernacular.

“何老師問現正在” isn’t vernacular; the natural vernacular form would be “何老師問而家喺”. I’ve also noticed that at least thirty sentences in the sentence collector text file start with the same prefix; those sentences do not sound like they occur naturally and seem to have been injected by an unknown source.

Here is also a random chunk from the sentence collector text file, none of which are vernacular either:

可見他也同謀,
可見心思是同從前一樣狠。
可讓您提交的數據更加豐富
可軟化血管和降血糖
可輕鬆換取,一齊密密玩密密賺
可選以下口味: 綠茶, 柚子, 芝麻, 紅豆
可選擇坐非常可愛的動物巴士
台上的少女同聖誕老人

Thanks for the explanations. My Cantonese teacher taught us 而家 but often used 正在 and other Mandarin words as well, so I don’t have a very clear picture of when or where these switches occur. It doesn’t sound to me like a different language, but perhaps you’ll get some response from Mozilla.

Unfortunately, many of the languages in Common Voice are plagued by these prompts, which appear to be automatically generated. Perhaps some of the sentences which you object to could be removed from zh-HK as well. I have commented on this problem here in the past, but now that Mozilla is spending minimal resources on this project, it seems unlikely to be fixed. I think this will be problematic for training some types of ASR and text-to-speech models.

For other applications like collecting a corpus for linguistic analysis, I don’t think Common Voice in its current form will be a viable platform, although note that the website source code is open source and I’ve seen it used on other projects. If you want to have control over what sentences (text prompts) become part of the project, this will be impossible on Common Voice, as anyone can add and verify sentences. It seems like your guidelines for 粵語 sentence contributions would have to be much stricter than for the other languages, although I would be more than happy to see improvements in these guidelines and checks.

Contributors on the zh-HK team are aware of the issue. We mostly agree with and support adding yue as a new language on Common Voice, leaving the current “zh-hk” for Cantonese in Hong Kong specifically.

One reason is that Cantonese has changed a lot in Hong Kong. It’s not just a dialect or accent problem. The difference also exists in vocabulary and grammar system, influenced by decades of cultural differences.

It looks like a problematic route to try overcoming the difference between Hong Kong Cantonese & Cantonese from other places in one set of data, and it is also politically impracticable. It could unnecessarily consume our local efforts on contributing to Common Voice.

Irvin


Please allow me to give my two cents here. What we are proposing is to add Cantonese as an independent language. The essential issue is that Mozilla is currently not treating a language as a language, but rather as a dialect or variant. Mozilla is mistaking a macro-language -> language relationship for a language -> sub-dialect relationship. We should be clear that Chinese is NOT a language, but rather a macro-language or a language family. Cantonese (ISO 639-3 code yue) and other Sinitic languages such as Hakka (hak) or Southern Min (nan) are independent languages within this family. At the moment Mozilla uses ISO 639-2 codes for locales, which only has zh as a language code, thus inevitably cramming multiple independent languages into one single locale. So imho, maybe the whole localization structure of Mozilla needs an overhaul: use ISO 639-3 instead of 639-2 codes, otherwise we will face this same problem when we want to add more Sinitic languages (Hakka, Southern Min) in the future.
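To make the macro-language -> language relationship concrete, here is a small sketch of the lookup a locale system would need. The member list is deliberately partial (ISO 639-3 lists more members under Chinese than the four shown), and `MACROLANGUAGE_MEMBERS` and `macrolanguage_of` are hypothetical names for this example, not anything taken from Mozilla's code.

```python
from typing import Optional

# Partial, illustrative mapping from a macrolanguage code to some of its
# ISO 639-3 member languages. The real registry lists more members for zh.
MACROLANGUAGE_MEMBERS = {
    "zh": {"cmn", "yue", "hak", "nan"},
}


def macrolanguage_of(code: str) -> Optional[str]:
    """Return the macrolanguage a code belongs to, or None if it is not listed."""
    for macro, members in MACROLANGUAGE_MEMBERS.items():
        if code in members:
            return macro
    return None


print(macrolanguage_of("yue"))
print(macrolanguage_of("fr"))
```

With a table like this, yue, hak, and nan each get their own dataset while still being recognisable as members of zh, instead of being told apart only by a region suffix.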

Since we have the support of the MozTW community liaison and the understanding of the zh-HK team, is there anything we can do to get the language added to Pontoon for Common Voice?

I need to point out that

It’s not just a dialect or accent problem. The difference also exists in vocabulary and grammar system, influenced by decades of cultural differences.

is clearly not true. Hong Kong Cantonese has only very minor differences from Canton Cantonese (廣州音); linguistically they are still the same dialect (粵語廣府片). Also,

try overcoming the difference between Hong Kong Cantonese & Cantonese from other places

is clearly not the case either. As pointed out above, they are literally the same language and the same dialect; the difference between Hong Kong and Canton Cantonese is even smaller than that between US and UK English. And this is exactly why we should amend the original locale structure and add the yue locale to avoid further confusion.

not to mention regional varieties of Yue like Taishanese.

Mentioning other varieties of Yue like Toishanese is pointless, just like we never think about Southwestern Mandarin in the zh-CN locale. When we talk about Cantonese, we are referring to the standard Cantonese, just like we are referring to Modern Standard Mandarin when we talk about Mandarin.

The root cause of the dilemma here is that the whole Mozilla locale structure is flawed in the first place. Using the ISO 639-2 codes is a mistake, because it wrongly classifies Chinese as a language when it is actually a macro-language, or a language family. That’s why we can’t find a place for the Cantonese language without using the 639-3 codes. Chinese is equivalent to the term Romance. Romance is a language family: Spanish is Romance, French is Romance, but you can’t say somebody speaks a language called “Romance”. Like @hfhchan said, it is highly recommended to use the BCP-47 codes, which classify languages better.

I would suggest changing the existing zh-HK code to yue.

This is actually our biggest concern. Will Mozilla permit such a change?

Irvin Chen already stated some points which the Mozilla Hong Kong community also agrees with. Please leave the current “zh-hk” for Cantonese in Hong Kong.

This week I also learned from my trusted network that debates about Yue and Cantonese in Chinese forums and groups in the language community never end. I think those debates should stay in those communities.