Requesting the Cantonese language (yue)

We’re the Cantonese Computational Linguistics Infrastructure Development
Workgroup (CanCLID), a team of volunteers from Guangdong, Guangxi, Hong
Kong, Macau, and the United States. We want to add the Cantonese language to Common Voice.

  1. Which language code should be used?
    yue

  2. Which script should be used?
    Both Hant (Han, Traditional) and Hans (Han, Simplified) can be used with Cantonese. However, we recommend adding Hant first because our volunteers are more familiar with and proficient in Hant.

Hans can usually be generated by mechanical transliteration from Hant. If necessary, we can provide manually checked conversions.
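To illustrate what "mechanical transliteration" means here, below is a minimal sketch in Python. The five-entry table and the `to_hans` helper are hypothetical and for illustration only; a real conversion needs a full character table plus phrase-level rules, for which a maintained converter such as OpenCC would be one option (a suggestion, not something specified in this request).

```python
# Toy Traditional-to-Simplified table (hypothetical, illustration only).
# A real converter needs thousands of entries and phrase-level rules.
HANT_TO_HANS = str.maketrans({
    "廣": "广",
    "東": "东",
    "話": "话",
    "語": "语",
    "學": "学",
})


def to_hans(text: str) -> str:
    """Replace each character found in the table; leave all others unchanged."""
    return text.translate(HANT_TO_HANS)


print(to_hans("廣東話"))
```

Characters missing from the table pass through untouched, which is why manually checked conversions remain useful for anything beyond a one-to-one character mapping.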

An issue was also opened here: https://github.com/mozilla/common-voice/issues/2926.

Common Voice has already been collecting Cantonese data for a while under the code zh-HK: https://commonvoice.mozilla.org/zh-HK/speak. There are 60 hours of data available (37 validated) in this past summer’s data release. I believe the prompts I’ve seen are all in Traditional characters.

I’m sure the project would greatly benefit from your team’s contributions to speaking, verifying, contributing sentences, and spreading the word to more Cantonese speakers!

I would like to make the suggestion that all contributors of speech should always ensure that they are logged in, with one voice per account and one account per voice (this is the case for all languages). The website currently seems to allow anonymous contributions, which are not attached to a speaker ID and are much less useful for several applications.

Yes, I’m aware of the zh-HK corpus, but a lot of Cantonese speakers do not reside in Hong Kong, and tagging the language under a specific region is off-putting at best and downright offensive at worst to Cantonese speakers outside Hong Kong. Using zh-HK for Cantonese is also like calling Portuguese “Romance-Brazil”.

Also, Cantonese as used in Hong Kong consists of two strata: a literary one, which is based on Mandarin Chinese grammar, and a vernacular one, which has its own grammar rules and its own set of particles. The current zh-HK locale doesn’t distinguish between them. It’s the equivalent of mixing Bokmål and Nynorsk into a single language.

We’re asking for a new “yue” language code on Common Voice where Cantonese can be recorded and categorized correctly. This is the code that’s recommended by BCP-47 and understood by the ICU. We can migrate the zh-HK data into this set later if necessary.
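As a rough illustration of why the tag matters, here is a minimal Python sketch that splits a tag into language, script, and region subtags. The `parse_tag` helper and its simplified pattern are assumptions made for this example; full BCP-47 also allows extlang, variant, and extension subtags, which this sketch ignores.

```python
import re

# Simplified pattern (an assumption for this sketch): language is 2-3 letters,
# optional script is 4 letters, optional region is 2 letters or 3 digits.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> dict:
    """Split a simple language tag into language, script, and region subtags."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


for tag in ("zh-HK", "yue", "yue-Hant", "yue-Hant-HK"):
    print(tag, parse_tag(tag))
```

With zh-HK the language subtag is only the macrolanguage zh and Cantonese is implied by the region, whereas yue names the language directly and leaves script (Hant, Hans) and region as optional, independent subtags.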

Note that there are also other languages under the macrolanguage zh. There will likely be requests for adding other languages like nan and hak. Lumping everything under zh is the equivalent of using the same language code for both French and Spanish (both are Romance languages that share the same roots).

I’m aware of the language situation. I don’t know how the decision was made to use the zh-HK designation on Common Voice, especially since it is not an ISO 639-2 code. I agree that the yue code would be more appropriate. However, Mozilla currently seems to have very little bandwidth for Common Voice work, and I don’t know whom to contact for this change.

Cantonese as used in Hong Kong consists of two strata: a literary one, which is based on Mandarin Chinese grammar, and a vernacular one, which has its own grammar rules and its own set of particles. The current zh-HK locale doesn’t distinguish between them.

Where does the request for the additional yue code fit into this? With two different datasets, zh-HK and yue, there would still be no distinction between literary and vernacular, not to mention regional varieties of Yue like Taishanese. What would be different in the yue dataset, besides the name? If the only difference between the two datasets would be political, then it strikes me as unproductive to confuse the issues.

We currently intend the yue dataset to be vernacular only. How the ‘literary’ version should be tagged is unknown to us, and it is also not something our team can invest in at this point. There are tons of (unfortunately proprietary) datasets of Cantonese speakers reciting paragraphs written in the ‘literary’ stratum, but the corpus for the vernacular stratum is extremely underdeveloped.

Our team does not have people fluent in Taishanese. Linguistically speaking, Taishanese is a dialect of Yue, but culturally speaking, it is often regarded as separate from Guangzhou Yue, which is deemed the standard form of Yue.

I think that classification of the Sze Yup dialects would be better reserved for those speakers.

The guide at 📖 Readme: How to see my language on Common Voice pointed me here, and I believe the yue tag is much more in line with the strategy outlined in that document.

The project sounds very interesting to me, but I’m still concerned about the relationship with the existing dataset. Have you looked at the text prompts in the zh-HK dataset? I believe the prompts are primarily vernacular Cantonese, even if some of them have Mandarin-influenced grammar or vocabulary. Here are some samples taken from the released dataset:

我就成日喺到諗嘞
所以我要取消佢哋外國護照嘅重發權
唔該咪成日喺度煲電話粥
你哋快啲啦!二嫂佢哋而家喺康城站等緊你哋
何老師問現正在銅鑼灣銅鑼灣道勸學生早啲返屋企
我小學個老師住喺何文田加多利軒

Is this different from the sort of language you are proposing for the new dataset? I believe the Mozilla policy is that each language corresponds to a 639-2 code, and regional varieties can be added to a list to be selected from by users. I would suggest changing the existing zh-HK code to yue. As I’m not part of the Mozilla team, I won’t bother you about the issue any more if you want to let them take it up. Please just be aware that they will likely be very slow to respond, yet I think your organization’s contributions would be great for the project!

Thanks for your suggestions! Yes, we checked out the existing corpus on GitHub, and we think the constructions are generally not very vernacular.

“何老師問現正在” isn’t vernacular; the natural vernacular form would be “何老師問而家喺”. I’ve also noticed at least thirty sentences that start with the same prefix in the sentence collector text file; those sentences do not sound like they occur naturally and seem to have been injected by an unknown source.

Here is also a random chunk from the sentence collector text file, none of which are vernacular either:

可見他也同謀,
可見心思是同從前一樣狠。
可讓您提交的數據更加豐富
可軟化血管和降血糖
可輕鬆換取,一齊密密玩密密賺
可選以下口味: 綠茶, 柚子, 芝麻, 紅豆
可選擇坐非常可愛的動物巴士
台上的少女同聖誕老人

Thanks for the explanations. My Cantonese teacher taught us 而家 but often used 正在 and other Mandarin words as well, so I don’t have a very clear picture of when or where these switches occur. It doesn’t sound to me like a different language, but perhaps you’ll get some response from Mozilla.

Unfortunately, many of the languages in Common Voice are plagued by prompts like these, which appear to be automatically generated. Perhaps some of the sentences you object to could be removed from zh-HK as well. I have commented on this problem here in the past, but now that Mozilla is spending minimal resources on this project, it seems unlikely to be fixed. I think this will be problematic for training some types of ASR and text-to-speech models.

For other applications like collecting a corpus for linguistic analysis, I don’t think Common Voice in its current form will be a viable platform, although note that the website source code is open source and I’ve seen it used on other projects. If you want to have control over what sentences (text prompts) become part of the project, this will be impossible on Common Voice, as anyone can add and verify sentences. It seems like your guidelines for 粵語 sentence contributions would have to be much stricter than for the other languages, although I would be more than happy to see improvements in these guidelines and checks.

Contributors on the zh-HK team are aware of the issue. We mostly agree with and support adding yue as a new language on Common Voice, and leaving the current “zh-hk” for Cantonese in Hong Kong specifically.

One reason is that Cantonese has changed a great deal in Hong Kong. It’s not just a dialect or accent problem. The difference also exists in vocabulary and grammar system, influenced by decades of cultural differences.

It looks like a problematic route to try overcoming the difference between Hong Kong Cantonese & Cantonese from other places in one set of data, and it is also politically impracticable. It could unnecessarily consume our local efforts in contributing to Common Voice.

Irvin

Please allow me to give my two cents here. What we are proposing is to add Cantonese as an independent language. The essential issue with Mozilla’s current approach is that it treats a language not as a language, but as a dialect or variant. Mozilla is mistaking a macro-language -> language relationship for a language -> sub-dialect relationship. We should be clear that Chinese is NOT a language, but rather a macro-language or a language family. Cantonese (ISO 639-3 code yue) and other Sinitic languages such as Hakka (hak) or Southern Min (nan) are independent languages within this family. At the moment Mozilla uses ISO 639-2 codes for locales, which only has zh as a language code, thus inevitably cramming multiple independent languages into one single locale. So imho, maybe the whole localization structure of Mozilla needs an overhaul: use ISO 639-3 instead of 639-2 codes, otherwise we will face this same problem if we want to add more Sinitic languages (Hakka, Southern Min) in the future.

Since we have the support of the MozTW community liaison and the understanding of the zh-HK team, is there anything we can do to get the language added to Pontoon for Common Voice?

I need to point out that

It’s not just a dialect or accent problem. The difference also exists in vocabulary and grammar system, influenced by decades of cultural differences.

is clearly not true. Hong Kong Cantonese differs only very slightly from Canton Cantonese (廣州音); linguistically they are still the same dialect (粵語廣府片). Also,

try overcoming the difference between Hong Kong Cantonese & Cantonese from other places

is clearly not the case either. As pointed out above, they are literally the same language and the same dialect; the difference between Hong Kong and Canton Cantonese is even smaller than that between US and UK English. This is exactly why we should amend the original locale structure and add the yue locale to avoid further confusion.

not to mention regional varieties of Yue like Taishanese.

Mentioning other varieties of Yue like Toishanese is pointless, just like we never think about Southwestern Mandarin in the zh-CN locale. When we talk about Cantonese, we are referring to the standard Cantonese, just like we are referring to Modern Standard Mandarin when we talk about Mandarin.

The root cause of the dilemma here is that the whole Mozilla locale structure is flawed in the first place. Using the ISO 639-2 codes is a mistake, because it wrongly classifies Chinese as a language when it is actually a macro-language, or a language family. That’s why we can’t find a place for the Cantonese language without using the 639-3 codes. Chinese is equivalent to the term Romance. Romance is a language family: Spanish is Romance, French is Romance, but you can’t say somebody speaks a language called “Romance”. As @hfhchan said, it is strongly recommended to use the BCP-47 codes, which classify languages better.

I would suggest changing the existing zh-HK code to yue .

This is actually our biggest concern. Will Mozilla permit such change?

Irvin Chen has already stated some points that the Mozilla Hong Kong community also agrees with. Please leave the current “zh-hk” for Cantonese in Hong Kong.

This week I also learned from my trusted network that debates about Yue and Cantonese in Chinese forums and groups within the language community never end; I think those debates should stay in those communities.

Please finish the UI translation on Pontoon and add 5000 sentences in the Sentence Collector.

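After some time to let things settle, my heart is still not truly at peace.
Right now in Hong Kong, young people are paying a heavy price for this land; they have not hesitated, and they have no regrets...

Looking back to early 2019, Sammy asked me to help promote the Common Voice project, hoping to build a zh-HK speech database targeting Cantonese (廣東話). At that time the 「自由香港字型」 (Free Hong Kong Font) project already had a solid foundation, and we were also starting to explore the Qbo One robot. Building a Cantonese speech database would later make it easier for the general public (including the elderly) to use robots, artificial intelligence, and other devices. I felt this database should collect the voices of the public at large, broaden public participation, and encourage people to contribute together.

As a social worker who has taken part in open-source communities for many years, I see the Common Voice project as an excellent platform for public participation. We had originally planned to introduce CV to the general public at the 2019 樂齡科技博覽會 (a gerontechnology expo), but because of the social movement, our participation was cancelled. In 2020 we exhibited at the expo again and introduced CV to the public. The response and support were enthusiastic, with people trying out recording and learning about the project. After the exhibition, teachers from two universities planned to organize over a hundred university students to take part, and a number of elderly people asked on their own initiative to join and help.

I remember that at the very start of the project, I went to Taiwan in person to learn from friends there how the project was run. I came to understand that CV wants to collect the voices of ordinary people. Pronunciation may vary, but the point is precisely to reflect society as it is and to collect the real voices of a place.

Some friends have asked me to respond point by point to the issues raised. As I am not an information technology specialist, I am afraid I cannot do so. I only know that when I used the voice assistant on my Android phone just now, it showed 「中文(繁體中文,香港)」 (Chinese (Traditional Chinese, Hong Kong)); I spoke Cantonese, and it correctly displayed the Chinese characters.

Recently, when the members of New Zealand's new Parliament were sworn in, James McDowall read the oath first in English and then in Cantonese, precisely to express support for Hong Kong people, not to show friendliness toward people from Guangdong! So Cantonese is one of the symbols of Hong Kong.

Therefore, I believe that in the industry and around the world, zh-HK is already widely recognized as representing Hong Kong's use of Traditional Chinese characters (正體中文) and Cantonese.

As for the other Chinese-language communities:

Do our friends in zh-cn not know that there are Shanghainese, Hokkien, Teochew, and Sichuanese?
Do our friends in zh-tw not know that Taiwan has the languages of different ethnic groups?
Do none of them know about Han?

Has a single request suddenly awakened them to the idea that the people contributing to these three Chinese-language communities were completely wrong? Or are they glad to contribute and build the speech database, with those labels being a minor issue by comparison? Do zh-cn and zh-tw not have friends outside their borders supporting them?

The post points to the situation of en-UK and en-US in English, yet does not mention the situation of Norwegian in CV?

I feel these opinions are endless and bring nothing new. When each of the Chinese-language communities on Common Voice started, they made their choices and decisions, all on the premise of achieving the greatest benefit to society with limited resources, and contributed selflessly.

As for whether to add yue? Let that proceed according to Mozilla's own policies and procedures.

Finally, please respect and be grateful to everyone who has contributed to the community. The community has long numbered in the thousands, not 20 people.
Please respect and trust the organizational structure. Mozilla's current official Hong Kong representative is Sammy, who will voice opinions on Mozilla's development and direction.

I came to help with this project because I am a Hongkonger, because of my identity as a Hongkonger, because I was born and live on this land of Hong Kong, and because we have to save our own voice ourselves.

Please love and protect this land. Hongkongers, keep going!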
