Questions about the sentence collecting guidelines for Cantonese (yue)

This is @laubonghaudoi from @CanCLID. Mozilla just opened a new yue (Cantonese) locale and we are adding new sentences to it. However, we are unsure how to apply some of the guidelines. The How to lists these rules for new sentences:

  • Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
  • Length. Sentences must be 14 words or less.

And our questions are:

  1. Some Cantonese variants (e.g. Hong Kong Cantonese) have a lot of English loan words, so sentences that mix Chinese characters and English letters are inevitable. We could purge these sentences and keep only those written purely in Chinese characters, but that would significantly shrink the collection. So is it okay to add code-mixed sentences?
  2. Many East Asian languages, Cantonese included, are written without spaces between words, so sentence length can’t be measured in words, only in characters. But if sentences are limited to 14 characters or fewer, we can hardly include any meaningful sentences. So is there another criterion for measuring the length of East Asian texts?

Thanks for moving this discussion from GitHub to here. I can’t answer the first question, but there are more knowledgeable contributors here who might know more.

It’s true that the HOWTO mostly reflects English and does not necessarily fit other languages.

The 14-word limit for English is chosen to produce clips of the desired length, which of course depends on speaking speed. In Cantonese the equivalent might be, say, 50 characters (I have no idea).

From a technical perspective, we could add a custom validator for Cantonese, which can then define its own rules for when a sentence is rejected on upload. The code can be found here: https://github.com/common-voice/sentence-collector/tree/main/server/lib/validation . As you can see there, the English validator tries to split sentences into words, but individual language files do not need to do that. The rules for Thai, for example, count characters and limit those instead: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/th.js
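
To make the idea concrete, here is a rough sketch of what character-based rules for Cantonese could look like. It is not the sentence-collector's actual validator interface: `countUnits`, `validateSentence`, and the returned object shape are hypothetical names chosen for illustration, and `MAX_LENGTH = 50` is only the placeholder number mentioned above, not a measured value. The sketch assumes code-mixed sentences are accepted, counts each Chinese character as one unit and each English loan word as one unit, and rejects letters from other scripts.

```javascript
// Hypothetical sketch only: names and return shape are assumptions,
// not the sentence-collector API.
const MIN_LENGTH = 1;
const MAX_LENGTH = 50; // placeholder from this thread, not a measured value

// Anything that is not a Han character, a basic Latin letter,
// punctuation, or whitespace is treated as a foreign letter.
const FOREIGN_CHARACTER = /[^\p{Script=Han}A-Za-z\p{P}\s]/u;

// Count length units: one per Han character, one per run of Latin letters
// (so an English loan word counts like a single character).
function countUnits(sentence) {
  const hanCharacters = sentence.match(/\p{Script=Han}/gu) || [];
  const latinWords = sentence.match(/[A-Za-z]+/g) || [];
  return hanCharacters.length + latinWords.length;
}

function validateSentence(sentence) {
  if (FOREIGN_CHARACTER.test(sentence)) {
    return { valid: false, error: 'Sentence contains letters from another script' };
  }

  const units = countUnits(sentence);
  if (units < MIN_LENGTH) {
    return { valid: false, error: 'Sentence is empty' };
  }
  if (units > MAX_LENGTH) {
    return { valid: false, error: `Sentence is longer than ${MAX_LENGTH} units` };
  }

  return { valid: true };
}

// Example: a code-mixed Hong Kong Cantonese sentence and a rejected one.
console.log(validateSentence('我今日去咗office開會。'));
console.log(validateSentence('我鍾意ж'));
```

Whether a loan word should count as one unit or more is a judgment call; the real limit would need to be tuned against how long recorded clips actually turn out.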