This is @laubonghaudoi from @CanCLID. Mozilla just opened a new yue (Cantonese) locale, and we are adding new sentences to it. However, we are confused about some of the guidelines. According to the How to, new sentences must follow these rules:
- Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
- Length. Sentences must be 14 words or less.
Our questions are:
- Some Cantonese variants (e.g. Hong Kong Cantonese) have many English loan words, so sentences that mix Chinese characters and English letters are inevitable. We could purge these sentences and keep only those written purely in Chinese characters, but that would significantly shrink the collection. So is it okay to add code-mixed sentences?
- Many East Asian languages, including Cantonese, are written without spaces between words, so sentence length can't be measured in "words", only in "characters". But if sentences are limited to 14 characters or fewer, we can hardly include any meaningful sentences. So is there another criterion for measuring the length of texts in these languages?
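
To illustrate the length problem, here is a minimal Python sketch. The helper names and the example sentence are our own and are not part of any Common Voice tooling. The counting rule (each Chinese character counts as one unit, each run of English letters counts as one unit) is only one possible criterion, offered for discussion.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is matched here.
# Some Cantonese characters live in extension blocks and would need extra ranges.
HAN = re.compile(r"[\u4e00-\u9fff]")
# A run of ASCII letters is treated as one English loan word.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def whitespace_words(sentence: str) -> int:
    """Count 'words' by splitting on whitespace, as one would for English."""
    return len(sentence.split())


def mixed_units(sentence: str) -> int:
    """Hypothetical length measure: Han characters plus runs of English letters."""
    return len(HAN.findall(sentence)) + len(LATIN_RUN.findall(sentence))


# "Shall we go and have a buffet tonight?" (Hong Kong Cantonese, code-mixed)
example = "我哋今晚去食buffet好唔好"

print(whitespace_words(example))  # 1: the whole sentence looks like a single "word"
print(len(example))               # 15: raw character count, English letters included
print(mixed_units(example))       # 10: 9 Chinese characters + 1 English word
```

A whitespace-based count treats the whole sentence as one word, while a raw character count charges six units for a single loan word. Under a strict 14-character limit this short sentence would already be rejected.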