For the experiment I chose a dialect of Chinese, zh-yue, which has far fewer wiki pages. It corresponds to zh-HK in the CV project. I got zh_yuewiki-20200520-pages-articles-multistream.xml, a 266M file, from Wikipedia. The rules file was basically copied from en.toml, with the parameters tweaked to extract more sentences, e.g. min_trimmed_length = 0, min_word_count = 0, max_word_count = 1000, disallowed_symbols = [], etc. The result has only 247 sentences, and only 1/10 of them are purely Chinese and therefore potentially useful for the CV project.
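To make "purely Chinese" concrete, here is a rough sketch of the kind of check I have in mind. It is not code from the extractor: the function names, the ideograph range (U+4E00 to U+9FFF only) and the small punctuation list are all assumptions for illustration.

```rust
/// Returns true if every non-whitespace character is a CJK Unified Ideograph
/// (U+4E00..=U+9FFF) or one of a few full-width punctuation marks, and at
/// least one ideograph is present.
/// Assumption: both the range and the punctuation list are illustrative only.
fn is_purely_chinese(sentence: &str) -> bool {
    const ALLOWED_PUNCTUATION: [char; 6] = [',', '。', '、', '?', '!', ':'];
    let mut saw_ideograph = false;
    for c in sentence.chars().filter(|c| !c.is_whitespace()) {
        if ('\u{4E00}'..='\u{9FFF}').contains(&c) {
            saw_ideograph = true;
        } else if !ALLOWED_PUNCTUATION.contains(&c) {
            // Latin letters, digits, or any other symbol disqualify the sentence.
            return false;
        }
    }
    saw_ideograph
}

/// Fraction of sentences that pass `is_purely_chinese` (0.0 for an empty slice).
fn purely_chinese_ratio(sentences: &[&str]) -> f64 {
    if sentences.is_empty() {
        return 0.0;
    }
    let pure = sentences.iter().filter(|&&s| is_purely_chinese(s)).count();
    pure as f64 / sentences.len() as f64
}
```

Calling `purely_chinese_ratio` on the extracted sentences gives the share that contains no Latin text or digits, which is the figure I am referring to above.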
If zh-cn has been extracted successfully before, I would be more than happy to look at its rules file for reference.
A quick look at the source code gives me the impression that it does define a set of Chinese punctuation:
static PUNCTUATIONS: [char; 37] = [
'"', '"', '、', '‧', '—', '—', '—', '~', '“', '”', ';', '·', ':', '‘',
'•', '─', '兀', '∶', '∧', '∨', ',', '、', '.', ';', ':', '#', '&',
'*', '+', '-', '<', '>', '=', '$', '%', '@', ',',
];
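I have not traced how this table is consulted in the branch. As a minimal sketch of one plausible use, the snippet below drops punctuation characters from a sentence before further checks. `EXAMPLE_PUNCTUATIONS` and `strip_punctuation` are hypothetical names, and the table is a shortened stand-in, not the one quoted above.

```rust
/// Hypothetical, shortened punctuation table for illustration only;
/// it is NOT the PUNCTUATIONS array from the branch.
static EXAMPLE_PUNCTUATIONS: [char; 5] = ['、', ',', '。', '「', '」'];

/// Returns a copy of `sentence` with every character listed in
/// EXAMPLE_PUNCTUATIONS removed; all other characters are kept in order.
fn strip_punctuation(sentence: &str) -> String {
    sentence
        .chars()
        .filter(|c| !EXAMPLE_PUNCTUATIONS.contains(c))
        .collect()
}
```

With this table, `strip_punctuation("「你好」,世界。")` yields `你好世界`. A real implementation would need the full-width forms of each mark in the table for the lookup to match Chinese text.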
I notice this branch (mandarin) has had no updates for 10 months. What’s its status? Will it be released for extracting the Chinese family of languages (currently zh-cn, zh-tw, zh-hk in the CV project)? Or will its functionality be merged into the master branch?