Tool to evaluate the phonetic coverage of the sentence DB using an input method table, for syllabary languages

irvin · July 24, 2018, 7:10pm

I would like to share a small command-line toolkit I wrote for evaluating the current coverage of Traditional Chinese in our sentence DB (using a phonetics-based input method table), as well as for removing duplicate sentences before submitting a PR on GitHub.

It can calculate the following stats:

✗ node text-tools.js -c all.txt CnsPhonetic2016-08v2.cin

Total numbers of phonetic in CnsPhonetic2016-08v2.cin are 1567
Numbers of phonetic from 2015 characters in all.txt are 861
We have cover 54.95% of the pronunciations.

This means that our current Trad. Chinese sentence DB in Common Voice covers 861 of the 1567 phonetics of Trad. Chinese, using 2015 distinct characters.

In other words, we are only recording about 55% of the possible pronunciations of Trad. Chinese. I need more sentences containing characters with pronunciations that are not covered yet.

A brief explanation of how it works:

I found a phonetics-based input method table (a .cin file), which includes the phonetics of all 105 thousand standard Trad. Chinese characters.
Count all the distinct phonetics in the table, which gives 1567.
Count the distinct phonetics of the characters in all the sentences we currently have (in a .txt file), which gives 861, and divide that by the total from the table.
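
For readers who want to try the idea on another language, here is a minimal sketch of the calculation. It is not the actual code of text-tools.js. It assumes the common .cin layout, where the lines between `%chardef begin` and `%chardef end` each hold a key sequence followed by the character it produces. The function names `parseCin` and `coverage` and the sample table are made up for illustration. It also assumes that a character with several readings counts as covering all of them, which may differ from what the real tool does.

```javascript
// Sketch only, not the real text-tools.js.
// Assumption: .cin lines inside "%chardef begin" ... "%chardef end"
// look like "<key sequence> <character>".

// Map each character to the set of key sequences (phonetics) that produce it.
function parseCin(cinText) {
  const charToPhonetics = new Map();
  let inChardef = false;
  for (const rawLine of cinText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("%chardef")) {
      inChardef = /begin/.test(line); // "begin" opens the section, "end" closes it
      continue;
    }
    if (!inChardef || line === "") continue;
    const [keys, char] = line.split(/\s+/);
    if (!keys || !char) continue;
    if (!charToPhonetics.has(char)) charToPhonetics.set(char, new Set());
    charToPhonetics.get(char).add(keys);
  }
  return charToPhonetics;
}

// Compare the phonetics reachable from the sentences with all phonetics in the table.
function coverage(sentencesText, charToPhonetics) {
  const allPhonetics = new Set();
  for (const set of charToPhonetics.values()) {
    for (const p of set) allPhonetics.add(p);
  }
  const usedChars = new Set();
  const coveredPhonetics = new Set();
  for (const ch of sentencesText) { // iterates by code point, not UTF-16 unit
    const phonetics = charToPhonetics.get(ch);
    if (!phonetics) continue; // punctuation, newlines, characters not in the table
    usedChars.add(ch);
    for (const p of phonetics) coveredPhonetics.add(p);
  }
  const total = allPhonetics.size;
  const covered = coveredPhonetics.size;
  return {
    total,
    covered,
    characters: usedChars.size,
    percent: total === 0 ? 0 : (covered / total) * 100,
  };
}

// Tiny made-up table and sentence list, for illustration only.
const sampleCin = [
  "%chardef begin",
  "su3 你",
  "cl3 好",
  "cl4 好",
  "a8 媽",
  "%chardef end",
].join("\n");

const stats = coverage("你好\n你好嗎", parseCin(sampleCin));
console.log(`${stats.covered} of ${stats.total} phonetics (${stats.percent.toFixed(2)}%)`);
// prints "3 of 4 phonetics (75.00%)"
```

With real files, the two inputs could be read with `fs.readFileSync(path, "utf8")`. The same division applied to the numbers above, 861 / 1567, gives the 54.95% reported by the tool.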

I believe this tool would be useful for other syllabary languages that have a phonetics-based input method available.