I would like to share a small command-line toolkit I wrote for evaluating how well our sentence DB covers Traditional Chinese pronunciations (using a phonetics-based input method table), as well as for removing duplicate sentences before submitting a PR on GitHub.
It can calculate the following stats:
✗ node text-tools.js -c all.txt CnsPhonetic2016-08v2.cin
Total numbers of phonetic in CnsPhonetic2016-08v2.cin are 1567
Numbers of phonetic from 2015 characters in all.txt are 861
We have cover 54.95% of the pronunciations.
This means that our current Trad. Chinese DB in Common Voice covers 861 of the 1567 phonetics of Trad. Chinese, using 2015 unique characters.
In other words, we're only recording about 55% of the possible pronunciations of Trad. Chinese. I need more sentences with characters that have different pronunciations.
A brief explanation of how it works:
- I found a phonetics-based input method table (.cin file), which includes the phonetics of all 105 thousand standard Trad. Chinese characters.
- Count all the distinct phonetics in the table, which comes to 1567.
- Count the distinct phonetics of the characters in all the sentences we currently have (in a .txt file), and divide that count by the total from the table.
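To make the steps above concrete, here is a minimal sketch of the calculation. It is not the actual code of text-tools.js: the function names `parseCin` and `coverage` are hypothetical, and it assumes the .cin table keeps its entries between `%chardef begin` and `%chardef end`, one "key sequence, whitespace, character" pair per line. It also assumes each distinct key sequence counts as one phonetic, and that a character with several readings covers all of them, which may overestimate the real coverage.

```javascript
// Hypothetical sketch; not the actual text-tools.js implementation.
// Assumes a .cin table whose "%chardef begin" ... "%chardef end" section
// holds one "<key sequence> <character>" pair per line.

function parseCin(cinText) {
  const charToPhonetics = new Map(); // character -> Set of key sequences
  const allPhonetics = new Set();    // every distinct key sequence in the table
  let inChardef = false;
  for (const rawLine of cinText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith('%chardef')) {
      // "%chardef begin" opens the section, "%chardef end" closes it
      inChardef = /\bbegin\b/.test(line);
      continue;
    }
    if (!inChardef || line === '') continue;
    const [keys, char] = line.split(/\s+/);
    if (!keys || !char) continue;
    allPhonetics.add(keys);
    if (!charToPhonetics.has(char)) charToPhonetics.set(char, new Set());
    charToPhonetics.get(char).add(keys);
  }
  return { charToPhonetics, allPhonetics };
}

function coverage(sentencesText, cinText) {
  const { charToPhonetics, allPhonetics } = parseCin(cinText);
  const uniqueChars = new Set(); // distinct characters found in the table
  const covered = new Set();     // distinct phonetics those characters map to
  for (const char of sentencesText) { // for...of iterates by code point
    const phonetics = charToPhonetics.get(char);
    if (!phonetics) continue; // punctuation, newlines, unknown characters
    uniqueChars.add(char);
    for (const p of phonetics) covered.add(p);
  }
  const total = allPhonetics.size;
  return {
    total,
    covered: covered.size,
    characters: uniqueChars.size,
    percent: total === 0 ? 0 : (covered.size / total) * 100,
  };
}
```

With the numbers reported above, the last step is simply 861 / 1567 × 100, which rounds to 54.95. To run the sketch on real files, you could read them with `fs.readFileSync(path, 'utf8')` and pass both strings to `coverage`.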
I believe this tool would also be useful for other syllabary-based languages that have a phonetics-based input method table available.