iveskins (not sure how to pronounce that)
I would not worry too much about the details. If we were doing TTS we would want to sweat the triphone or pentaphone coverage, and we might consider selecting individual phrases because they add unseen pentaphones. But for ASR, as long as we have a good variety of text, we should be OK. The most important rule of collecting training data is that it should look like the data you are testing on (in multiphone content, speaking style, accent, background noise, etc.). Let the models figure everything else out.
For computing my statistics above I combined a number of publicly available dictionaries: CMUDICT, plus those that are included with the TED-LIUM2 distributions (see http://www.openslr.org/resources.php), and stripped out the stress markers. I mostly ignored issues to do with OOV words, multiple pronunciations (I just picked the first one), cross-word vs. word-internal context, etc. I also extracted all the text from the csv files (but I imagine the set of prompts being used is available elsewhere).
A short python script will compute everything very easily. (I could probably provide my script, if you needed it.)
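To give an idea of what such a script could look like, here is a minimal sketch (not my actual script). It assumes CMUdict-style dictionary lines (`WORD PH1 PH2 ...`, alternate pronunciations marked as `WORD(2)`), keeps only the first pronunciation, strips stress digits, silently skips OOV words, and counts n-phones across word boundaries within a sentence. The function names and the `SIL` padding symbol are my own choices for illustration.

```python
import re
from collections import Counter

def parse_dict(lines):
    """Parse CMUdict-style lines into {word: [phones]}, keeping the first pronunciation."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        parts = line.split()
        word = re.sub(r"\(\d+\)$", "", parts[0]).lower()    # "READ(2)" -> "read"
        phones = [re.sub(r"\d", "", p) for p in parts[1:]]  # "AH0" -> "AH"
        lexicon.setdefault(word, phones)                    # first pronunciation wins
    return lexicon

def phones_for(sentence, lexicon):
    """Map a sentence to a flat phone sequence; OOV words are simply dropped."""
    phones = []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        phones.extend(lexicon.get(word, []))
    return phones

def nphone_counts(sentences, lexicon, n):
    """Count n-phones (n=3 for triphones, n=5 for pentaphones), padding with SIL."""
    pad = ["SIL"] * (n // 2)
    counts = Counter()
    for sentence in sentences:
        seq = pad + phones_for(sentence, lexicon) + pad
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts

# Toy data standing in for the real dictionaries and prompts.
lexicon = parse_dict(["CAT  K AE1 T", "CAT(2)  K AA1 T", "SAT  S AE1 T"])
triphones = nphone_counts(["The cat sat"], lexicon, 3)   # "the" is OOV here
pentaphones = nphone_counts(["The cat sat"], lexicon, 5)
print(len(triphones), "distinct triphones,", len(pentaphones), "distinct pentaphones")
```

On real data you would feed in the merged dictionary files and the sentence column from the csv files, then compare `len(triphones)` and `len(pentaphones)` between data sets.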
I want to make one point about triphones/pentaphones. Modern ASR systems (except maybe CTC systems) use pentaphones. But even if your system uses triphones, you still want to compute pentaphone coverage, because this gives you a better idea of the variation in the data. Say you have 10,000 training examples of a particular triphone; they might all come from the same word, whereas you would definitely prefer them to come from different words which happen to share that triphone. Pentaphone counts don’t measure this exactly, but they capture some of it.
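A rough way to see this in numbers is to count, for each triphone, how many distinct pentaphones it occurs inside. The sketch below is just an illustration under my own assumptions: utterances are already lists of phones, `SIL` is a made-up padding symbol, and `triphone_diversity` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def triphone_diversity(utterances):
    """For each triphone, return (total count, number of distinct pentaphone contexts).

    utterances: iterable of phone lists, e.g. [["K", "AE", "T"], ...]
    """
    counts = Counter()
    contexts = defaultdict(set)
    for phones in utterances:
        seq = ["SIL", "SIL"] + list(phones) + ["SIL", "SIL"]
        # i walks over the real phones only; padding is never a centre phone.
        for i in range(2, len(seq) - 2):
            tri = tuple(seq[i - 1:i + 2])
            penta = tuple(seq[i - 2:i + 3])
            counts[tri] += 1
            contexts[tri].add(penta)
    return {tri: (counts[tri], len(contexts[tri])) for tri in counts}

# Three copies of the same word plus one different word sharing the triphone K-AE-T.
utterances = [["K", "AE", "T"]] * 3 + [["S", "K", "AE", "T", "S"]]
stats = triphone_diversity(utterances)
print(stats[("K", "AE", "T")])
```

A triphone with a large count but only one or two pentaphone contexts is probably coming from a handful of words, which is the situation described above.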
I hope this answers your questions.