Contributing many sentences that contain not yet spoken words

  1. Your text corpus has many sentences, but not all of them have been recorded yet. An important issue is that many of them turn out to be duplicates once you normalize them.

18.7% is a high value IMHO.
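A minimal sketch of how such a duplicate rate could be measured. The normalization here (NFC, lowercasing, stripping punctuation, collapsing whitespace) is an assumption and may differ from the one behind the figure above; `normalize` and `duplicate_rate` are illustrative names, and language-specific rules (apostrophes, for example) would need adapting.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Assumed normalization: NFC, lowercase, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFC", sentence).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())


def duplicate_rate(sentences) -> float:
    """Share of sentences that repeat another one after normalization."""
    # Skip sentences that are empty once normalized.
    normalized = [n for s in sentences if (n := normalize(s))]
    if not normalized:
        return 0.0
    return 1 - len(set(normalized)) / len(normalized)
```

For example, `duplicate_rate(["Hello, world!", "hello world", "Another one."])` treats the first two sentences as the same, so one of three sentences counts as a duplicate.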

  1. In validated.tsv you have only 53,772 distinct sentences, so only a small portion of your text corpus has been covered so far (20.8% of the normalized unique sentences).

  2. The total number of (normalized) tokens in the whole text corpus is 48,009; you probably got your 28,840 from the validated.tsv file.
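The two comparisons above (sentence coverage and token counts) could be reproduced roughly as follows. This is only a sketch under assumptions: the normalization is a simple guess, "tokens" is taken to mean distinct whitespace-separated words, validated.tsv is assumed to have a `sentence` column, and all function names are hypothetical.

```python
import csv
import re


def normalize(sentence: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", sentence.lower()).split())


def read_validated_sentences(path: str) -> list:
    """Read the sentence column from a tab-separated file (column name assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row["sentence"] for row in reader]


def coverage_and_vocab(corpus_sentences, recorded_sentences):
    """Return (coverage, distinct corpus tokens, distinct recorded tokens)."""
    corpus = {n for s in corpus_sentences if (n := normalize(s))}
    recorded = {n for s in recorded_sentences if (n := normalize(s))}
    # Coverage: fraction of unique normalized corpus sentences already recorded.
    coverage = len(corpus & recorded) / len(corpus) if corpus else 0.0
    corpus_tokens = {t for s in corpus for t in s.split()}
    recorded_tokens = {t for s in recorded for t in s.split()}
    return coverage, len(corpus_tokens), len(recorded_tokens)
```

Calling `coverage_and_vocab(corpus, read_validated_sentences("validated.tsv"))` with the full text corpus as `corpus` would give numbers comparable to the ones above, as long as the same normalization is applied to both sides.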