- Your text corpus has many sentences, but not all of them have been recorded yet. An important issue is that many of them turn out to be duplicates once you normalize them.
18.7% is a high value IMHO.
- In validated.tsv, you have only 53,772 different sentences, so only a small portion of your text corpus has been covered (20.8% of the normalized unique sentences).
- The total number of (normalized) tokens in the whole text corpus is 48,009, so you probably got your 28,840 from the validated.tsv file.
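
If you want to reproduce this kind of check yourself, here is a minimal sketch. It is not the exact script behind the numbers above. The `normalize` function is a hypothetical normalization (NFKC, lowercase, punctuation removed, whitespace collapsed), and the duplicate share is taken as (total - unique) / total, so the results may differ somewhat from the figures quoted here. The token count is taken as the number of distinct normalized tokens, which is only one possible reading of "total tokens".

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Hypothetical normalization: NFKC, lowercase, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


def corpus_stats(corpus, recorded):
    """Compare a text corpus with the sentences that were actually recorded.

    corpus   -- iterable of sentences from the text corpus
    recorded -- iterable of sentences from the recordings (e.g. the sentence
                column of validated.tsv, assuming you have loaded it already)
    """
    norm_corpus = [n for n in (normalize(s) for s in corpus) if n]
    unique_corpus = set(norm_corpus)
    unique_recorded = {normalize(s) for s in recorded} - {""}

    total = len(norm_corpus)
    unique = len(unique_corpus)
    covered = len(unique_recorded & unique_corpus)
    # Distinct tokens over the unique normalized sentences (assumed definition).
    distinct_tokens = {tok for s in unique_corpus for tok in s.split()}

    return {
        "total": total,
        "unique": unique,
        "duplicate_pct": 100.0 * (total - unique) / total if total else 0.0,
        "coverage_pct": 100.0 * covered / unique if unique else 0.0,
        "distinct_tokens": len(distinct_tokens),
    }
```

You would call it as `corpus_stats(corpus_sentences, validated_sentences)`, where both arguments are plain lists of strings that you read from your own files.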
