I’ve had another thought about the original topic of this thread.
Some sentences are more difficult to read than others, and some people have better reading comprehension than others. It ought to be possible to infer both of these from skip data: how often a sentence is skipped overall, and how likely a given user is to skip a sentence compared to other users who were shown it.
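A rough sketch of how that inference might look, assuming skip events are logged as (user, sentence, skipped) tuples. The function name `estimate`, the log format, and the smoothing constants are all made up for illustration; this is a simple smoothed skip rate, not a proper statistical model.

```python
from collections import defaultdict

def estimate(events, prior_skips=1.0, prior_shown=2.0):
    """Estimate sentence difficulty and per-user skip tendency.

    events: iterable of (user_id, sentence_id, skipped) tuples (assumed format).
    prior_skips / prior_shown: hypothetical smoothing constants so that a
    sentence shown only once or twice is not judged too harshly.
    """
    events = list(events)
    shown = defaultdict(int)
    skipped = defaultdict(int)
    for _, sentence, was_skipped in events:
        shown[sentence] += 1
        skipped[sentence] += int(was_skipped)

    # Smoothed skip rate per sentence, used as a stand-in for difficulty.
    difficulty = {
        s: (skipped[s] + prior_skips) / (shown[s] + prior_shown)
        for s in shown
    }

    # Per user: actual skips minus the skips expected from difficulty alone.
    actual = defaultdict(float)
    expected = defaultdict(float)
    count = defaultdict(int)
    for user, sentence, was_skipped in events:
        actual[user] += int(was_skipped)
        expected[user] += difficulty[sentence]
        count[user] += 1

    # Positive means the user skips more than other users on the same sentences.
    skip_tendency = {u: (actual[u] - expected[u]) / count[u] for u in count}
    return difficulty, skip_tendency
```

A user with a strongly positive skip tendency would be treated as having a lower comprehension level, and a sentence with a high smoothed skip rate as harder. Skips can of course happen for other reasons (offensive text, typos), so this is only a proxy.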
Start by giving the user a low comprehension level and feed them sentences that match that level, plus some random variation and a few sentences of unknown difficulty. As their comprehension level increases, they get more difficult sentences.
This way, a new user with a poor reading level won’t be immediately put off, and their voice and accent will be trained on sentences that they’re more likely to utter.
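The selection scheme above could be sketched roughly like this. Everything here is an assumption for illustration: levels and difficulties are taken to lie on the same 0 to 1 scale, and `pick_sentence`, `update_level`, `spread`, `explore` and `step` are hypothetical names and parameters, not anything the project defines.

```python
import random

def pick_sentence(level, difficulty, unrated, rng, spread=0.1, explore=0.1):
    """Choose the next sentence for a user.

    level:      the user's current comprehension level (assumed 0..1).
    difficulty: dict mapping sentence -> known difficulty (assumed 0..1).
    unrated:    list of sentences whose difficulty is still unknown.
    spread:     standard deviation of the random variation around the level.
    explore:    chance of showing an unrated sentence instead.
    """
    if not difficulty and not unrated:
        raise ValueError("no sentences available")
    # Occasionally show a sentence of unknown difficulty so it gets rated.
    if unrated and (not difficulty or rng.random() < explore):
        return rng.choice(unrated)
    # Otherwise aim near the user's level, with some variance.
    target = level + rng.gauss(0.0, spread)
    return min(difficulty, key=lambda s: abs(difficulty[s] - target))

def update_level(level, was_skipped, step=0.05):
    """Nudge the level up after a read and down after a skip, clamped to 0..1."""
    level += -step if was_skipped else step
    return min(1.0, max(0.0, level))
```

A new user would start at a level near 0, so they mostly see the easiest sentences, and each sentence they read rather than skip moves them a small step towards harder material.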
Re: licensing, going by the above post, I guess the collector can set the terms on anything that’s explicitly opt-in, so IRC, WhatsApp, Facebook, Twitter or mailing list posts won’t be a problem. Wiktionary and Simple English Wikipedia are a different issue though.