Common Voice sentences are the opposite of "common"

Hi @david-song, welcome to the community!

One of the main challenges of this project is having a public-domain text corpus big enough to cover the thousands of unique hours of recordings we need for a solid dataset.

The most successful approach we have taken so far, which gave us 2M+ sentences to read, is the Wikipedia extraction (where we still need help).

If you happen to know of another big source of sentences with a public-domain license that we can use, it would be great to plan our next steps for evolving the wikipedia-extractor tool so that it can also extract and clean up sentences from other sources.

Thanks for your feedback!