Hi @david-song welcome to the community!
One of the main challenges of this project is finding a public-domain text corpus large enough to support the thousands of unique recording hours we need for a solid dataset.
The most successful approach so far has been the Wikipedia extraction, which gave us 2M+ sentences to read (and where we still need help).
If you happen to know of another large source of sentences with a public-domain license we can use, that would be great for planning our next steps: evolving the wikipedia-extractor tool so it can also extract and clean up sentences from other sources.
Thanks for your feedback!