I’ve used this approach to create my Spanish dataset; you can read what I did here: Releasing my Spanish dataset - 120h of public domain data
As for the quality of the dataset, it's hard to tell. I need people to test it, since I can’t manually review all 110k files. A good approach is probably to sort the files by loss and start the review with the highest-loss ones.
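
To make the sorting idea concrete, here is a rough sketch of how it could look. It assumes you already have a per-file loss from evaluating a trained model on the dataset, saved as a CSV. The column names `wav_filename` and `loss` and the helper names are made up for the example, so adapt them to whatever your evaluation actually outputs.

```python
import csv
import io


def read_losses(fileobj):
    """Read (wav_filename, loss) pairs from a CSV with those two columns.

    The column names are an assumption for this sketch.
    """
    reader = csv.DictReader(fileobj)
    return [(row["wav_filename"], float(row["loss"])) for row in reader]


def rank_by_loss(rows, top_n=None):
    """Sort (filename, loss) pairs from highest to lowest loss.

    If top_n is given, return only the top_n highest-loss entries.
    """
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Tiny in-memory stand-in for a real per-file loss report.
sample = io.StringIO(
    "wav_filename,loss\n"
    "a.wav,12.5\n"
    "b.wav,87.1\n"
    "c.wav,3.2\n"
)

# Files most likely to have a bad transcript or alignment come first.
review_queue = rank_by_loss(read_losses(sample), top_n=2)
for name, loss in review_queue:
    print(f"{loss:8.2f}  {name}")
```

A high loss doesn't prove a file is bad (it can also just be a hard utterance), but it should put most of the broken ones near the top of the list, so the manual review can stop once the files start looking fine.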