Addendum about your calculations on “how much time or recording needed per day”.
For simplicity, assume the following:
- The average recording duration in your corpus is 3.6 seconds, so 1 hour of recordings means 1000 recordings (3600 s / 3.6 s).
- It is good to have multiple recordings per sentence, from diverse genders/ages/accents. Say you aim for 3 recordings per sentence on average.
- Suppose your language and model require 1000 hours to give good results for your application.
- Suppose your community can produce 1 hour of VALIDATED recordings per day (i.e. 1000 recordings) on average.
As a result, to reach your 1000-hour goal:
- You would need ~333k different sentences (1000 hours × 1000 recordings per hour / 3 recordings per sentence).
- You would need 1000 days (2.74 years).
I’m assuming you use ALL validated recordings. If you only use a subset of them, it will take longer.
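If you want to plug in your own numbers, the arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the same simplifying assumptions (constant average duration, constant daily output, all validated recordings used); the function and parameter names are mine, made up for illustration.

```python
def estimate(avg_seconds, recordings_per_sentence, target_hours, validated_hours_per_day):
    """Rough planning estimate; every input is an assumption you supply."""
    # Number of recordings that fit in one hour of audio.
    recordings_per_hour = 3600 / avg_seconds
    # Total recordings needed to fill the target number of hours.
    total_recordings = target_hours * recordings_per_hour
    # Distinct sentences needed if each is recorded several times on average.
    sentences_needed = total_recordings / recordings_per_sentence
    # Days needed at a constant daily rate of validated audio.
    days_needed = target_hours / validated_hours_per_day
    return {
        "recordings_per_hour": recordings_per_hour,
        "total_recordings": total_recordings,
        "sentences_needed": sentences_needed,
        "days_needed": days_needed,
        "years_needed": days_needed / 365,
    }


# The example values from this post: 3.6 s average, 3 recordings per sentence,
# 1000 hours target, 1 validated hour per day.
result = estimate(3.6, 3, 1000, 1.0)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

Halving the daily output (0.5 validated hours per day) doubles the number of days, which shows how sensitive the timeline is to community activity.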
In my experience, 1 hour of validated recordings per day can be reached only with a large contributor base (e.g. English), or with constant effort from community leads who can run many campaigns. The latter was required in our case, with ~1000 diverse voices.