We are building an educational platform for economically disadvantaged kids aged 4 to 6, and we plan to incorporate Common Voice into it to help them improve their English reading skills.
Initially, we plan to build a game where the child reads out individual words of a story, and our game gives feedback on whether or not the child has pronounced them correctly.
Before we do that, we obviously need a training data-set of kids' voices.
We have the ability to collect kids' voices, but before we start doing that, I’d love some advice from the community on exactly what to collect. Specifically:
How many unique kids’ voices should we aim to collect? I know more is better, but since our budget is limited, what is a realistic minimum number of individual kids we would need to record in order to have a reliable training data-set?
For each kid, how many recorded words should we collect on average?
How many unique words do we need recordings of across the whole training data-set? Is it better to have a large number of unique words (and therefore fewer sample recordings per word), or fewer unique words (and therefore more sample recordings per word)? What combination of number of unique words and number of sample recordings per unique word should we aim for?
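To make the trade-off in the last question concrete, here is a rough sketch of the arithmetic. The numbers are placeholders for illustration only, not figures we have settled on, and the function name is hypothetical. It also assumes recordings are spread evenly across the vocabulary, which may not hold in practice.

```python
def recordings_per_word(num_kids, words_per_kid, unique_words):
    """Average recordings per unique word, assuming an even spread
    of recordings across the vocabulary (a simplifying assumption)."""
    total_recordings = num_kids * words_per_kid
    return total_recordings / unique_words


# Placeholder budget: 100 kids recording 200 words each (20,000 recordings).
# For a fixed budget, a larger vocabulary means fewer samples per word.
for vocab_size in (100, 500, 2000):
    avg = recordings_per_word(100, 200, vocab_size)
    print(f"{vocab_size} unique words: about {avg:.0f} recordings per word")
```

With a fixed total number of recordings, the two quantities pull against each other, which is why I am asking where the sensible balance lies.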
If you need more information from me to be able to answer the above questions, please let me know.