It took English a few years to reach the first goal of 1,200 hours, and if we really want to reach 10,000 hours for many languages, we will have to look into new ways of collecting samples.
If 1,200 hours is enough to create an MVP speech recognition system, we could use that system to bootstrap more data from sound files that don't come from the Common Voice website. People often offer their audio files when they hear about the project, and right now there is no way to use them.
All we need is:
A tool to cut recorded audio into chunks of roughly one sentence each. (AFAIK there are libraries for this that can detect speech and pauses.)
A speech recognition tool that creates a first-pass transcript of these chunks.
A website where these transcripts can be validated or corrected.
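The first step above can be sketched in a few lines. This is a minimal, pure-Python illustration of silence-based splitting; the function name and thresholds are my own assumptions, and in practice a library like pydub or a voice-activity detector would handle this more robustly:

```python
def split_on_silence(samples, threshold=500, min_silence=800):
    """Split a sequence of int16 audio samples into non-silent chunks.

    Returns a list of (start, end) index pairs. A run of at least
    `min_silence` consecutive samples with |amplitude| below `threshold`
    is treated as a pause that separates two chunks.
    """
    chunks = []
    start = None  # index where the current non-silent chunk began
    quiet = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # close the chunk just before the silence began
                chunks.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # trim any trailing quiet samples from the last chunk
        chunks.append((start, len(samples) - quiet))
    return chunks
```

For real recordings one would tune the threshold to the noise floor and express `min_silence` in milliseconds via the sample rate (e.g. 800 samples ≈ 50 ms at 16 kHz).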
People could donate their YouTube channels or their podcasts, and old public domain sources could be used easily. These sources would be much closer to the everyday speech of many people than another nine thousand hours of Wikipedia sentences read aloud.
I think this is a great idea. I have two YouTube channels with 50+ hours of speech in English and Portuguese, and I would donate the clips to be validated.
nukeador (Rubén Martín):
This is really interesting, thanks for opening the topic.
My understanding of your proposal is to use a tool built on the English DeepSpeech model to transcribe donated audio, and then have people validate/correct the results?
@rosana @kdavis would this be helpful if the community collects this kind of data?
This is very interesting, thank you. I will have a look at it.
Writing a script that transcribes the files after that shouldn't be too hard, but building a good frontend for validation and correction is likely a bigger project.