It took English a few years to reach the first goal of 1,200 hours, and if we really want to reach 10,000 hours for many languages, we will have to look into new ways of collecting samples.
If 1,200 hours is enough to create an MVP speech recognition system, we could use that system to bootstrap more data from sound files that don't come from the Common Voice website. People often offer their audio files when they hear about the project, and right now there is no way to use them.
All we need is:
A tool to cut recorded audio into chunks of roughly one sentence each. (AFAIK there are libraries for this that can detect speech and pauses.)
A speech recognition tool that creates a first-pass transcript of these chunks.
A website where these transcripts can be validated or corrected.
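The first step above can be sketched in a few lines. This is a minimal, pure-Python illustration of silence-based splitting; the function name and thresholds are my own assumptions, and in practice a library like pydub or a voice-activity detector would handle this more robustly:

```python
def split_on_silence(samples, threshold=500, min_silence=800):
    """Split a sequence of int16 audio samples into non-silent chunks.

    Returns a list of (start, end) index pairs. A run of at least
    `min_silence` consecutive samples with |amplitude| below `threshold`
    is treated as a pause that separates two chunks.
    """
    chunks = []
    start = None  # index where the current non-silent chunk began
    quiet = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # close the chunk just before the silence began
                chunks.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # trim any trailing quiet samples from the last chunk
        chunks.append((start, len(samples) - quiet))
    return chunks
```

For real recordings one would tune the threshold to the noise floor and express `min_silence` in milliseconds via the sample rate (e.g. 800 samples ≈ 50 ms at 16 kHz).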
People could donate their YouTube channels or their podcasts, and old public domain sources could be used easily. These sources would be much closer to the everyday speech of many people than another nine thousand hours of Wikipedia sentences read aloud.
I think this is a great idea. I have two YouTube channels with 50+ hours of speech in English and Portuguese, and I would donate the clips to be validated.
nukeador (Rubén Martín):
This is really interesting, thanks for opening the topic.
My understanding of your proposal is to use a tool built on the English DeepSpeech model to transcribe donated audio, and then have people validate/correct the results?
@rosana @kdavis would this be helpful if the community collects this kind of data?
This is very interesting, thank you. I will have a look at it.
Writing a script that transcribes the files after that shouldn't be too hard, but building a good frontend for validation and correction is likely a bigger project.