I got a question from a volunteer that I don’t have a good answer to. Maybe you could help.
So the question is: why not use data augmentation to make more data from the existing recordings? For example, by adding background noise (like cafe sounds) to them.
The closest answer I found on the CV site is the note on TTS audio in the Contribution Guidelines.
“Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”
But that ain’t it, right? The question is not about synthetic voices but about augmented ones.
I also did some Googling and there seem to be some positive opinions on using augmented as well as synthetic voices for STT training.
So, here are a few questions I’d like to ask to wrap my head around this topic:
- Why not use TTS to enrich CV data? My guess would be that you can use TTS for training STT, but the CV dataset just is not the place to do that.
- Can you augment CV data to make more, and more diverse, data?
- If yes, what exact techniques of audio augmentation are ok for STT training, and what are not?
- If yes, is the ratio of real audio to augmented audio something to consider?
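
To make the questions concrete, here is a minimal sketch of the kind of augmentation I have in mind: mixing background noise into a recording at a chosen signal-to-noise ratio, then adding noisy copies for a fraction of the real clips. This is only an illustration, not anything taken from CV tooling. The names `mix_at_snr`, `augment_dataset`, `snr_db` and `augmented_per_real` are made up for this example, and it assumes the audio is already loaded as mono NumPy float arrays at the same sample rate.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB)."""
    # Loop the noise if it is shorter than the clip, then trim to the clip length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)  # assumed non-zero (noise is not silence)

    # SNR_dB = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


def augment_dataset(clips, noise, snr_db=10.0, augmented_per_real=0.5, seed=0):
    """Return the real clips plus noisy copies of a random subset of them.

    augmented_per_real is the (hypothetical) ratio of augmented to real clips;
    this sketch only supports values between 0 and 1.
    """
    rng = np.random.default_rng(seed)
    n_aug = int(round(len(clips) * augmented_per_real))
    picked = rng.choice(len(clips), size=n_aug, replace=False)
    return list(clips) + [mix_at_snr(clips[i], noise, snr_db) for i in picked]


# Usage with synthetic stand-ins for real audio (1 second at 16 kHz).
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
fake_speech = np.sin(2 * np.pi * 220.0 * t)
fake_cafe_noise = np.random.default_rng(42).normal(size=8000)
dataset = augment_dataset([fake_speech] * 4, fake_cafe_noise, snr_db=10.0)
```

The two knobs here, `snr_db` and `augmented_per_real`, are exactly the things I am unsure about: how much noise is acceptable, and how many augmented clips per real clip is reasonable.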