Is data augmentation used for STT training?

I got a question from a volunteer that I don’t have a good answer to. Maybe you could help.

So the question is - why not use data augmentation to make more data from existing ones? e.g. Add background noise (like cafe sounds) to existing recordings.

The closest answer I found on the CV site is the note on TTS voices in the Contribution Guidelines.

“Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”

But that isn’t quite it, right? The question is not about synthetic voices but augmented ones.

I also did some Googling and there seem to be some positive opinions on using augmented as well as synthetic voices for STT training.

So, here are a few questions I’d like to ask to get my brain around this topic:

  1. Why not use TTS to enrich CV data? My guess would be that you can use TTS for training STT, but the CV dataset just isn’t the place to do that.
  2. Can you augment CV data to make more and diverse data?
  3. If yes, what exact techniques of audio augmentation are ok for STT training, and what are not?
  4. If yes, is the ratio of real to augmented audio something to consider?

Let me summarize what I know and what I experienced (not in the order of your questions) - but I’m in no way an expert in this area:

  1. It is perfectly fine to use augmentation in STT training and it is commonly used.
  2. AFAIK, there are two methods of doing it: (a) applying augmentation directly to the base data to duplicate it (which you mentioned), and (b) the SpecAugment method, which is applied to log-Mel spectrograms. You can even use both in some cases.
  3. For the first case, you can use resources like MUSAN to mix noise, speech, and/or music on top of the originals. This duplicates your input data, so the training duration increases accordingly (e.g. a 3-day training becomes 7-8 days).
  4. In the second case (SpecAugment), you can only play with some aspects of the sound, because all you have is the frequency domain, i.e. power per frequency bin over time. The input data stays the same and the augmentation is applied in your code on the fly, so it is much more efficient.
  5. Both of these might help you get better accuracy, but to a limited extent, e.g. 10-15% relative improvement (e.g. WER=0.20 drops to 0.17 - I stand to be corrected if anyone has better results). It depends on the dataset, language, model architecture, method, and the settings you apply. Therefore, it might be a good idea to first experiment with rough values until you get a working model you are happy with, then apply augmentation for the final run (and compare the results). That makes your experiments more cost-effective (time, power & CO2 footprint).
  6. A last point is about Common Voice: part (if not most) of the data is somewhat augmented already - it is not a clean dataset. If we had a cleaner dataset like Fleurs etc., we could play with it a bit more. But many recordings already contain disturbances: low SNR, background noise/speech, and so on. In an ideal world, one could identify the “clean recordings” and apply the first method only to them. One could use some math to estimate the SNR, tag recordings as clean or not by listening to them one by one, etc. (all of these are in my head; I never had the time and energy for them).
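To make the two methods above concrete, here is a minimal numpy sketch of both: additive-noise mixing at a target SNR (method a) and SpecAugment-style frequency/time masking on a log-Mel spectrogram (method b). This is an illustration only, not the author's actual pipeline; the function names and parameter values are mine, and real setups would use a noise corpus like MUSAN and a proper feature extractor.

```python
import numpy as np

def mix_noise(speech, noise, snr_db=10.0):
    """Additive-noise augmentation: mix a noise clip into speech at a target SNR.
    The noise is looped/trimmed to match the speech length, then scaled so that
    10*log10(P_speech / P_noise_scaled) == snr_db."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def spec_augment(spec, n_freq_masks=1, freq_width=3,
                 n_time_masks=1, time_width=5, rng=None):
    """Minimal SpecAugment: zero out random frequency bands and time spans
    of a log-Mel spectrogram with shape (n_mels, n_frames)."""
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f0 = rng.integers(0, n_mels - freq_width)
        spec[f0:f0 + freq_width, :] = 0.0          # mask a frequency band
    for _ in range(n_time_masks):
        t0 = rng.integers(0, n_frames - time_width)
        spec[:, t0:t0 + time_width] = 0.0          # mask a time span
    return spec

# Toy usage with fake data (1 s of audio at 16 kHz, an 80-Mel spectrogram)
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
noisy = mix_noise(speech, noise, snr_db=10.0)
masked = spec_augment(rng.standard_normal((80, 100)), rng=rng)
```

Note the trade-off the list describes: `mix_noise` produces new audio files (more disk, longer epochs), while `spec_augment` runs on features inside the training loop at almost no extra cost.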

Here is one SpecAugment application I used with Coqui STT (DeepSpeech architecture) in the past (values based on the technical paper by @Francis_Tyers & Josh Meyer here ):

And here is how to use the (limited) SpecAugment support in OpenAI Whisper (my code and results are not yet ready & public):

In my experiments, I did not see any noticeable improvement with Whisper’s SpecAugment. But in one experiment with v13 or v14, I used the MUSAN corpus (on the whole train & dev splits, which resulted in ~600 GB of uncompressed data on disk) and WER dropped from 8.x to 5.x (I don’t remember exactly - it was more than I expected; I have the numbers in backups). However, training took more than twice as long, and I did not try it again. I think I first need to “control” my dataset better by moderating it (e.g. removing bad data, such as wrong recordings that passed validation, bad crackling, etc.). I think that will help more than augmentation; after that, I can use the method.

To say something more general, one would need to run it on multiple languages/datasets with several settings, methods, etc., but that takes quite a lot of time. I’ll check the papers to see if somebody has done that. It is on my summer to-do list, though…

About your first question (TTS usage): I don’t think it is a good idea; there are only some limited free voices out there, and that’s it - I wouldn’t bother… Actually, what OpenAI did with Whisper was similar (“borrowed” audio from YouTube videos). They called it “weakly supervised”, an invented semi-supervised technique, further extended with another invented “pseudo-labeled” data source - and that’s why the models get stuck at some point. Here are some of my views about these: 1, 2

I even have an English sentence with very low-voiced/whispered recordings (recorded and validated) in my test set - pufff.

Edit-1: Some additional notes / findings (also for me to read in detail):

  • You can change the speed of speech. But what if the person is already speaking very rapidly? Maybe a more intelligent idea would be to only speed up if the character-per-second rate is already low, or vice versa.
  • Here is a paper on using augmentation for L1/L2 speaker bias, it seems to help.
  • Here are some results on some models on LibriSpeech data (which is clean).
  • Here is a paper, more like a meta-analysis. As it states in the abstract, augmentation also increases the robustness of the model.
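The first note above can be sketched in a few lines: naive speed perturbation by resampling, gated by a characters-per-second heuristic. This is my own hypothetical illustration of the idea, not an established recipe; the `slow_cps` threshold and `factor` are made-up values, and a real pipeline would use sox or torchaudio, which handle anti-aliasing properly.

```python
import numpy as np

def change_speed(audio, factor):
    """Naive speed perturbation by resampling with linear interpolation.
    factor > 1 speeds speech up (output gets shorter)."""
    n_out = int(round(len(audio) / factor))
    old_idx = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(audio)), audio)

def maybe_speed_up(audio, transcript, sample_rate=16000,
                   slow_cps=12.0, factor=1.1):
    """Hypothetical heuristic from the note above: only speed up recordings
    whose characters-per-second rate is below a threshold."""
    duration_s = len(audio) / sample_rate
    cps = len(transcript) / duration_s
    return change_speed(audio, factor) if cps < slow_cps else audio
```

The "vice versa" case (slowing down fast speakers) would just be a mirrored check with `factor < 1`.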

Edit-2: Another note:

  • We know the problem that can be caused by using many recordings from individual voices: in most cases they will be similar in accent, speed, intonation, etc. If you apply augmentation to the train & dev sets, these voices will change in those respects, resulting in more “different-sounding” voices. So, if we are capping at N recordings per voice, we could increase this threshold with peace of mind.

To add to @bozden’s excellent summary - my opinion here is that Common Voice should remain the authoritative “clean” dataset which can be augmented. That is, it should not contain augments itself.

There are some open research questions here:

  • Do particular data augmentation strategies (e.g. pitch, prosody) affect accuracy for some voices more than others? For example, female-identifying people tend to speak at a different pitch to male-identifying people. We don’t know whether, for example, data augmentation works differently for various accents of speech - although I’d love to find out!

  • Does data augmentation help for particular phonemes which may be poorly recognised? For example, Common Voice is not controlled with respect to phonetic distribution - that is, the frequency distribution of phonemes may not match natural speech. Might data augmentation help with phonetic distribution or phonetic recognition? I’m not aware of research in this space.

  • A growing area of research is augmenting speech recognition with named entities (the names of people, products, and places) which, in Common Voice, don’t match the target deployment environment. For example, using a Common Voice-trained model for a fast food restaurant may not work, because there are not a lot of sentences like “could I please have an AwesomeBurger with SaltyFries and a ChokkoShake?” in Common Voice (in this example, AwesomeBurger, SaltyFries, and ChokkoShake are fictional named entities of items at a fast food restaurant).

But no, the Common Voice dataset itself should not be augmented.


@bozden @kathyreid Thank you both for your answers :pray:. I have a better picture of the topic now.