Data Augmentation using a Text to Speech Pipeline

Hey, I’ve seen that Mozilla has some data augmentation methods, mostly using Gaussian filters and other audio enhancement techniques.

Have there been any thoughts about data augmentation using a Text-to-Speech pipeline?

I was looking at Almost Unsupervised Text to Speech and Automatic Speech Recognition. I think for technical reasons this wouldn’t work with DeepSpeech, but the paper references some setups closer to DeepSpeech’s that could leverage a TTS model to generate audio files.

My use case involves a lot of domain-specific jargon and acronyms, so I wanted to know whether there are any options to feed in a list of words to bootstrap the system.

I’m also interested to see whether anyone has set up a voice-to-voice preprocessing step, such as the one described in Google’s Parrotron.

Not that we know of. I guess there might be some legal issues here; other TTS services may forbid doing this.

It may well be that a better-built language model addresses your needs better in this case.


You could use Mozilla’s TTS repo to do this though, right?

Using TTS for data augmentation is (I assume) less than ideal for training data, but it might be a good enough band-aid to bootstrap a model until you can retrain on real data.
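
Concretely, something like the sketch below is what I have in mind. The synthesize.py call is just a placeholder for whatever inference entry point the TTS checkpoint actually exposes; the CSV columns follow DeepSpeech’s training format (wav_filename, wav_filesize, transcript).

```python
import csv
import os
import subprocess

# Placeholder: domain-specific phrases/jargon the model should see.
PHRASES = [
    "the wafer fab reported low yield on the photolithography step",
    "run a rapid thermal anneal before chemical mechanical planarization",
]

OUT_DIR = "synthetic_data"
os.makedirs(OUT_DIR, exist_ok=True)

rows = []
for i, text in enumerate(PHRASES):
    wav_path = os.path.join(OUT_DIR, f"synth_{i:05d}.wav")
    # Placeholder invocation -- substitute the actual inference script/CLI
    # of the TTS checkpoint you are using.
    subprocess.run(
        ["python3", "synthesize.py", "--text", text, "--out_path", wav_path],
        check=True,
    )
    rows.append((wav_path, os.path.getsize(wav_path), text))

# DeepSpeech training CSVs use wav_filename, wav_filesize, transcript.
with open(os.path.join(OUT_DIR, "synthetic_train.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    writer.writerows(rows)
```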

Just to make sure I understand: would tweaking the language model be the best use of limited resources in the bootstrap scenario described above? I had figured that data augmentation using TTS for novel utterances would be a good use of resources, but I’m open-minded about any approach.

Thanks for the quick response!

I defer to @erogol regarding the status of this use case.

If I understand your use case correctly, I think the answer is yes. You may well have to rebuild an LM from scratch: the current one is huge and partly based on LibriSpeech, which draws on books from around the 1800s, so the English itself may not be a great fit. But adding your domain-specific jargon to the LM is clearly the best way; we and other contributors have verified this.

It might not be a quick and simple solution, but it is definitely more reliable and easier than the TTS-based data augmentation you had in mind.


Yes, we can do TTS-to-ASR, but it needs a good multi-speaker TTS model so that you can generate enough variety in your artificial dataset. So far I have not been able to work much on the multi-speaker case. There are some models I trained, but I have not tried them for ASR.

I also know people have used the same idea under the name of “cyclic consistency”.
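
As a rough illustration of the “variety” point, the sketch below just spreads each sentence over several voices; the synthesize() function is purely hypothetical, a stand-in for whichever multi-speaker model you end up plugging in.

```python
import random

# Hypothetical entry point: given a sentence and a speaker identity, return a
# synthesized waveform. This is not a real API from the TTS repo, just a
# stand-in for whatever multi-speaker model you use.
def synthesize(text: str, speaker_id: int) -> bytes:
    raise NotImplementedError("plug in your multi-speaker TTS model here")

NUM_SPEAKERS = 50          # assumed number of voices the model supports
COPIES_PER_SENTENCE = 5    # how many different voices read each sentence

def make_variants(sentences):
    """Yield (text, speaker_id, audio) triples with per-sentence speaker variety."""
    for text in sentences:
        for speaker_id in random.sample(range(NUM_SPEAKERS), COPIES_PER_SENTENCE):
            yield text, speaker_id, synthesize(text, speaker_id)
```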


Sounds good, I think the information as presented works for my purposes.

In case anybody else lands here: it looks like NVIDIA has successfully implemented TTS-based data augmentation for speech recognition as part of their OpenSeq2Seq repo. I’m not sure whether there would be any licensing issues between Apache 2.0 and MPL 2.0, but see below:

https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition/synthetic_dataset.html#training-with-synthetic-data

I was looking for some additional details on customizing the language model (used during inference) so that I can add technical terms specific to my dataset.
Using the documentation here I can generate a new lm.binary and vocab-50000.txt. My questions are about the librispeech-lm-norm.txt file:

  1. Can I just edit and add to this file?
  2. How is the librispeech-lm-norm.txt file generated? (Sorry, I could not find any docs on this in the LibriSpeech dataset.)
  3. If my dataset is related to, say, the semiconductor industry, can I use related journal material to augment librispeech-lm-norm.txt?

Search this forum a bit; you’ll find a lot about custom language models and adapting them to specific use cases. For the 0.7 branch you’ll have to build a scorer instead of the trie and lm.binary, but the generate… scripts do the heavy lifting for you. As the input txt you need large plain-text files with your words used in varied combinations. The more the merrier.
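
To make that concrete, here is a rough sketch of the corpus side, assuming the 0.7 data/lm/generate_lm.py helper with the flag values from its documented example (double-check them against your checkout; the domain file names and the normalization rule are placeholders/assumptions).

```python
import gzip
import os
import re
import subprocess

# Placeholder inputs: the released LibriSpeech LM corpus plus your own
# domain text (journal articles, manuals, transcripts, ...).
LIBRISPEECH_CORPUS = "librispeech-lm-norm.txt.gz"
DOMAIN_TEXT = "semiconductor_journals.txt"
COMBINED = "combined-lm-corpus.txt.gz"

def normalize(line: str) -> str:
    """Roughly match the librispeech-lm-norm style (assumed here:
    upper-case, punctuation stripped, one sentence per line)."""
    line = line.upper()
    line = re.sub(r"[^A-Z' ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()

with gzip.open(COMBINED, "wt", encoding="utf-8") as out:
    # Keep the original corpus so common English is still covered ...
    with gzip.open(LIBRISPEECH_CORPUS, "rt", encoding="utf-8") as f:
        for line in f:
            out.write(line)
    # ... and append the domain-specific sentences (the more, the merrier).
    with open(DOMAIN_TEXT, encoding="utf-8") as f:
        for line in f:
            norm = normalize(line)
            if norm:
                out.write(norm + "\n")

# Build lm.binary and the vocab file with the 0.7 helper script
# (flag values taken from the documented example; adjust to taste).
os.makedirs("lm_out", exist_ok=True)
subprocess.run([
    "python3", "data/lm/generate_lm.py",
    "--input_txt", COMBINED,
    "--output_dir", "lm_out",
    "--top_k", "500000",
    "--kenlm_bins", "kenlm/build/bin",
    "--arpa_order", "5",
    "--max_arpa_memory", "85%",
    "--arpa_prune", "0|0|1",
    "--binary_a_bits", "255",
    "--binary_q_bits", "8",
    "--binary_type", "trie",
], check=True)

# Last step: package lm.binary + the vocab file into a .scorer with the
# scorer-packaging tool shipped with the 0.7 branch.
```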