From what I noticed trying to train with a lower batch_size, if you see a good alignment and then it breaks, it almost surely won't align back.
Same here. From my experience testing TTS and different Tacotron versions, I think it's better to throw away data rather than lower the batch size. With TTS it's really easy to find a good balance using the max length.
For Tacotron2 (not TTS), what I did was sort the text using a text editor and remove the longer sentences manually; most of the time just a few very long sentences ruin the whole thing.
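In case it helps, here's a minimal sketch of that kind of filtering, assuming an LJSpeech-style metadata.csv with pipe-separated fields (the filename, column index and character threshold are just placeholders, adjust them to your own transcript format):

```python
# Rough sketch: drop the longest transcripts from an LJSpeech-style metadata.csv.
# Assumes pipe-separated lines like "id|raw text|normalized text"; adjust as needed.
MAX_CHARS = 200  # character threshold, pick whatever your GPU tolerates

with open("metadata.csv", encoding="utf-8") as f:
    lines = f.readlines()

# Keep only lines whose (last) text field is short enough.
kept = [line for line in lines if len(line.strip().split("|")[-1]) <= MAX_CHARS]
print(f"kept {len(kept)} of {len(lines)} sentences")

with open("metadata_filtered.csv", "w", encoding="utf-8") as f:
    f.writelines(kept)
```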
I did read that; I was wondering if someone could shine a light on the values and their direct implications on memory, speed and alignment time for this implementation. (If anyone has logged that.)
I haven't removed the sentences, but I have decreased the max seq len to 200. Still not able to run r=1 at batch_size 32, though.
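For reference, this is roughly how I tweak those knobs before a run; the key names ("max_seq_len", "batch_size", "r") come from the config.json version I have locally, so double-check them against yours:

```python
# Sketch: adjust the relevant training knobs in TTS's config.json.
# Key names may differ between repo versions; verify against your config.
import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

config["max_seq_len"] = 200  # cap input length
config["batch_size"] = 32
config["r"] = 1              # decoder reduction factor

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)
```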
Hope that’s a yes. I’d love to see what you’re working on and how it’s working out for you.
I sort of recall using 100/200 with a K80 11GB on a single GPU; then when I tried dual GPU it required lowering the max length a bit. Do you get the same results using a single GPU?
I’ve removed everything that touches the prediction and now it’s working fine. As for T2_output_range, I think it’s OK to call it the “output/target scale”, or am I wrong?
I see gaps in the alignment; I saw the same gaps while I was training TTS, so I guess they are data related. About the audio, I don’t hear a significant improvement from 10k to 25k steps, and I don’t hear expressive speech on questions and special characters either. I think it’s related to the speaker, since the source voice is so flat. I’m cutting a more expressive female speech to adapt using the trained model; hopefully the issue is not LPCNet being unable to be expressive over Tacotron predictions.
I haven’t tried yet; I am going to let this model train on 2000 sentences and see what r=1 actually gives me in terms of quality of the generated audio. (According to 3.3 in the paper, they’ve only discussed the major pros of having r>1 and not what the tradeoffs are, if any.)
I’ve trained it for around 30k steps and the quality is much better than what I had at r=2, but not better than WaveRNN. I have to figure out some way to make it hog less VRAM so that I can actually train the Taco2 on my entire dataset (followed by another WaveRNN training session).
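On the VRAM point, one generic workaround (not something this repo does out of the box, as far as I know) is gradient accumulation: keep the per-step batch small and only call optimizer.step() every N batches, which mimics a larger effective batch without the memory cost. Rough PyTorch sketch, where model, loader, criterion and optimizer stand in for whatever your training script already builds:

```python
# Rough PyTorch sketch of gradient accumulation; model, loader, criterion and
# optimizer are placeholders for the objects your training script already has.
accum_steps = 4  # effective batch = batch_size * accum_steps

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = criterion(outputs, targets) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```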