I see a few differences in our configs.
Let me just list out what I would do:
- Train taco2 long enough that it generates decent mel specs for both test and train inputs, i.e., make sure your taco2 is trained until it's a decent TTS with Griffin-Lim. I don't mean the quality of the voice, but how easily words are discernible, intonation, etc. (for clarity/quality, we'll train wavernn). There's a Griffin-Lim sanity-check sketch after this list.
- Get erogol's fork, the latest branch (I haven't tried fatchord's, so I can't comment on it). There are a few parameters missing from your config file, without which it should throw errors. I'll have to wait till Monday to give you the exact params, but you shouldn't have a problem adding them in while debugging the errors. Add those in and you should be good. (The config-diff sketch after this list is one way to find them yourself in the meantime.)
- Also, I trim silences when training taco2, and I also had to trim silences for my wavernn; this was the major issue behind the gibberish output. (See the trimming sketch after this list.) Mrgloom on the GitHub issues page (wavernn + tacotron2 experiments) had something to say about this, and he should be right with his premise that the mel specs aren't guaranteed to be aligned with the audio (because of trimming).
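As a rough way to run that Griffin-Lim check, here's a minimal sketch assuming you dump a predicted mel to .npy and have librosa installed. The filenames and audio params are placeholders (use your taco2 config's values), and it assumes a plain power-scale mel; if your model outputs normalized log-mels, undo that first.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical dump of a predicted mel from your taco2 eval, shape (n_mels, frames).
mel = np.load("generated_mel.npy")

# Griffin-Lim inversion of the mel spec. sr/n_fft/hop_length are
# assumptions; they must match whatever your taco2 config uses.
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60
)
sf.write("gl_check.wav", wav, 22050)
```

If the words are discernible in gl_check.wav, taco2 is in decent shape.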
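Until I can send the exact missing params, a quick way to spot them yourself is to diff your config against the default one shipped in the fork. A minimal sketch with placeholder filenames (it only compares top-level keys):

```python
import json

with open("my_config.json") as f:            # your current config
    mine = json.load(f)
with open("fork_default_config.json") as f:  # default config from erogol's fork
    ref = json.load(f)

print("missing keys:", sorted(set(ref) - set(mine)))
print("extra keys:  ", sorted(set(mine) - set(ref)))
```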
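On the trimming point, the idea is to trim once, up front, and compute both the taco2 mels and the wavernn training audio from the same trimmed wav, so the mels and the audio can't drift out of alignment. A minimal sketch assuming librosa-based preprocessing; top_db is an assumption you'd tune per dataset:

```python
import librosa
import soundfile as sf

wav, sr = librosa.load("sample.wav", sr=22050)

# Strip leading/trailing silence. top_db=40 is a guess; raise it if it
# clips speech, lower it if silence survives.
trimmed, _ = librosa.effects.trim(wav, top_db=40)

sf.write("sample_trimmed.wav", trimmed, sr)
# Feed sample_trimmed.wav to BOTH the taco2 and wavernn pipelines.
```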
This is all I have for you right now. I’ll keep adding as and when I remember.
Good luck!
Also, when training from scratch, there is some semblance of human speech within 7k steps, albeit with a lot of noise; that should tell you fairly early on whether something is fundamentally wrong (YMMV on this).
Just FYI, I'm getting high-quality results after 130k steps. Top notch!