Hello. I have trouble understanding a certain line in the decoder forward method. This one: https://github.com/mozilla/TTS/blob/924d6ad4e55e7763e61933bf3542dae1e892c369/layers/tacotron.py#L405-L417
```python
memory_input, attention_rnn_hidden, decoder_rnn_hiddens, \
    current_context_vec, attention, attention_cum = self._init_states(inputs)
while True:
    if t > 0:
        if memory is None:
            new_memory = outputs[-1]
        else:
            new_memory = memory[t - 1]
        # Queuing if memory size defined else use previous prediction only.
        if self.memory_size > 0:
            memory_input = torch.cat(
                [memory_input[:, self.r * self.memory_dim:].clone(), new_memory],
                dim=-1)
        else:
            memory_input = new_memory
```
So my question is: what does this concatenation do? We have `new_memory` of shape (B, r*memory_dim), and at step t=1, when we first enter this branch, the `memory_input` we are concatenating with has shape (B, memory_size*memory_dim). The slice `memory_input[:, self.r * self.memory_dim:]` drops the first r*memory_dim columns, so when `memory_size` is left at its default of `memory_size = r`, this line literally concatenates `new_memory` with an empty tensor. The only situation in which it actually concatenates with something is when `memory_size` is higher than `r`, am I right? Shouldn't the condition be changed to `if self.memory_size > r` or something like that in this case?
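To make my point concrete, here is a tiny sketch of just the slicing and concatenation, with toy values for B, memory_dim, and r (not the actual model code):

```python
import torch

B, memory_dim, r = 2, 4, 5

# Case 1: the default, memory_size == r
memory_size = r
memory_input = torch.zeros(B, memory_size * memory_dim)
new_memory = torch.ones(B, r * memory_dim)  # stands in for the last prediction

kept = memory_input[:, r * memory_dim:]  # shape (B, 0) -- an empty slice
memory_input = torch.cat([kept, new_memory], dim=-1)
# memory_input is now identical to new_memory; the old frames contribute nothing

# Case 2: memory_size > r, where the slice actually keeps older frames
memory_size = 2 * r
memory_input2 = torch.zeros(B, memory_size * memory_dim)
kept2 = memory_input2[:, r * memory_dim:]  # shape (B, (memory_size - r) * memory_dim)
memory_input2 = torch.cat([kept2, new_memory], dim=-1)
# memory_input2 keeps shape (B, memory_size * memory_dim): a sliding queue
```

In case 1, `kept` has zero columns, so the `torch.cat` is a no-op around `new_memory`, which is exactly the degenerate behavior I am asking about.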
Maybe I am missing something, but I have been stuck on this line for more than half an hour already and I cannot see a different interpretation. I would be glad for clarification.