How do I create a preprocess script for a custom dataset?

Hi. I am new to this area, so I searched the wiki FAQ but didn’t find a solution for preprocessing a custom dataset. I want to use MozillaTTS with my own Spanish dataset. Could you please help me? Thanks a lot.

I’m confused. Did you read my answer above too?

And you looked at the rest of the wiki, right? It’s not that much to skim.

Anyway, to save you time, here’s the page you need to understand the processing: https://github.com/mozilla/TTS/wiki/Dataset

The preprocessing being discussed here is loading the data, so it generally won’t depend on the language; it’ll depend on the format of your data.
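To make that concrete, here’s a minimal sketch of what a loader for your own dataset might look like, modelled on the existing loaders in TTS/tts/datasets/preprocess.py. The function name, speaker label and pipe-separated metadata.csv layout are all assumptions you’d adapt to your data:

import os

def my_spanish_dataset(root_path, meta_file):
    # Hypothetical loader: assumes an LJSpeech-style metadata.csv with
    # one "clip_id|transcription" pair per line and audio under wavs/.
    items = []
    speaker_name = "my_spanish_speaker"  # placeholder label
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items

Each item is a [text, wav_path, speaker_name] triple, which is the shape the existing loaders return to the training scripts.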

One possibility is that you’re thinking more about the cleaning functions for your text; if so, you’d need to look at the code here (especially cleaners.py). I haven’t worked with non-English transcriptions, but I’m guessing it would be similar(ish) for Spanish, though there are bound to be some language differences. If that’s what you’re after advice on, I know there are others here who’ve worked in other languages, so maybe they can help. It’s probably still worth looking over the code I link to, so you’ll have an idea of what it’s doing with English and can then think about what would be different for Spanish.
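For instance, the English cleaner chains a handful of small helpers. A Spanish version might keep accented characters rather than transliterating to ASCII. This is only a sketch: spanish_cleaners is a hypothetical function you’d add yourself, with the helpers inlined here so the snippet stands alone:

import re

_whitespace_re = re.compile(r"\s+")

def lowercase(text):
    return text.lower()

def collapse_whitespace(text):
    return _whitespace_re.sub(" ", text)

def spanish_cleaners(text):
    # Unlike english_cleaners, deliberately skip convert_to_ascii so that
    # á, é, í, ó, ú and ñ survive; number expansion would also need a
    # Spanish-specific implementation rather than the English one.
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text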

Hi. I tried MozillaTTS on a PC (Windows 10) and had no problems synthesizing some audio using a demo you published on GitHub. But when I tried MozillaTTS on another PC (Windows 10), synthesis wasn’t possible; the message was:

Key already registered with the same priority: GroupSpatialSoftmax

What can I do to solve this problem?
Thanks a lot

@luis.vera.heredia you would need to look at why the set-up differs between the two computers.

What you’ve given above isn’t really enough to help diagnose the issue - it’s all rather vague. At a minimum, you’d need to confirm the hardware set-up that’s relevant to the GPU, that you’ve got CUDA installed properly, the Python environment details, and the command and model details for what you’re running.
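As a starting point, a quick check you could run on both machines (nothing here beyond standard PyTorch calls) would show whether the two environments even agree on the basics:

import sys
import torch

print("Python :", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)
    print("GPU       :", torch.cuda.get_device_name(0))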

You may be able to get some of the Python details with a tool I put together to help with this sort of thing, https://github.com/nmstoker/gatherup

Also, I should mention that I’m not using TTS directly on Windows myself, so we may run into the limits of my experience, but others may have insights or pointers (provided you share sufficient detail to be useful).

Often doing this kind of task on Windows is more of a challenge, although I don’t want to deter you if you’re aware of that and still keen to try.

OK. On the first PC I have:

Windows 10, 8 GB RAM (CPU only)
HP Pavilion Gaming, Core i5
NVIDIA GeForce GTX 1650
1 GB adapter RAM

The second PC:
Windows 10, 64 GB RAM
AMD Ryzen Threadripper 3990X 64-core processor
ROG STRIX TRX40-E GAMING
GPU: NVIDIA GeForce RTX 2080 Ti rev. A
I have installed CUDA 10.1 and cuDNN for CUDA 10.1

Thanks a lot.

This is the code I tried to execute on the second PC (on the first one I didn’t have problems):

import os
import sys
import torch
import time
import IPython

# for some reason TTS installation does not work on Colab
sys.path.append('TTS_repo')

from TTS.tts.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.tts.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis


def interpolate_vocoder_input(scale_factor, spec):
    """Interpolation to tolerate the sampling rate difference
    between tts model and vocoder"""
    print(" > before interpolation :", spec.shape)
    spec = torch.tensor(spec).unsqueeze(0).unsqueeze(0)
    spec = torch.nn.functional.interpolate(spec, scale_factor=scale_factor, mode='bilinear').squeeze(0)
    print(" > after interpolation :", spec.shape)
    return spec


def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
    print(mel_postnet_spec.shape)
    mel_postnet_spec = ap.denormalize(mel_postnet_spec.T).T
    if not use_gl:
        target_sr = VOCODER_CONFIG.audio['sample_rate']
        vocoder_input = ap_vocoder.normalize(mel_postnet_spec.T)
        if scale_factor[1] != 1:
            vocoder_input = interpolate_vocoder_input(scale_factor, vocoder_input)
        else:
            vocoder_input = torch.tensor(vocoder_input).unsqueeze(0)
        waveform = vocoder_model.inference(vocoder_input)
    if use_cuda and not use_gl:
        waveform = waveform.cpu()
    if not use_gl:
        waveform = waveform.numpy()
    waveform = waveform.squeeze()
    rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)
    tps = (time.time() - t_1) / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(time.time() - t_1))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=VOCODER_CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform


# runtime settings
use_cuda = False

# model paths
TTS_MODEL = "tts_model.pth.tar"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.pth.tar"
VOCODER_CONFIG = "config_vocoder.json"

# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)
VOCODER_CONFIG.audio['stats_path'] = 'scale_stats_vocoder.npy'

# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio)
ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])

# scale factor for sampling rate difference
scale_factor = [1, VOCODER_CONFIG['audio']['sample_rate'] / ap.sample_rate]
print(f"scale_factor: {scale_factor}")

# LOAD TTS MODEL
# multi speaker
speaker_id = None
speakers = []

# load the model
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)

# load model state
cp = torch.load(TTS_MODEL, map_location=torch.device('cpu'))

# load the model weights
model.load_state_dict(cp['model'])
if use_cuda:
    model.cuda()
model.eval()

# set model stepsize
if 'r' in cp:
    model.decoder.set_r(cp['r'])

from TTS.vocoder.utils.generic_utils import setup_generator

# LOAD VOCODER MODEL
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location="cpu")["model"])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0

if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval()

sentence = "rápido corren los carros cargados de fierro del ferrocarril"
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

from scipy.io.wavfile import write
sr = 24000
write("ejemplo.wav", sr, wav)

I tried to execute python TTS/bin/compute_statistics.py --config_path TTS/tts/configs/config.json --out_path LJSpeech-1.1/scale_stats.npy

The error was the same =(

I’m at work now, so I’ll need to look further this evening, but could you clarify a bit more:

  1. Where did you get that code from? Is it straight from the repo or some other source? (I wasn’t expecting you to paste a tonne of code; I was more thinking you’d say you’d been trying one of the scripts.)

  2. What about your Python environment and how you set it up?

  3. What about details of the model?

Also, it might be easier for you to first try the installation as explained in the repo README and then try out something like the terminal tools.

That way you can figure out whether you’ve got the set-up right to recreate basic speech output before going further.
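For example, a first synthesis run from the terminal would be something along these lines - but I’m writing the arguments from memory, so check python TTS/bin/synthesize.py --help on your checkout for the exact interface (on some versions the arguments are positional rather than flags):

python TTS/bin/synthesize.py --text "Hola mundo" --model_path tts_model.pth.tar --config_path config.json --out_path out.wav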

It was hard, but I was able to solve the problem on the second PC with the GPU. It was a problem with PyTorch. I could execute the code above and compute_statistics.py.

I only get an error when I try to execute train_tacotron.py, because of this message:

(TF2) C:\Users\Voice-trainner\MozillaTTS\TTS>python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json
2021-02-16 13:52:20.737504: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll

Using CUDA: False
Number of GPUs: 0
Mixed precision mode is ON
Git Hash: e9e0784
Experiment folder: Models/LJSpeech/ljspeech-ddc-February-16-2021_01+52PM-e9e0784
Setting up Audio Processor…
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:scale_stats.npy
| > hop_length:256
| > win_length:1024
| > Found 13100 files in C:\Users\Voice-trainner\MozillaTTS\TTS\LJSpeech-1.1
Using model: Tacotron2
! Run is removed from Models/LJSpeech/ljspeech-ddc-February-16-2021_01+52PM-e9e0784
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 524, in main
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
AttributeError: module 'torch.cuda' has no attribute 'amp'

I have PyTorch 1.4. I searched the forum and found the same error reported with PyTorch 1.6 and 1.7.

Glad you’ve made progress :slightly_smiling_face:

It looks like you now know your error’s ultimate source: you’ve got the wrong version of torch, and must’ve done something wrong with the installation if you’ve ended up with 1.4 (as you’ll see in requirements.txt, it needs to be > 1.5).

Oh, and the other thing is that the code isn’t using your GPU (see the output - it says Using CUDA: False and Number of GPUs: 0).

So you need to sort that out too

Hi again. I don’t understand why TTS doesn’t recognize the GPU, but… it’s training =)

I would like to know how I can get the following files when training is finished:

vocoder_model.pth.tar
scale_stats_vocoder.npy

Thanks again

Regarding the GPU: when I tried torch.device and map_location with "cuda:0" instead of "cpu", the message was: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. It’s weird, because TTS recognizes the CUDA DLL.
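One pattern that sidesteps this crash on any machine (just a sketch, not code from the TTS scripts themselves) is to pick the device based on what PyTorch can actually see:

import torch

# Fall back to CPU when no CUDA device is visible, so torch.load never
# tries to deserialize tensors onto a device that isn't there.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
cp = torch.load("tts_model.pth.tar", map_location=device)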

I would like your help, because otherwise training will take far too long.

What is your platform/OS and how did you install PyTorch?

Hi. I am working on Windows 10. I have two NVIDIA GeForce RTX 2080 Ti rev. A GPUs. I have installed CUDA 10.1 and cuDNN 10.1.

I installed PyTorch this way:

conda install pytorch=01.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch

When I tried with this, I couldn’t train the model because of this message:

Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 505, in main
init_distributed(args.rank, num_gpus, args.group_id,
File "c:\users\voice-trainner\proyecto\tts\TTS\utils\distribute.py", line 69, in init_distributed
dist.init_process_group(
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'

As far as I can tell, this error is because PyTorch doesn’t support multi-GPU training on Windows through torch.nn.parallel.DistributedDataParallel. How should I modify the scripts in MozillaTTS to use both GPUs on Windows without uninstalling one of them?
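As an aside, a way to restrict PyTorch to a single GPU without removing hardware is to hide the second card via CUDA_VISIBLE_DEVICES. This is a general CUDA mechanism, not something specific to MozillaTTS, and it must be set before anything initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

import torch
print(torch.cuda.device_count())  # should now report 1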

Well, I uninstalled one of the GPUs, but the message was:

python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json
2021-02-17 16:53:06.621290: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll

Using CUDA: True
Number of GPUs: 1
Mixed precision mode is ON
Git Hash: e9e0784
Experiment folder: Models/LJSpeech/ljspeech-ddc-February-17-2021_04+53PM-e9e0784
Setting up Audio Processor…
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:scale_stats.npy
| > hop_length:256
| > win_length:1024
| > Found 13100 files in C:\Users\Voice-trainner\MozillaTTS\TTS\LJSpeech-1.1
Using model: Tacotron2

Model has 47914548 parameters

DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 187
| > Min length sequence: 5
| > Avg length sequence: 98.3403500655409
| > Num. instances discarded by max-min (max=153, min=6) seq limits: 476
| > Batch group size: 16.

EPOCH: 0/1000

Number of output frames: 7
TRAINING (2021-02-17 16:53:17)
2021-02-17 16:53:18.451386: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Using CUDA: True
Number of GPUs: 1
2021-02-17 16:53:21.205552: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Using CUDA: True
Number of GPUs: 1
2021-02-17 16:53:23.927322: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Using CUDA: True
Number of GPUs: 1
2021-02-17 16:53:26.653252: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Using CUDA: True
Number of GPUs: 1
! Run is removed from Models/LJSpeech/ljspeech-ddc-February-17-2021_04+53PM-e9e0784
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 165, in train
decoder_output, postnet_output, alignments, stop_tokens, decoder_backward_output, alignments_backward = model(
File "C:\Users\Voice-trainner\anaconda3\envs\TF2\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "c:\users\voice-trainner\proyecto\tts\TTS\tts\models\tacotron2.py", line 148, in forward
encoder_outputs = self.encoder(embedded_inputs, text_lengths)
File "C:\Users\Voice-trainner\anaconda3\envs\TF2\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "c:\users\voice-trainner\proyecto\tts\TTS\tts\layers\tacotron2.py", line 109, in forward
o, _ = self.lstm(o)
File "C:\Users\Voice-trainner\anaconda3\envs\TF2\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Voice-trainner\anaconda3\envs\TF2\lib\site-packages\torch\nn\modules\rnn.py", line 579, in forward
result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

I think it would help if you confirmed what you’ve got installed within your environment, in case you’ve accidentally messed something up.
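For instance, since the crash is coming from cuDNN, it would be worth printing exactly what your environment pairs together (standard PyTorch introspection calls):

import torch

print(torch.__version__)               # should be 1.6.0 per your install command
print(torch.version.cuda)              # CUDA build PyTorch was compiled against
print(torch.backends.cudnn.version())  # cuDNN build PyTorch was compiled against
print(torch.cuda.is_available())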

I notice you’ve got what appears to be a typo in the text you said you used to install PyTorch. Could you copy/paste that again so we can be sure you used 1.6 rather than 0.1.6 (which is also a version, but way too old)? It’s always good to copy/paste what you can, as it cuts down on transcription errors (which will just confuse everyone further!)

Thanks!

Sorry, that was a typing error. The command was:

conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch