I have a dataset of around 1,600 samples, and I've noticed that in order to fine-tune a TTS model I need a preprocessor. The information on this is pretty scarce, so I was wondering if anyone could give me some pointers.
Also, when do I run it?
Thanks!
Hi @Rio
Have you looked in the wiki?
These preprocessors are simply a bit of code to get your data (in whatever directory and file format you have) loaded into the training programme.
You don't mention anything about how your data is currently set up, but if you can, you might find it easier to organise your data into the format of an existing preprocessor - the one I use is LJSpeech.
It's worth looking over the code in dataset/preprocess.py: it's nothing too complicated and it should be easy to see what it's doing. You don't run it separately; it's called when your data is loaded, and the preprocessor you pick in your config is the one that gets used.
A quick overview of the LJSpeech layout: there's a folder /wavs for all your wav files (as you'd probably guessed!) and two CSV files (training and validation) with rows made up of the corresponding filename stem (without the .wav extension) and the text corresponding to the audio. It's actually a pipe-separated file (not really CSV). LJSpeech has both the normalised text and the raw text - if yours is already normalised then these can be the same.
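To make that concrete, here's a rough sketch of the kind of preprocessor function I mean. The function name is mine and the exact return structure the loader expects may differ between versions, so copy the real format from an existing preprocessor in dataset/preprocess.py rather than from this:

import os

def my_ljspeech_style(root_path, meta_file):
    """Parse a pipe-separated metadata file with rows like
    LJ001-0001|raw text here|normalised text here
    and pair each row with its wav file in root_path/wavs/."""
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[-1]  # use the normalised text column
            items.append([text, wav_file])
    return items

The real preprocessors also carry things like a speaker name for multi-speaker setups, so check an existing one before writing your own.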
If you get stuck or can't visualise it from this description then perhaps it's worth downloading LJSpeech so you can see the way it's organised. As an added bonus, then you'll be able to do some initial training with LJSpeech data - I'd strongly recommend that over jumping into something you've never done before with brand new data, which is a recipe for confusion. Using LJSpeech for an initial run means you'll get the hang of the basics, flush out human error on your part, and be reasonably sure that you're not running into issues due to your data, because you're working with a known good dataset.
I'd suggest having a decent look over the repo too. Generally, proof of effort improves the chances of others helping you.
Maybe take a peek at the dev branch too, because the README there has been smartened up, so you should find links to the info you need quite easily.
Hope that's a start at least.
Thank you for the help!
I was doing something similar with the LJSpeech layout, although you have definitely sped up my understanding of it.
I didn't notice the dev branch for whatever reason during my scouring of the repo, although rest assured I looked around. Thanks for the help and stay safe!
Glad to help!
BTW, since I wrote the answer above, the README in master has been updated from dev, so you should be good looking in master now.
Hi. I am new in this area, so I was searching the wiki FAQ, but I didn't find a solution for preprocessing a custom dataset. I want to use MozillaTTS with my own Spanish dataset. Could you please help me? Thanks a lot.
Am confused. Did you read my answer above too?
And you looked at the rest of the wiki, right? It's not that much to skim.
Anyway, to save you time, hereâs the page you need to understand processing https://github.com/mozilla/TTS/wiki/Dataset
Preprocessing, as discussed here, is about loading the data, so it's generally not going to depend on the language; it'll depend on the format of your data.
One possibility is that you're thinking more about the cleaning functions for your text, and if so you'd need to look at the code here (especially cleaners.py). I haven't worked with non-English transcriptions, but I'm guessing it would be similar(ish) with Spanish, though there are bound to be some language differences. If it's this you're after advice on, I know there are others here who've worked in other languages, so maybe they can help. It's probably still worth looking over the code I link to, so you'll have an idea of what it's doing with English and can then think about what would be different for Spanish.
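If it is the cleaning side you're after, a hypothetical spanish_cleaners could follow the same pattern as the functions in cleaners.py: lowercase, expand abbreviations and numbers, collapse whitespace. Nothing like this ships in the repo as far as I know, and the abbreviation list below is made up, so treat it only as a sketch of the shape such a function might take:

import re

_whitespace_re = re.compile(r"\s+")

# Illustrative Spanish abbreviations only; a real cleaner would need a fuller list
# plus number expansion, which is language specific and omitted here.
_abbreviations = [(re.compile(r"\b%s\." % abbr, re.IGNORECASE), full) for abbr, full in [
    ("sr", "señor"),
    ("sra", "señora"),
    ("dr", "doctor"),
    ("ud", "usted"),
]]

def spanish_cleaners(text):
    text = text.lower()
    for regex, replacement in _abbreviations:
        text = regex.sub(replacement, text)
    text = _whitespace_re.sub(" ", text).strip()
    return text

print(spanish_cleaners("El Dr. Pérez  y la Sra. López"))  # -> "el doctor pérez y la señora lópez"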
Hi. I tried MozillaTTS on one PC (Windows 10) and I didn't have problems synthesizing some audio using one of the demos you published on GitHub. But when I tried MozillaTTS on another PC (Windows 10), synthesis wasn't possible; the message was:
key already registered with the same priority: GroupSpatialSoftmax
What can I do to solve this problem?
Thanks a lot
@luis.vera.heredia you would need to look at why the setup differs between the two computers.
What you've given above isn't really enough to help diagnose the issue; it's all rather vague. You'd need to confirm, at a minimum, the hardware relevant to the GPU, that you've got CUDA installed properly, the Python environment details, and the command and model details for what you're running.
You may be able to get some of the Python details with a tool I put together to help with this sort of thing: https://github.com/nmstoker/gatherup
Also, I should mention that I'm not using TTS directly on Windows myself, so we may run into the limits of my experience, but others may have insight or pointers (provided you share sufficient detail to be useful).
Often doing this kind of task on Windows is more of a challenge, although I don't want to deter you if you're aware of that and still keen to try.
OK. In the first PC I have:
Windows 10, RAM 8 Gb (only CPU)
HP Pavilion Gaming Core i5
NVIDIA Geforce GTX 1650
1 Gb RAM Adaptor
The second PC:
Windows 10, RAM 64 Gb
AMD Ryzen Threadripper 3990X 64 Core Processor
ROG STRIX TRX40-E GAMING
GPU: NVIDIA GeForce RTX 2080 Ti vA
I have installed CUDA 10.1 and CUDNN-10.1
Thanks a lot.
This is the code I try to execute on the second PC (on the first one I didn't have problems):
import os
import sys
import time

import torch
import IPython

sys.path.append('TTS_repo')

from TTS.tts.utils.generic_utils import setup_model
from TTS.utils.io import load_config
from TTS.tts.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis


def interpolate_vocoder_input(scale_factor, spec):
    """Interpolate to tolerate the sampling rate difference between the TTS model and the vocoder."""
    print(" > before interpolation :", spec.shape)
    spec = torch.tensor(spec).unsqueeze(0).unsqueeze(0)
    spec = torch.nn.functional.interpolate(spec, scale_factor=scale_factor, mode='bilinear').squeeze(0)
    print(" > after interpolation :", spec.shape)
    return spec


def tts(model, text, CONFIG, use_cuda, ap, use_gl, figures=True):
    t_1 = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
    print(mel_postnet_spec.shape)
    mel_postnet_spec = ap.denormalize(mel_postnet_spec.T).T
    if not use_gl:
        # run the vocoder instead of Griffin-Lim
        target_sr = VOCODER_CONFIG.audio['sample_rate']
        vocoder_input = ap_vocoder.normalize(mel_postnet_spec.T)
        if scale_factor[1] != 1:
            vocoder_input = interpolate_vocoder_input(scale_factor, vocoder_input)
        else:
            vocoder_input = torch.tensor(vocoder_input).unsqueeze(0)
        waveform = vocoder_model.inference(vocoder_input)
    if use_cuda and not use_gl:
        waveform = waveform.cpu()
    if not use_gl:
        waveform = waveform.numpy()
    waveform = waveform.squeeze()
    rtf = (time.time() - t_1) / (len(waveform) / ap.sample_rate)
    tps = (time.time() - t_1) / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(time.time() - t_1))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=VOCODER_CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform


# runtime settings and model/config paths
use_cuda = False
TTS_MODEL = 'tts_model.pth.tar'
TTS_CONFIG = 'config.json'
VOCODER_MODEL = 'vocoder_model.pth.tar'
VOCODER_CONFIG = 'config_vocoder.json'

# load configs and audio processors
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)
VOCODER_CONFIG.audio['stats_path'] = 'scale_stats_vocoder.npy'
ap = AudioProcessor(**TTS_CONFIG.audio)
ap_vocoder = AudioProcessor(**VOCODER_CONFIG['audio'])
scale_factor = [1, VOCODER_CONFIG['audio']['sample_rate'] / ap.sample_rate]
print(f"scale_factor: {scale_factor}")

# load the TTS model
speaker_id = None
speakers = []
num_chars = len(phonemes) if TTS_CONFIG.use_phonemes else len(symbols)
model = setup_model(num_chars, len(speakers), TTS_CONFIG)
cp = torch.load(TTS_MODEL, map_location=torch.device('cpu'))
model.load_state_dict(cp['model'])
if use_cuda:
    model.cuda()
model.eval()
if 'r' in cp:
    model.decoder.set_r(cp['r'])

# load the vocoder model
from TTS.vocoder.utils.generic_utils import setup_generator
vocoder_model = setup_generator(VOCODER_CONFIG)
vocoder_model.load_state_dict(torch.load(VOCODER_MODEL, map_location='cpu')['model'])
vocoder_model.remove_weight_norm()
vocoder_model.inference_padding = 0
if use_cuda:
    vocoder_model.cuda()
vocoder_model.eval()

# synthesize a sentence and save it to disk
sentence = 'rápido corren los carros cargados de fierro del ferrocarril'
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

sr = 24000
from scipy.io.wavfile import write
write('ejemplo.wav', sr, wav)
I tried to execute TTS/bin/compute_statistics.py --config_path TTS/tts/configs/config.json --out_path LJSpeech-1.1/scale_ststs.npy
The error was the same =(
I'm at work now so I'll need to look further this evening, but could you clarify a bit more:
Where did you get that code from? Is it direct from the repo or some other source? (I wasn't expecting you to paste a tonne of code; I was more thinking you'd say you'd been trying one of the scripts.)
What about your Python environment and how you set it up?
What about details of the model?
Also it might be easier for you to first try the installation as explained in the repo README and then try out something like the terminal tools:
That way you can figure out if you've got the setup right to recreate basic speech output before going further.
It was hard, but I managed to solve the problem on the second PC with the GPU. It was a problem with PyTorch. I can now execute the code above and compute_statistics.py.
I only have a bug when I try to execute train_tacotron.py, because of this message:
(TF2) C:\Users\Voice-trainner\MozillaTTS\TTS>python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json
2021-02-16 13:52:20.737504: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Using CUDA: False
Number of GPUs: 0
Mixed precision mode is ON
Git Hash: e9e0784
Experiment folder: Models/LJSpeech/ljspeech-ddc-February-16-2021_01+52PM-e9e0784
Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:scale_stats.npy
| > hop_length:256
| > win_length:1024
| > Found 13100 files in C:\Users\Voice-trainner\MozillaTTS\TTS\LJSpeech-1.1
Using model: Tacotron2
! Run is removed from Models/LJSpeech/ljspeech-ddc-February-16-2021_01+52PM-e9e0784
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in
main(args)
File "TTS/bin/train_tacotron.py", line 524, in main
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
AttributeError: module 'torch.cuda' has no attribute 'amp'
I have PyTorch 1.4. I searched the forum and found the same error reported with PyTorch 1.6 and 1.7.
Glad you've made progress.
It looks like you now know your error's ultimate source: you've got the wrong version of torch and must've done something wrong with the installation if you've ended up with 1.4 (as you'll see in requirements.txt, it's >1.5).
Oh, and the other thing is that the code isn't using your GPU (see the output: it says Using CUDA: False, Number of GPUs: 0).
So you need to sort that out too
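A quick sanity check you can run in the same environment you launch training from (assuming a standard PyTorch install; torch.cuda.amp only arrived around the 1.5/1.6 releases, which is why 1.4 hits that AttributeError):

import torch

print(torch.__version__)            # should satisfy requirements.txt, i.e. newer than 1.5
print(hasattr(torch.cuda, "amp"))   # False on older versions -> the AttributeError you saw
print(torch.cuda.is_available())    # must be True for training to use the GPU
print(torch.version.cuda)           # CUDA version PyTorch was built with (None for CPU-only builds)

If torch.cuda.is_available() comes back False even though the card is there, the usual suspects on Windows are a CPU-only PyTorch build or a mismatch between the CUDA version PyTorch was built against and your installed CUDA/driver.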
Hi again. I don't understand why TTS doesn't recognize the GPU, but… it's training =)
I would like to know how I can get the following files when training is finished:
vocoder_model.pth.tar
scale_stats_vocoder.npy
Thanks again
Regarding the GPU: when I tried torch.device and map_location with 'cuda:0' instead of 'cpu', the message was: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. It's weird, because TTS recognizes the CUDA DLL.
I would like your help, because otherwise training will take too long.
What is your platform/OS and how did you install PyTorch?