Contributing my german voice for tts

nmstoker · February 12, 2020, 9:06pm

And @mrthorstenm I’m hoping the above may be some help to you too

mrthorstenm · February 15, 2020, 8:32am

Thanks @nmstoker. Of course your instructions are helpful for me too.
I provided a cleaned-up dataset to @dkreutz, who optimized the files with regard to random noise and echo. So while I’m recording new sentences, he is processing and analysing the dataset.

Currently training is around step 39k and we have a few questions on interpreting the graphs (based on 20k training step).

Results from dataset analysis:

Should we remove phrases longer than 125 from dataset?

Any ideas on the graphs?

Eval and training alignment graphs

TrainingFigure graph looks “disrupted”. Is this okay?

EvalFigures graph stops before reaching right upper corner. Is this okay?

CheckDatasetSNR (signal-to-noise ratio)

A value of 100 should be best, so the dataset has 5,000 recordings with a great value.

General questions:

As far as I know, we have to start a new training run if we remove files from or add files to the dataset. Or can we modify the model after training is finished?

nmstoker · February 17, 2020, 5:15pm

Should we remove phrases longer than 125 from dataset?

Assuming that those sentences have nothing wrong from a quality/consistency perspective, it might be better to keep them in the dataset and simply let the training code include/remove them based on the settings you use in config.json. This would give you more flexibility and you could easily compare a run that included longer sentences with one that didn’t, to see where the models match your needs best.

You’ll see at the start of training that it outputs details about the max and min length in the config and then shows how many sentences were excluded.
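The length-based filtering described above can be pictured with a small sketch. This is not the actual TTS training code; it only assumes that texts shorter than `min_seq_len` or longer than `max_seq_len` are dropped, with both bounds treated as inclusive. The clip names and sentences are made up, and `split_by_length` is a hypothetical helper.

```python
import json

# Hypothetical mini-dataset of (clip_id, text) pairs; the clip names are made up.
items = [
    ("clip_0001", "Hallo."),
    ("clip_0002", "Guten Morgen, wie geht es dir heute?"),
    ("clip_0003", "x" * 200),  # stands in for an overly long phrase
]

# Only the two length settings from config.json matter for this sketch.
config = json.loads('{"min_seq_len": 6, "max_seq_len": 150}')


def split_by_length(items, min_len, max_len):
    """Return (kept, excluded) according to text length in characters."""
    kept, excluded = [], []
    for clip_id, text in items:
        # Assumption: both bounds are inclusive.
        if min_len <= len(text) <= max_len:
            kept.append((clip_id, text))
        else:
            excluded.append((clip_id, text))
    return kept, excluded


kept, excluded = split_by_length(items, config["min_seq_len"], config["max_seq_len"])
print(f"kept {len(kept)} items, excluded {len(excluded)}")
```

Changing only `max_seq_len` between two runs makes it easy to compare a model trained with the long phrases against one trained without them, while the dataset on disk stays untouched.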

I’m just on a break at work so will need to follow up on your other points later

mrthorstenm · February 18, 2020, 6:14pm

I fail to run compute_embeddings.py with the default ljspeech dataset.
Since @dkreutz seems to get the identical error, I opened an issue on GitHub.

File "compute_embeddings.py", line 76, in <module>
    model = SpeakerEncoder(**c.model)
TypeError: type object argument after ** must be a mapping, not str

All tips are welcome.

github.com/mozilla/TTS

Running compute_embeddings.py fails with "TypeError: type object argument after ** must be a mapping, not str"

opened 06:11PM - 18 Feb 20 UTC

closed 07:45PM - 19 Feb 20 UTC

thorstenMueller

Hello dear community.

Thanks to the great support by @nmstoker i try to run compute_embeddings.py (master branch) on a ljspeech dataset in a venv environment:

> python3 ./compute_embeddings.py __path__/best_model.pth.tar __path__/speaker_encoder/config.json __path__/LJSpeech-1.1 __path__/output

The process fails directly with following error:

```
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:12.5
 | > frame_length_ms:50
 | > ref_level_db:20
 | > num_freq:1025
 | > power:1.5
 | > preemphasis:0.98
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > sound_norm:False
 | > n_fft:2048
 | > hop_length:275
 | > win_length:1100
Traceback (most recent call last):
  File "./compute_embeddings.py", line 76, in <module>
    model = SpeakerEncoder(**c.model)
TypeError: type object argument after ** must be a mapping, not str
```

**dataset and config source**

- LJSpeech dataset: https://keithito.com/LJ-Speech-Dataset/
- Model and config.json (from released models): https://drive.google.com/drive/folders/10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m

Split metadata.csv (even it shouldn't be needed for compute embeddings):

```
shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
```

**config.json:**

```
{
    "github_branch":"* dev",
    "restore_path":"/home/thorsten/___dev/tts/datasets/mozilla-pretrained-ljspeech/best_model.pth.tar",
    "github_branch":"* dev",
    "model": "Tacotron2", // one of the model in models/
    "run_name": "ljspeech-bn",
    "run_description": "tacotron2 basline finetuned with BN prenet",

    // AUDIO PARAMETERS
    "audio":{
        // Audio processing parameters
        "num_mels": 80, // size of the mel spec frame.
        "num_freq": 1025, // number of stft frequency levels. Size of the linear spectogram frame.
        "sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50, // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-lengh in ms.
        "preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "min_level_db": -100, // normalization range
        "ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
        "power": 1.5, // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true, // normalize the spec values in range [0, 1]
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
    },

    // DISTRIBUTED TRAINING
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54321"
    },

    "reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 32, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
    "eval_batch_size":16,
    "r": 7, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
    "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], // ONLY TACOTRON - set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled.
    "loss_masking": true, // enable / disable loss masking against the sequence padding.

    // VALIDATION
    "run_eval": true,
    "test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
    "test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

    // OPTIMIZER
    "grad_clip": 1, // upper limit for gradients for clipping.
    "epochs": 1000, // total number of epochs to train.
    "lr": 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_decay": false, // if true, Noam learning rate decaying is applied through training.
    "wd": 0.000001, // Weight decay weight.
    "warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"

    // TACOTRON PRENET
    "memory_size": -1, // ONLY TACOTRON - size of the memory queue used fro storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
    "prenet_type": "bn", // "original" or "bn".
    "prenet_dropout": false, // enable/disable dropout at prenet.

    // ATTENTION
    "attention_type": "original", // 'original' or 'graves'
    "attention_heads": 5, // number of attention heads (only for 'graves')
    "attention_norm": "sigmoid", // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
    "windowing": false, // Enables attention windowing. Used only in eval mode.
    "use_forward_attn": false, // if it uses forward attention. In general, it aligns faster.
    "forward_attn_mask": false, // Additional masking forcing monotonicity only in eval mode.
    "transition_agent": false, // enable/disable transition agent of forward attention.
    "location_attn": true, // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
    "bidirectional_decoder": false, // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.

    // STOPNET
    "stopnet": true, // Train stopnet predicting the end of synthesis.
    "separate_stopnet": true, // Train stopnet seperately if 'stopnet==true'. It prevents stopnet loss to influence the rest of the model. It causes a better model, but it trains SLOWER.

    // TENSORBOARD and LOGGING
    "print_step": 25, // Number of steps to log traning on console.
    "save_step": 10000, // Number of training steps expected to save traninpg stats and checkpoints.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
    "num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "batch_group_size": 0, //Number of batches to shuffle after bucketing.
    "min_seq_len": 6, // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 150, // DATASET-RELATED: maximum text length

    // PATHS
    "output_path": "/home/thorsten/___dev/tts/datasets/mozilla-pretrained-ljspeech/keep/", // DATASET-RELATED: output path for all training outputs.

    // PHONEMES
    "phoneme_cache_path": "ljspeech_ph_cache", // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronounciation.
    "phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages

    // MULTI-SPEAKER and GST
    "use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
    "style_wav_for_test": null, // path to style wav file to be used in TacotronGST inference.
    "use_gst": false, // TACOTRON ONLY: use global style tokens

    // DATASETS
    "datasets": // List of datasets. They all merged and they get different speaker_ids.
        [
            {
                "name": "ljspeech",
                //"path": "/data/ro/shared/data/keithito/LJSpeech-1.1/",
                "path": "/home/thorsten/___dev/tts/datasets/mozilla-pretrained-ljspeech/LJSpeech-1.1/",
                //"path": "/home/erogol/Data/LJSpeech-1.1",
                "meta_file_train": "metadata_train.csv",
                "meta_file_val": "metadata_val.csv"
            }
        ]
}
```

**General information:**

- Ubuntu 18.04.4 LTS
- Venv Environment
- Python 3.6.9
- Pip3 version 9.0.1

**Output from pip3 list:**

```
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
absl-py (0.9.0)
attrdict (2.0.1)
attrs (19.3.0)
audioread (2.1.8)
bokeh (1.4.0)
cachetools (4.0.0)
certifi (2019.11.28)
cffi (1.14.0)
chardet (3.0.4)
Click (7.0)
clldutils (3.5.0)
colorlog (4.1.0)
csvw (1.7.0)
cycler (0.10.0)
decorator (4.4.1)
Flask (1.1.1)
google-auth (1.11.2)
google-auth-oauthlib (0.4.1)
grpcio (1.27.2)
idna (2.8)
isodate (0.6.0)
itsdangerous (1.1.0)
Jinja2 (2.11.1)
joblib (0.14.1)
kiwisolver (1.1.0)
librosa (0.7.2)
llvmlite (0.31.0)
Markdown (3.2.1)
MarkupSafe (1.1.1)
matplotlib (3.1.3)
numba (0.48.0)
numpy (1.18.1)
oauthlib (3.1.0)
packaging (20.1)
phonemizer (2.1)
Pillow (7.0.0)
pip (9.0.1)
pkg-resources (0.0.0)
protobuf (3.11.3)
pyasn1 (0.4.8)
pyasn1-modules (0.2.8)
pycparser (2.19)
pyparsing (2.4.6)
python-dateutil (2.8.1)
PyYAML (5.3)
regex (2020.1.8)
requests (2.22.0)
requests-oauthlib (1.3.0)
resampy (0.2.2)
rfc3986 (1.3.2)
rsa (4.0)
scikit-learn (0.22.1)
scipy (1.4.1)
segments (2.1.3)
setuptools (45.2.0)
six (1.14.0)
SoundFile (0.10.3.post1)
tabulate (0.8.6)
tensorboard (2.1.0)
tensorboardX (2.0)
torch (1.4.0)
tornado (6.0.3)
tqdm (4.42.1)
tts (1.1)
Unidecode (1.1.1)
uritemplate (3.0.1)
urllib3 (1.25.8)
Werkzeug (1.0.0)
wheel (0.34.2)
```

nmstoker · February 18, 2020, 6:35pm

Hi @mrthorstenm - from what I can see on the GitHub issue, the link to the model in Google Drive that you say you’re using is for one of the TTS models (2nd to last Tacotron2 entry in that table on Released Models page), but actually what you need to use here is the Speaker-Encoder-iter25k model. It’s the one that @sanjaesc shows in the screenshot in their reply a little further up this thread.

Then you should be able to run compute_embeddings.py (or at least we’ll be further along to getting it working for you)
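For readers wondering why the wrong config produces exactly this TypeError, here is a minimal sketch. In the TTS config from the issue, "model" is the plain string "Tacotron2", and a string cannot be unpacked with `**`. `SpeakerEncoderStandIn` is a hypothetical stand-in class, and the keys and values in `encoder_config` are illustrative assumptions, not copied from the real speaker encoder config.

```python
class SpeakerEncoderStandIn:
    """Hypothetical stand-in for the real SpeakerEncoder class."""

    def __init__(self, input_dim, proj_dim):
        self.input_dim = input_dim
        self.proj_dim = proj_dim


# In the TTS config, "model" is just the model name ...
tts_config = {"model": "Tacotron2"}
# ... while keyword unpacking needs a mapping of constructor arguments.
# These keys and values are illustrative assumptions.
encoder_config = {"model": {"input_dim": 40, "proj_dim": 128}}


def build_encoder(config):
    """Build the encoder by unpacking config["model"] as keyword arguments."""
    return SpeakerEncoderStandIn(**config["model"])


try:
    build_encoder(tts_config)
    error_message = None
except TypeError as err:
    # Unpacking a str with ** is rejected before __init__ is even called.
    error_message = str(err)

print(error_message)

encoder = build_encoder(encoder_config)
print(encoder.input_dim, encoder.proj_dim)
```

The exact wording of the message differs between Python versions, but it always states that the argument after `**` must be a mapping.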

mrthorstenm · February 19, 2020, 9:46pm

After chatting with @nmstoker i was able to compute embeddings in the released libri-tts dataset.
I documented my lessons-learned in the github issue and closed it.

mrthorstenm · February 19, 2020, 9:58pm

@dkreutz made a bokeh plot of my ljspeech dataset. Thanks for that.
The two clusters on the left side might result from having recorded most phrases (roughly 12k) with a USB microphone in two different rooms.
The smaller cluster on the right side was recorded with better equipment (incl. a pop filter) and inside a smaller room (random sample). Roughly 3k phrases.

Bokeh overview

Bokeh detail left clusters

Random sample (original voice) 1 from left top (smaller) cluster

Random sample (original voice) 2 from left top (smaller) cluster

Random sample (original voice) 3 from left top (smaller) cluster

Random sample (original voice) 1 from left bottom (bigger) cluster

Random sample (original voice) 2 from left bottom (bigger) cluster

Random sample (original voice) 3 from left bottom (bigger) cluster

Bokeh detail right cluster

Random sample (original voice) 1 from right cluster

Random sample (original voice) 2 from right cluster

Random sample (original voice) 3 from right cluster

Any pro tips on the dataset before running training (again)?

erogol · February 20, 2020, 4:16pm

It is a great job! Thx for keeping up everything in this thread.

The best solution is to record them again in the best format, but of course that is a lot of work. So you could first train the model with the larger cluster only and see how it performs. Then you can add the other clusters and see how the model behaves. If they reduce the performance, you unfortunately need to record them again.

You can also use denoising algorithms or neural models to clean up the broken clips. That might help.

CRAZY IDEA!

You can also consider this problem as multi-speaker TTS and train the TTS model conditioned on these embedding vectors. Then, if the model works fine, you can regenerate the poor clips with the TTS model by providing an embedding vector that matches a healthier recording (like the center of the larger cluster). This is something that might work, but I have never tried it.
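To make "the center of the larger cluster" concrete, here is a small sketch with toy data. It assumes the embeddings are available as plain lists of floats per clip. The clip names and values are invented, real speaker embeddings have many more dimensions, and using cosine similarity to pick the nearest real clip is an assumption rather than something the thread prescribes.

```python
import math

# Toy 3-dimensional "embeddings"; clip names and values are invented.
larger_cluster = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [1.0, 0.0, 0.1],
    "clip_c": [0.8, 0.2, 0.1],
}


def centroid(vectors):
    """Element-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


center = centroid(list(larger_cluster.values()))

# The real clip closest to the center could serve as a reference recording.
reference_clip = max(
    larger_cluster,
    key=lambda name: cosine_similarity(larger_cluster[name], center),
)
print(center, reference_clip)
```

Either the centroid itself or the embedding of the nearest real clip could then be passed as the conditioning vector when regenerating the poorer clips.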

dkreutz · February 20, 2020, 6:30pm

Hi erogol,
Dominik here, I am the one working in the background with @mrthorstenm. Thanks for looking into this.

I already thought of applying RNNoise to the audio clips, but have to figure out a good workchain yet (probably with sox and a ladspa plugin?).

And you confirmed my idea to handle this as a multi-speaker “problem”. We will definitely follow this idea, and come back to you with many questions on how to do it.

mrthorstenm · February 26, 2020, 9:57am

Just a short update.

I just made recording number 18,000, which equates to 17 hours of audio material.
@dkreutz optimized the wav files and is currently training with a multi-speaker setup, as suggested as “crazy idea” by @erogol.

When we are satisfied with the quality the new and optimized dataset will be published/updated on google drive for use by the community.

dkreutz · February 26, 2020, 8:15pm

Thanks to @mrthorstenm there are now some 3,000 more audio clips. I need some advice on how to extend a dataset that is already being used in training.
We are using LJSpeech data format: Do I simply copy the additional audio files to the folder “wavs” and paste the corresponding metadata at the end of metadata_(train|val).csv and then continue training with latest checkpoint?

nmstoker · February 26, 2020, 10:32pm

I had always meant to try that and I understood it to be possible but I must admit I’ve never actually tried it.

The key thing is whether the initial caching of phonemes gets done when the fine tuning option is selected. Am AFK right now but should be fairly easy to see in the code for train.py. If that did happen then the steps you mention sound like they’d work.

dkreutz · February 27, 2020, 1:44pm

Phoneme caching is a good point - haven’t thought of that!

Looking at datasets/TTSDataset.py I understand that phoneme file is automagically generated for a wav-file if it does not exist. Looking at my phoneme-cache folder confirms this as there are .npy files from different dates when I experimented with different datasets.

dkreutz · February 27, 2020, 6:45pm

So here we go: added new wav-files and appended entries to metadata-train/val.csv. Then started training again with --continue_path option.
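The extension step can be sketched roughly as follows. It assumes the usual LJSpeech layout (a `wavs` folder plus pipe-separated metadata rows of id, transcription and normalized transcription) and an existing metadata file that ends with a newline. `append_entries` and all clip names are hypothetical.

```python
import tempfile
from pathlib import Path


def append_entries(dataset_dir, meta_name, entries):
    """Append (clip_id, text) rows to an LJSpeech-style metadata file.

    Rows whose wav file is missing from wavs/ are skipped and returned.
    """
    dataset_dir = Path(dataset_dir)
    missing = []
    with open(dataset_dir / meta_name, "a", encoding="utf-8") as meta:
        for clip_id, text in entries:
            if not (dataset_dir / "wavs" / f"{clip_id}.wav").is_file():
                missing.append(clip_id)
                continue
            # LJSpeech rows are "id|transcription|normalized transcription".
            meta.write(f"{clip_id}|{text}|{text}\n")
    return missing


# Demo in a throwaway directory with made-up clip names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "wavs").mkdir()
    (root / "wavs" / "new_0001.wav").write_bytes(b"")  # placeholder, not real audio
    (root / "metadata_train.csv").write_text(
        "old_0001|Hallo Welt.|Hallo Welt.\n", encoding="utf-8"
    )

    missing = append_entries(
        root,
        "metadata_train.csv",
        [("new_0001", "Guten Tag."), ("new_0002", "Fehlt.")],
    )
    rows = (root / "metadata_train.csv").read_text(encoding="utf-8").splitlines()

print(rows, missing)
```

Skipping rows without a wav file keeps the metadata and the audio folder in sync, so the instance count reported at training start should match the number of rows.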

No errors so far. Startup message “Number of instances” sums up correctly to the new total of the training set.

Phoneme cache folder has new files where names match with wav-files that were added.
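The manual check of the cache folder could also be scripted. The sketch below lists wav files that have no matching cache file yet. The `_phoneme.npy` suffix is an assumption about the cache naming scheme and should be adjusted to whatever the cache folder actually contains. `clips_without_cache` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def clips_without_cache(wav_dir, cache_dir, cache_suffix="_phoneme.npy"):
    """Return wav stems that have no matching file in the phoneme cache.

    cache_suffix is an assumption about how cache files are named.
    """
    cached = {path.name for path in Path(cache_dir).iterdir()}
    return sorted(
        wav.stem
        for wav in Path(wav_dir).glob("*.wav")
        if wav.stem + cache_suffix not in cached
    )


# Demo with empty placeholder files and made-up clip names.
with tempfile.TemporaryDirectory() as tmp:
    wav_dir = Path(tmp) / "wavs"
    cache_dir = Path(tmp) / "phoneme_cache"
    wav_dir.mkdir()
    cache_dir.mkdir()
    for name in ("old_0001", "new_0001", "new_0002"):
        (wav_dir / f"{name}.wav").write_bytes(b"")
    for name in ("old_0001", "new_0001"):
        (cache_dir / f"{name}_phoneme.npy").write_bytes(b"")

    uncached = clips_without_cache(wav_dir, cache_dir)

print(uncached)
```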

dkreutz · February 28, 2020, 10:33am

I have now trained approx. 7k steps/28 epochs with the extended data set. Alignment slowly improves, but loss is increasing again (no new best_model.pth.tar since extending the data set).

Is this a reason to worry, should I stop training?

Btw: what is the difference between train.py parameters --continue_path and --restore_path? I have used “continue”, should I try “restore” instead?

erogol · February 28, 2020, 11:57am

Let it train, no worries. Decoder loss goes down.

--continue_path continues the training in the given folder, using the same folder as the output path.

--restore_path restores the model but handles it as a new training run.

dkreutz · March 1, 2020, 10:18am

Training has reached 50k steps - time for another update and questions…

There was definitely an impact by adding audio files for (only) one of the speakers at 23k steps:

- Loss values slowly but steadily increased
- no more updates to best_model (because loss avg did not improve?)
- StepTime increased
- memory consumption increased (from 11GB to now 18GB)

The audio example quality matches the ones from a previous run, but there are attention problems with longer sentences and german Umlaut phonemes: ä, ö, ü. I did not care too much about data-cleaner and symbols.py. Some training phrases contain “foreign words” like “olé” which don’t exist in german character set. Probably that is the reason for latter problem?
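One way to find such "foreign" characters before training is to scan the transcriptions against the intended symbol set. The sketch below is only an illustration. `ALLOWED` is an assumed character set (basic Latin letters, German umlauts, eszett and some punctuation), the real set lives in symbols.py and may differ, and the example phrases are made up.

```python
from collections import Counter

# Assumed symbol set; the real one is defined in symbols.py and may differ.
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "äöüÄÖÜß"
    " .,!?;:'-"
)


def unexpected_characters(texts, allowed=ALLOWED):
    """Count every character that is not part of the allowed symbol set."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch not in allowed)
    return counts


# Made-up example phrases.
phrases = ["Schöne Grüße aus Köln!", "Das Publikum rief olé, olé."]
found = unexpected_characters(phrases)
print(found)
```

Characters reported this way can then either be added to the symbol set or replaced in the text cleaner, whichever fits the dataset better.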

Turned out that I made a dumb error while preprocessing the additional wav-files and they were all messed up with a wrong sampling rate. Note to myself: always listen to the audios before you start training for several days…
After discussion with @mrthorstenm we will tackle the data-cleaner/symbols issue at the same time and start training from the beginning…
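A quick automated check can catch this kind of mistake in addition to listening to the clips. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM wav files only, and compares each file against the 22050 Hz `sample_rate` from the config posted earlier in the thread. `wrong_sample_rates` and `write_silence` are hypothetical helpers, and the demo files contain only silence.

```python
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 22050  # "sample_rate" from the posted config.json


def wrong_sample_rates(wav_dir, expected=EXPECTED_RATE):
    """Map file name to sample rate for every wav that deviates from expected."""
    deviating = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected:
            deviating[path.name] = rate
    return deviating


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit wav file of silence (demo helper)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    write_silence(Path(tmp) / "good.wav", 22050)
    write_silence(Path(tmp) / "bad.wav", 44100)
    report = wrong_sample_rates(tmp)

print(report)
```

Note that a header check only finds files whose declared rate is wrong. A clip that was resampled incorrectly but carries the expected rate in its header would still need a listening test.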

dkreutz · March 2, 2020, 3:29pm

Fixed the wav-files and sorted out the data-cleaner/phoneme issue with help of @erogol. Fresh training session just started - I will report back when we see first results (or problems)…

mrthorstenm · March 5, 2020, 8:31pm

Short update for statistic fans:

Phrases recorded: 18,036
Recorded audio length: 17 hours
Average sentence length: 47 chars
Chars per second (avg): 13.5
Sentences with question mark: 1,893
Sentences with exclamation mark: 1,462

The recordings are slower than my everyday speech, but in return they are clear and without any swallowed characters (hopefully).
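As a quick plausibility check, the statistics above can be cross-checked against each other: phrases times average length gives the total character count, and dividing by the speaking rate gives the audio duration. The variable names are only for illustration.

```python
phrases = 18036
avg_chars_per_phrase = 47
chars_per_second = 13.5

# Total characters spoken, then seconds of audio, then hours.
total_chars = phrases * avg_chars_per_phrase
estimated_hours = total_chars / chars_per_second / 3600

print(f"{total_chars} characters, about {estimated_hours:.1f} hours")
```

The estimate lands between 17 and 18 hours, which agrees with the reported 17 hours given that the averages are rounded.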

3,700 phrases remain to be recorded. After that I will finish my recordings and upload/update the complete dataset for community use.

dkreutz · March 9, 2020, 7:04pm

We reached epoch 340 / step 112,000 - time for an update.
Diagrams:

You can clearly see where the gradual training with r=3 kicked in at 50k steps. Since then the loss values have been slowly increasing again.
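For reference, this is how a gradual training schedule of the form [first_step, r, batch_size] can be read. The sketch assumes this run uses the same schedule as the config.json posted earlier in the thread. `schedule_at` is a hypothetical helper, not the function used by train.py.

```python
# [first_step, r, batch_size] entries as in the config.json posted earlier.
GRADUAL_TRAINING = [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 16],
    [290000, 1, 32],
]


def schedule_at(step, schedule=GRADUAL_TRAINING):
    """Return (r, batch_size) of the last entry whose first_step is <= step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for first_step, new_r, new_batch_size in schedule:
        if step >= first_step:
            r, batch_size = new_r, new_batch_size
    return r, batch_size


for step in (0, 49999, 50000, 112000):
    print(step, schedule_at(step))
```

Under that assumption, r drops from 5 to 3 and the batch size from 64 to 32 at step 50,000, and the next change would follow at step 130,000.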

The eval alignment was a (more or less) straight diagonal in between, but now it again has that gap in the upper right.
The Train alignment looks better though:

Here are some audio examples: Audio-samples-step112k.zip (806.4 KB)
Overall it starts to sound good. There are some problems with the intonation of the German umlaut vowels; we probably need to enhance the dataset with more examples of those.

Any comments, any reasons to worry?
