Final results LPCNet + Tacotron2 (Spanish)

Hello, just to share my results. I'm stopping at 47k steps for Tacotron 2:

The gaps seem normal for my data and are not affecting the performance.

As a reference for others:

Final audios:
(feature-23 is a tongue twister)
47k.zip (1,0 MB)

Experiment with new LPCNet model:

  • real speech.wav = audio from the training set
  • old lpcnet model.wav = generated from the real features of real speech.wav using the old (male) model
  • from features.wav = the old LPCNet model fine-tuned on the new female voice, with audio generated from real speech features; 600k steps with 14 h of voice.
    test.zip (1,1 MB)

It was a surprise for me to see the male-voice model generate a female voice.

Now about training speed:
My first model took 3 h/epoch with 50 h of data using a V100 (trained for 10 epochs).

Now the new female model with 14 h of speech takes 30 min per epoch.

Epoch 1 333333/333333 [==============================] - 1858s 6ms/step - loss: 3.2461
Epoch 2  9536/333333 [..............................] - ETA: 29:54 - loss: 3.2475
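(As a sanity check: 333,333 steps at ~6 ms/step is roughly 2,000 s, consistent with the 1858 s ≈ 31 min epoch shown above.)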

It uses CuDNNGRU, so it is really fast to train. Yes, the V100 is pretty fast, but most of the speed comes from the optimized cuDNN kernels.

Of course I'll share the models, as always.


Results sound quite good. If you like, we can link your model on our model page.

BTW, how do you use LPCNet? Do you feed LPCNet features to the network, or mel spectrograms? Do you have a branch to check out?

With a good explanation, yes; otherwise I think it may confuse people.

No, instead of 80-d mels we just need 20-d features, which in this case are the cepstral features; the LPC features are computed from the cepstral features we feed in.

Quick summary:

  1. We need to train LPCNet using raw 16 kHz, 16-bit, mono PCM without a header, with all the audio concatenated into a single file; there are scripts to do this in the fork.
  2. We compile dump_data to extract the features out of the audio.
  3. Now we train LPCNet with the extracted features.
  4. Once training is complete, we compile dump_data with taco=1.
  5. Extract the features we need for taco using the compiled dump_data, just like step 2.
  6. Now, with the taco fork, we preprocess the dataset.
  7. Now we replace the audio folder generated by the preprocess step with the extracted features.
  8. Adapt your feeder, if it is broken, to match the names of the features.
  9. Train tacotron as usual.
  10. To predict, we reshape the features from taco and save them to a file.
  11. With the dump_lpcnet tool and the name of the trained model, we extract the network weights into two files, nnet_data.c and nnet_data.h.
  12. Move them into LPCNet's src and do make test_lpcnet taco=1.
  13. With the compiled test_lpcnet, we pass the name of the file predicted with tacotron and the output name to save the raw PCM (see the command sketch after this list).
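Putting it together, here is a condensed sketch of the pipeline as shell commands. The paths and file names (wavs/, all_audio.s16, my_model.h5, predicted_features.f32) are illustrative assumptions, the -train and train_lpcnet.py invocations follow the upstream LPCNet README, and taco=1 is the fork's Makefile flag described above:

# 1. Concatenate the audio into one headerless 16 kHz 16-bit mono PCM file
for f in wavs/*.wav; do
  sox "$f" -t raw -r 16000 -e signed-integer -b 16 -c 1 - >> all_audio.s16
done

# 2-3. Build dump_data, extract training features, train LPCNet
make clean && make dump_data
./dump_data -train all_audio.s16 features.f32 data.u8
./src/train_lpcnet.py features.f32 data.u8

# 4-5. Rebuild dump_data with taco=1 and extract the 20-d features for taco
make clean && make dump_data taco=1
./dump_data -test input.s16 output.npy

# 11-13. After taco training: dump the weights, build test_lpcnet, synthesize
./dump_lpcnet.py my_model.h5            # writes nnet_data.c and nnet_data.h
mv nnet_data.c nnet_data.h src/
make clean && make test_lpcnet taco=1
./test_lpcnet predicted_features.f32 out.s16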

It looks hard, but once you get your hands on it, you will understand.

Forks used for my trainings:

The readme is easy to follow.

And for taco2: (spanish branch)

It needs cleanup.

The most important changes for taco are the hparams and:

This script will help you to understand:

I would like to mention that we don't need aligned text, so I could have fed the full 100 h to train LPCNet; at the time of my training I did not notice or think about it.

Ohh, I thought you used TTS. I was about to ask for a PR with LPCNet, but I guess it is harder than expected. Still, great that you shared your progress; at least it shows that it is possible to use LPCNet. I guess I can try it after I work through my TODO queue with TTS.

That's why I ended up using the Tacotron2 repo. Honestly, reading others' issues I thought I was just wasting my time; now that I know how it works, it is totally worth the effort to try different experiments, including trying to adapt it for TTS.

For my use case I need to deploy on low-compute devices, so maybe TTS+LPCNet is a good combination. The roadblock is that TTS is PyTorch, and I need it compiled with TensorFlow so I can reuse it from DS alongside a TTS model.
Right now I'm able to compile for Windows and use it from C# to compute the features, but still not from text, due to a missing kernel when running the tacotron model that I'm still tracking down.

I'm also sharing in the hope of seeing others' experiments, hopefully with TTS.

@carlfm01, could you please explain a bit more about the process of using LPCNet + Tacotron 2? I would really appreciate it. I would also like to know whether this repository is the implementation of those two repositories, and whether I can get the Spanish results you obtained with it: https://github.com/carlfm01/LPCTron

Hello @manuel3265, please try to keep the thread in English so others can understand. If you want to continue in Spanish, I think @nukeador can help us move to the Common Voice Spanish category or suggest another alternative.

Please read this comment:

No, the correct forks are both pinned in the comment.

It's hard to guess what more I can say, or where you are stuck. To help you properly, please share what you have tried.

BTW, sorry for the delay; here are the trained models:

Thanks @carlfm01 for your quick response. I'll keep the thread in English. Actually, I have data in Spanish (Latin America), around 100 h (I can't share it; it is private data, sorry). But I want to train a model from scratch. At the moment I have started with the first step of your quick summary.

I hope to continue with the next steps without any problems. May I bother you later if I run into a problem?

Thanks.

No problem, good luck!

Hello @carlfm01, I hope you are well. I want to ask you the following questions; I hope you can answer them. How long did you train LPCNet? How many epochs? What was the loss at the end of training?

How do I carry out these steps in tacotron?

I would really appreciate it if you could answer them. Thank you.

About 15 epochs, but usually I test with the predicted features, and if the result is not what I want I keep training (or test with real features early in training).

This depends on your dataset: for my 50 h, about 2.4; for 3 h, more like 3.2.

Just like the LJSpeech dataset; no changes from the usual extraction.
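For reference, a minimal sketch of the LJSpeech-style layout the preprocess step expects (names follow the public LJSpeech dataset; your file IDs will differ):

LJSpeech-1.1/
  metadata.csv        # one line per clip: file_id|transcription|normalized transcription
  wavs/
    file_id.wav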

We need the 20-d features to train tacotron; please make sure you always do make clean.
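The compile step before this extraction looks like the following; the taco=1 flag comes from the fork's Makefile, per steps 4 and 5 of the quick summary:

make clean
make dump_data taco=1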

The script for this step looks like:

mkdir -p /mnt/training_data/audio/
for i in /mnt/LJSpeech-1.1/pcms/*.s16
do
  /data/home/neoxz/LPCNet/dump_data -test "$i" /mnt/training_data/audio/$(basename "$i" | cut -d. -f1).npy
done

The preprocess step creates /audios, /mel, and other directories that I don't remember under /training_data.

Notice how I replace the /audio folder created by the preprocess step with the /audio folder created by the dump_data script.
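In other words, step 7 of the quick summary amounts to something like this, using the illustrative paths from the script above:

rm -rf /mnt/training_data/audio         # drop the audio folder written by preprocess
mkdir -p /mnt/training_data/audio
# now run the dump_data loop above so /audio holds the 20-d feature .npy files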

I was reading the quick summary and noticed that the link to the fork is broken; it does not point to the correct branch. For the taco2 fork you need to use the spanish branch (changing the alphabet to yours).

Sorry about that :confused:

@carlfm01 Don't worry, and thank you very much for the information. I wanted to ask you a question: how do I do this part in LPCNet?

I’m sorry but I’m new to this topic.

I executed the following command: ./src/test_lpcnet.py spanish/pcms/test_features.f32 test.s16

Now, what should I do with test.s16? To do the test, should I modify something other than the .h5 file?

Afterwards, I convert the test.s16 file with sox test.s16 -r 16000 -c 1 -b 16 test.wav
Is that right?

Thanks.

  • Extract the .c and .h files with the weights with ./dump_lpcnet.py lpcnet15_384_10_G16_64.h5 (replace with your .h5 file).
  • Do a make clean.
  • Do a make dump_data (without the taco=1).
  • Extract the real features out of the s16 (the headerless PCM) with ./dump_data -test test_input.s16 test_features.f32.
  • Do a make clean again.
  • Do a make test_lpcnet (this is used to generate the s16 from the previous f32 file).
  • With the generated test_lpcnet you can run ./test_lpcnet test_features.f32 test.s16.
    Finally, convert the .s16 file to a file with a header using sox or ffmpeg, like:
ffmpeg -f s16le -ar 16000 -ac 1 -i test.s16 test-out.wav
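If you prefer sox, the equivalent of that ffmpeg line should be the following; note that for headerless input the raw format flags must come before the input file, which is what the earlier sox attempt was missing:

sox -t raw -r 16000 -e signed-integer -b 16 -c 1 test.s16 test-out.wav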

@carlfm01 Is it normal to have this result, considering that training has a loss of 2.5 and is going into the second epoch?

http://www.mediafire.com/file/eevt7vxat3ni5vk/Spanish_train.zip/file

I attach the files: the original and the output of LPCNet.

Original: audios_archivo-1567550220302976.wav

Please upload the zip to the forum; MediaFire pops up a lot of spam.

@carlfm01 I am a new user and I cannot attach the zip file.
I would like to know what you think about the results.

Yes, the training is going into epoch 2.

@carlfm01 As I am a new user, I can only reply to the topic 3 times.

Any further answers I will add by editing this reply.

@carlfm01
Hours of audio, or hours of training?

Of audio, it's about 80 hours. Of training, about 7 hours.

Thanks, I will do it.