Final audios:
(feature-23 is a tongue twister) 47k.zip (1,0 MB)
Experiment with new LPCNet model:
real speech.wav = audio from the training set
old lpcnet model.wav = generated with the old (male) model using the real features of real speech.wav
from features.wav = the old LPCNet model fine-tuned on the new female voice; audio generated from real speech features. 600k steps with 14h of voice. test.zip (1,1 MB)
It was a surprise for me to see the male-voice model generate a female voice.
Now about training speed:
My first model took 3h/epoch with 50h of data using a V100 (trained for 10 epochs).
Now the new female model with 14h of speech takes 30 min/epoch.
With a good explanation, yes; otherwise I think it may confuse people.
No, instead of 80-d mels we just need 20-d features, which in this case are the "cepstral features"; the LPC features are computed from the cepstral features we feed.
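If you want to check that the features you extract really are 20-d, assuming dump_data writes raw float32 frames of 20 values (80 bytes per frame; the file name is just an example):

```sh
# if the file really holds 20-d float32 frames, its size divides evenly by 80
size=$(stat -c%s feats.f32)   # GNU stat: print file size in bytes
echo "frames: $((size / 80)), remainder: $((size % 80))"   # remainder should be 0
```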
Quick summary (a rough command sketch follows this list):
1. We need to train LPCNet on raw PCM: 16 kHz, 16-bit, mono, without a header, with all the audios concatenated into a single file; there are scripts to do this in the fork.
2. We compile dump_data to extract the features from the audio.
3. Now we train LPCNet with the extracted features.
4. Now that the training is complete, we recompile dump_data with taco=1.
5. Extract the features we need for taco using the compiled dump_data, just like step 2.
6. Now, with the taco fork, we preprocess the dataset.
7. Now we replace the audio folder generated by the preprocess step with the extracted features.
8. Adapt your feeder, if it is broken, to match the names of the features.
9. Train tacotron as usual.
10. To predict, we reshape the features from taco and save them to a file.
11. With the dump_lpcnet tool and the name of the trained model, we extract the network weights into two files, nnet_data.c and nnet_data.h.
12. Move them into LPCNet's src and run "make test_lpcnet taco=1".
13. To the compiled test_lpcnet we feed the name of the file predicted with tacotron and the output name to save the raw PCM.
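To make the list concrete, here is a minimal shell sketch of both sides of the pipeline. The paths and file names (/mnt/pcms, all_audio.s16, lpcnet_model.h5, predicted_features.f32) are placeholders of mine, and the exact script locations may differ in the forks, so treat it as a sketch rather than a recipe:

```sh
## LPCNet side (steps 1-5)
cat /mnt/pcms/*.s16 > /mnt/all_audio.s16      # step 1: one headerless 16 kHz 16-bit mono file

cd LPCNet/src
make clean && make dump_data                  # step 2: build the feature extractor
./dump_data -train /mnt/all_audio.s16 features.f32 data.u8
./train_lpcnet.py features.f32 data.u8        # step 3: train the vocoder

make clean && make dump_data taco=1           # step 4: rebuild for the 20-d taco features
# step 5: run a per-file "dump_data -test" loop (an example appears later in the thread)

## Synthesis side (steps 10-13)
./dump_lpcnet.py lpcnet_model.h5              # step 11: writes nnet_data.c and nnet_data.h
make clean && make test_lpcnet taco=1         # step 12: build the synthesizer
./test_lpcnet predicted_features.f32 out.s16  # step 13: out.s16 is the raw PCM result
```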
It looks hard, but once you get your hands on it, you will understand.
I would like to mention that we don't need aligned text; I could have fed all 100h to train LPCNet, but at the time of my training I didn't notice or think about it.
Ohh, I thought you used TTS. I was about to ask for a PR with LPCNet, but I guess it is harder than expected. Still, it's great that you shared your progress; at least it shows that it is possible to use LPCNet. I guess I can try it after I work through my TODO queue for TTS.
That's why I ended up using the Tacotron2 repo. Honestly, reading others' issues I thought I was just wasting my time; now that I know how it works, it was totally worth the effort to try different experiments, including trying to adapt it for TTS.
For my use case I need to deploy on low-compute devices, so maybe TTS+LPCNet is a good combination. The roadblock is that it is PyTorch, and I need to compile with TensorFlow to reuse it from DS and a TTS model.
Right now I'm able to compile for Windows and use it from C# to compute the features, but still not from text, due to a missing kernel when running the tacotron model that I'm still tracking down.
I'm sharing this also hoping to see others' experiments, hopefully with TTS.
@carlfm01 Could you do me the favor of explaining a bit more about the process of using LPCNet + Tacotron 2? I would really appreciate it. I would also like to know whether this repository is the implementation of those two repositories, and whether I can get the results in Spanish that you obtained with this repository: https://github.com/carlfm01/LPCTron
Hello @manuel3265. Please try to keep the thread in English so others can understand; if you want to continue in Spanish, I think @nukeador can help us move to Common Voice Spanish or suggest another alternative.
Please read this comment:
No, the correct forks are both pinned in that comment.
It's hard to guess what more I can say, or where you are stuck. To help you properly, please share what you have tried.
Thanks @carlfm01 for your quick response. I'll keep the thread in English. Actually, I have around 100 h of data in Spanish (Latin America); I can't share it, it is private data, sorry. But I want to train a model from scratch. At the moment I have started with the first step of your Quick summary.
I hope to continue with the next steps without any problem. May I bother you later if I run into a problem?
Hello @carlfm01, I hope you are well. I want to ask you the following questions, and I hope you can answer them: How long did it take to train LPCNet? How many epochs? What was the loss at the end of training?
How do I carry out these steps in tacotron?
I would really appreciate it if you could answer them. Thank you.
About 15 epochs, but usually I test with the predicted features and, if it is not the desired result, I keep training (or with real features early in training).
That depends on your dataset: for my 50h, about 2.4; for 3h, around 3.2.
Just like the LJSpeech dataset; no changes from the usual extraction.
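For reference, assuming the fork keeps the upstream Tacotron-2 entry points (preprocess.py and train.py; verify this on the spanish branch), that step is roughly:

```sh
python preprocess.py                 # LJSpeech-style preprocessing into /training_data
python train.py --model='Tacotron'   # then train tacotron as usual
```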
We need the 20-d features to train tacotron; please make sure you always run "make clean".
The script for this step looks like:

```sh
mkdir -p /mnt/training_data/audio/
for i in /mnt/LJSpeech-1.1/pcms/*.s16
do
  /data/home/neoxz/LPCNet/dump_data -test "$i" "/mnt/training_data/audio/$(basename "$i" | cut -d. -f1).npy"
done
```
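In case someone needs the *.s16 inputs for that loop: they are just headerless 16 kHz, 16-bit mono PCM. Assuming sox is available, a conversion sketch (the directory names are examples):

```sh
# convert LJSpeech-style wavs to raw signed 16-bit little-endian PCM at 16 kHz
mkdir -p /mnt/LJSpeech-1.1/pcms
for w in /mnt/LJSpeech-1.1/wavs/*.wav; do
  sox "$w" -r 16000 -c 1 -b 16 -e signed-integer -t raw \
    "/mnt/LJSpeech-1.1/pcms/$(basename "$w" .wav).s16"
done
```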
the preprocess step creates /audio, /mels, and other directories that I don't remember, under /training_data
Notice how I replace the /audio created by preprocess with the /audio created by the dump_data script.
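Concretely, the replacement can be done by moving the preprocess output aside before running the loop above (the backup directory name is mine):

```sh
mv /mnt/training_data/audio /mnt/training_data/audio_orig   # keep the preprocess output around
mkdir -p /mnt/training_data/audio                           # the dump_data loop above refills it
```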
I was reading the quick summary and noticed that the link to the fork is broken; it does not link to the correct branch. For the taco2 fork you need to use the spanish branch (changing the alphabet to yours).