As both of us, @mrthorstenm and I, are not on Slack and you seem to be interested in Mycroft anyway, would you mind joining us in the Mycroft-chat channel ~language-de?
Alright, just logged in…
I followed @erogol's appeal here.
First tacotron1 sample (training step 100k) is available here:
Soundcloud Link
I know, many more training steps are required, but I said I'd keep this thread up to date.
Hello @mrthorstenm, I've been following your work for a while, a big thanks for your efforts! Unfortunately, the SoundCloud link seems to be broken though. Also, can we hope for a new release of your raw data any time soon? I did some promising experiments with your January data and would like to repeat them with the current version.
Edit: Soundcloud link works now, perhaps just a temporary outage.
Thanks for your nice words :-).
First I want to wait for the training to finish and then release the final Tacotron 1 and 2 models.
Just want to double-check, as I found this recommendation to enable trim_silence: do you think that is necessary for good WaveRNN results?
If you trim silences, that might reduce the model's performance on silences and pauses between words.
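For reference, silence trimming in the preprocessing is essentially a threshold-based trim of leading and trailing silence. A minimal sketch with librosa (assuming 22.05 kHz mono wavs and a 60 dB threshold, which may differ from the actual training config):

```python
import librosa

# Load one training wav (path and sample rate are placeholders).
wav, sr = librosa.load("wavs/phrase_0001.wav", sr=22050)

# Drop leading/trailing parts quieter than top_db below the peak.
# Pauses *inside* the phrase are kept; only the edges are cut.
trimmed, (start, end) = librosa.effects.trim(wav, top_db=60)

print(f"kept samples {start}-{end} of {len(wav)}")
```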
Short update.
We (@dkreutz, @repodiac, @baconator, gras64 from the Mycroft community, and I) are currently trying different parameter settings for Mozilla-based Tacotron 1 and 2 training. Nothing ready to show right now, but we're on the way.
In parallel, @repodiac is trying to improve German pronunciation with kind support and know-how from @nmstoker.
Thanks for all your amazing support. Hopefully we can provide a free-to-use German voice model of acceptable quality.
Could anybody help us interpret the results of the dataset analysis from the CheckDatasetSNR/AnalyzeDataset notebooks?
I'm unable to interpret the graph output of the SNR notebook.
No. of wav files: 20710
Average SNR of the dataset: 30.875177400417556
If it's helpful, I can provide sample wavs that were marked as good and bad. Maybe someone can explain why each file is classified that way.
Any recommendations on what to optimize, or which files to remove (based on audio length), before running training?
Might it be a problem having a dataset with 20k phrases but only 264 phrases longer than 100 chars?
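For reference, the per-clip statistics behind these questions can be gathered with a few lines of Python. This is a minimal sketch, assuming an LJSpeech-style metadata.csv with "id|text" rows and a wavs/ folder (file names, column layout, and thresholds are placeholders):

```python
import csv
from pathlib import Path

import soundfile as sf

DATASET = Path("thorsten-dataset")  # placeholder path

rows = []
with open(DATASET / "metadata.csv", encoding="utf-8") as f:
    for rec in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        info = sf.info(DATASET / "wavs" / f"{rec[0]}.wav")
        duration = info.frames / info.samplerate
        rows.append((rec[0], len(rec[1]), duration))

long_texts = sum(1 for _, chars, _ in rows if chars > 100)
print(f"{len(rows)} clips, {long_texts} with more than 100 characters")

# Very short or very long clips are the usual candidates for review
# before training; the thresholds here are only examples.
outliers = [r for r in rows if r[2] < 1.0 or r[2] > 12.0]
print(f"{len(outliers)} clips shorter than 1 s or longer than 12 s")
```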
Here’s another regular update on our progress.
@dkreutz and @baconator are still trying out different training configurations, while @repodiac is becoming an expert on German phonemes and espeak(-ng), providing great support.
Since I am dissatisfied with the “text length vs STD” graphic from the “AnalyzeDataset” notebook, I returned to the microphone to record more phrases (around 3k) with a character length between 100 and 180. This will take some weeks.
Thanks so far, and stay tuned for further updates.
Hi Thorsten, thanks for your updates.
I've been training a Belgian Dutch multi-speaker model for 170k iterations until it stopped improving. The model is okay for some sentences but not consistent enough to be usable, so inspired by you I'm considering creating a single-speaker database as well.
I have some questions:
- How long does it take to record one hour of audio? What is the longest recording session you’ve had so far?
- Why does the result of your model deviate so strongly from the sample provided here? The quality of your dataset is similar to that of LJSpeech, or am I mistaken?
Finally, do you have any other tips to share regarding the creation of the dataset?
Welcome @rdh.
I'm happy that I could inspire you to record your own dataset.
Recording time for one hour of pure audio isn't easy to estimate since it depends on several factors:
average phrase length, the recording method (which tool), and post-processing tasks (optimizing, silence trimming, …).
I recorded (and still record) phrases between 2 and 180 characters. The average length is currently 50 characters, and I record about 14 characters per second of audio. My longest recording sessions were around 60 minutes without a break, but most are 30 minutes at most.
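As a rough back-of-the-envelope estimate based on those numbers (the per-phrase overhead below is purely a guess and will vary a lot with your tooling):

```python
chars_per_second = 14        # spoken rate from the numbers above
avg_phrase_chars = 50
overhead_per_phrase = 4.0    # guessed seconds for reading ahead, retakes, clicking "next"

phrases_per_audio_hour = chars_per_second * 3600 / avg_phrase_chars
wall_clock_hours = phrases_per_audio_hour * (
    avg_phrase_chars / chars_per_second + overhead_per_phrase
) / 3600

print(f"~{phrases_per_audio_hour:.0f} phrases per hour of pure audio")
print(f"~{wall_clock_hours:.1f} hours at the microphone for that one hour of audio")
```

With these assumptions that works out to roughly a thousand phrases and about two hours at the microphone per hour of pure audio, which is why there is no single answer.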
Additional helpful information can be found here:
Hey @mrthorstenm,
Thanks so much for your work and the interesting discussion here!
Please let me know if you need some GPU help training Tacotron-2. I would be happy to help out by sharing pre-trained models and experimenting with this repo: https://github.com/Rayhane-mamah/Tacotron-2.
New month - new update.
I've recorded 1.1k new phrases with a length between 100 and 180 characters. New recordings will be done based on needed phonemes, if necessary.
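Finding which phonemes are underrepresented can be done by counting phoneme frequencies over the existing texts. A minimal sketch, assuming the phonemizer package with its espeak backend for German (just one possible tool for this, not necessarily what we use):

```python
from collections import Counter

from phonemizer import phonemize  # assumption: phonemizer + espeak-ng installed

# In practice: read all phrases from metadata.csv; two toy sentences here.
texts = [
    "Das ist ein Beispielsatz.",
    "Noch ein kurzer Satz für die Statistik.",
]

counts = Counter()
for text in texts:
    phonemes = phonemize(text, language="de", backend="espeak", strip=True)
    counts.update(ch for ch in phonemes if not ch.isspace())

# The rarest phonemes are candidates for targeted new recordings.
for phoneme, n in sorted(counts.items(), key=lambda kv: kv[1])[:10]:
    print(phoneme, n)
```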
After recording, I've written a document on the evolution of this dataset, including several graphs from the dataset analysis notebooks. If you're interested, feel free to take a look:
Evolution of thorsten dataset.pdf (507.0 KB)
Currently our group (internally named after a “Lord of the Rings” quote: “The fellowship of the … free German TTS model”) has started new training tests.
And as always, I'll keep you updated.
I wish all of you a nice weekend.
How did you install espeak-ng? I am running Ubuntu 18.04 and have been trying for hours now to no avail – I have to install from source if I want to add custom words, right? I am trying the guide and I keep getting espeak-ng: symbol lookup error: espeak-ng: undefined symbol: espeak_ng_SetVoiceByFile
Hi George,
I’m on Arch, and initially I had installed it from the package here: https://www.archlinux.org/packages/community/x86_64/espeak-ng/ but then when I decided I wanted to try customising certain words, I went ahead with installing it from source.
It’s a while since I did it, but the instructions I followed were those written by Josh Meyer, here: http://jrmeyer.github.io/tts/2016/07/03/How-to-Add-a-Language-to-eSpeak-NG.html but with some reference to the repo itself. As I understand it, the repo has instructions based on what Josh had written up, so they’re pretty similar.
There’s someone else who posted a similar issue here: https://groups.io/g/espeak-ng/message/2637 and they also posted it in the discussion at the bottom of Josh’s instructions page mentioned above. Unfortunately they don’t list any resolution.
I see you also posted an issue in the repo. I doubt this will solve it but there are some useful pointers in this issue: https://github.com/espeak-ng/espeak-ng/issues/662 which may help rule things out.
Ultimately the problem looks like it's connected to a change made back in early 2019 to this file: https://github.com/espeak-ng/espeak-ng/blob/master/src/include/espeak-ng/espeak_ng.h, whereby they added the symbol that's complained about in the error message (espeak_ng_SetVoiceByFile).
As a short-term check I suppose you could try installing from the version of the source just prior to the commit where they added that symbol. That would give reassurance that it was that change at fault (or it would establish that it’s just a coincidence, but I doubt it, given the direct mention of the symbol name).
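One quick way to check which libespeak-ng the system is actually loading is to look the symbol up directly. A small diagnostic sketch, assuming the library's soname is libespeak-ng.so.1 (it may differ on your system):

```python
import ctypes
import ctypes.util

# Resolve the library the dynamic linker would pick up right now.
name = ctypes.util.find_library("espeak-ng") or "libespeak-ng.so.1"
lib = ctypes.CDLL(name)

# If this prints False, the library being loaded predates the addition of
# espeak_ng_SetVoiceByFile, i.e. an older copy (e.g. the distro package in
# /usr/lib) is shadowing the freshly built one in /usr/local/lib.
print(name, hasattr(lib, "espeak_ng_SetVoiceByFile"))
```

If it does print False, pointing LD_LIBRARY_PATH at the newly built library (or removing the older copy) would be one way to confirm the mismatch.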
Kind regards,
Neil
Cheers Neil, thank you so much! You are always so kind and helpful. I spent the entire evening trying to get it to work, without success, but I just got it to build on my Mac and I am actually very happy, because it works.
My problem is that Swedish has a lot of compound words, and I guess espeak hasn't had a Swede work on the rules, so compounds are always mispronounced. Then I saw you discussing dictionaries (an idea I initially wanted to do away with, because I worked on them so much when I did concatenative TTS and I am sick of them), but now it looks like they are the only solution: I really do not want to train a char-based TTS, as I suspect it will perform poorly, and I really want the phonemes for out-of-domain words. I tried adding one word now and it worked. I am also lucky because a few years ago the National Library of Oslo got a pronunciation dictionary, and it is open source, so I get to use it. Now I will get my hands dirty and try to fix espeak-ng on Ubuntu too, so I can train with it.
Hello again @mrthorstenm! Is there any way I could email you about some questions? I’m working on my master thesis about applying prosody controls to Tacotron-2 and I’d love to get in touch and discuss potentially using your dataset for experimentation. Thank you and all the best!
Lately I have been working on improving TTS performance on compound and unseen words, since it is hit or miss, especially because you cannot dictate the stress and it is entirely up to Tacotron what it learns as a linguistic feature. One of the problems I had was that, in short sentences with unseen words (1 or 2 words), the stopnet sometimes tripped. That was also the case with compound words. I found that incorporating a pronunciation lexicon improves pronunciation massively and helps with the stopnet. My guess is that a large pronunciation lexicon covering a big portion of the words is consistent in its phonemic transcriptions, so the TTS has an easier time when trained on those phoneme sequences, whereas a rule-based guess may produce different phonemes for compound words (because it has not seen them) and make the model trip.
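To illustrate the lexicon idea (a minimal sketch, not the exact setup described above: the phonemizer package is assumed as the rule-based fallback, and the Swedish lexicon entries are made up for illustration):

```python
from phonemizer import phonemize  # assumed fallback G2P (espeak backend)

# Hand-curated lexicon entries override the rule-based guess for tricky
# compounds; the transcriptions below are illustrative only.
LEXICON = {
    "sjukhus": "ˈɧʉːkˌhʉːs",
    "barnvagn": "ˈbɑːɳˌvaŋn",
}

def to_phonemes(word: str, language: str = "sv") -> str:
    """Prefer the lexicon entry; fall back to rule-based G2P for unseen words."""
    return LEXICON.get(word.lower()) or phonemize(
        word, language=language, backend="espeak", strip=True
    )

print(" ".join(to_phonemes(w) for w in "ett nytt sjukhus byggs".split()))
```

The consistency point maps onto this directly: every occurrence of a lexicon word gets the same phoneme string, so the model never sees conflicting transcriptions for it.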
Do you have an example TTS audio snippet of your voice? It would be nice to know how Mozilla TTS works for the German language. I am new to Mozilla TTS and currently exploring the status quo.
Best regards from Frankfurt