Before I start Tacotron2 training on the complete dataset, another question to the community regarding vocoders: besides the default Griffin-Lim (from librosa?), I see there are several other options that may yield even better quality, such as WaveGlow, WaveRNN, MelGAN, etc.
Which of the vocoder options would you choose today (April 2020)?
Is there anything I should consider for this when configuring Taco2, e.g. certain parameters, or should I check out a certain branch/tag?
@mrthorstenm: I have now checked out espeak-ng and its dictionary file for German:
Question: do you guys plan to use this sort of file, or which pronunciation patterns are you working with otherwise?
So, to align with you guys, and assuming you will focus on espeak-ng patterns, I would suggest proceeding as follows:
1 - the file currently holds around 1000 words in total, so in general I would try to enlarge the word list significantly
2 - I would try to find some basic heuristics where automated creation of pronunciation patterns works out of the box OR comes close to correct and needs only minor correction by humans
3 - I would do this in a loop: extend the dictionary, find heuristics, test, extend again, check heuristics, etc.
4 - if this does not work, OR (“and-or”) in addition, I could write a simple web app where, similar to Common Voice, one gets an espeak-ng sample played aloud in the browser (shouldn’t be that difficult, since espeak-ng can create wav files) together with the current “educated guess” of the pronunciation pattern, and then simply corrects it if necessary. Such an app could be made public or just distributed to whoever would like to help extend the dictionary.
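To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch of the kind of heuristic I have in mind. It is only an illustration under my own assumptions: the rule table, the ASCII phoneme symbols, and the functions `guess_pattern` and `extend_dictionary` are hypothetical and are NOT espeak-ng's actual notation or rule set. It applies greedy longest-match grapheme rules and flags every word containing a letter it has no confident rule for, so a human only needs to look at those.

```python
# Hypothetical toy grapheme-to-phoneme heuristic for German words.
# The symbols below are made up for illustration; they are NOT espeak-ng notation.
# Vowel length and stress are not modeled at all, which is one reason
# human review stays necessary.
RULES = sorted(
    [
        ("sch", "S"),
        ("ch", "x"),
        ("ei", "aI"),
        ("ie", "i:"),
        ("eu", "OY"),
        ("au", "aU"),
        ("ß", "s"),
        ("ä", "E"),
        ("ö", "2"),
        ("ü", "y"),
        ("w", "v"),
        ("z", "ts"),
    ],
    key=lambda rule: -len(rule[0]),  # try longer graphemes first ("sch" before "ch")
)

# Letters passed through unchanged; anything else (e.g. c, q, v, y, accents)
# is treated as ambiguous and flagged for review.
PLAIN = set("abdefghijklmnoprstux")


def guess_pattern(word):
    """Return (pattern, needs_review) for a single German word."""
    word = word.lower()
    out = []
    needs_review = False
    i = 0
    while i < len(word):
        for graph, phon in RULES:
            if word.startswith(graph, i):
                out.append(phon)
                i += len(graph)
                break
        else:  # no multi-letter or special rule matched at position i
            ch = word[i]
            if ch in PLAIN:
                out.append(ch)
            else:
                out.append("?")
                needs_review = True
            i += 1
    return "".join(out), needs_review


def extend_dictionary(dictionary, new_words):
    """Add heuristic guesses for unseen words; return the words needing review."""
    review = []
    for word in new_words:
        if word in dictionary:
            continue  # keep existing (human-checked) entries untouched
        pattern, needs_review = guess_pattern(word)
        dictionary[word] = pattern
        if needs_review:
            review.append(word)
    return review
```

Each pass of the loop in step 3 would then mean: run `extend_dictionary` on a new batch of words, correct the flagged ones by hand, and refine `RULES` wherever the same correction keeps coming up.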
Great - if everyone agrees…
Thanks for the pointer. I know Mycroft but haven’t checked out their app yet. Will do ASAP.
Btw: do you guys use Slack or something similar for development? Discourse is nice (although logging in is CUMBERSOME…), but not really suitable for joint work, I think.
PS: … and I am curious how to reply to (and beneath) individual passages, like you did with my previous post, @dkreutz?
OK, I checked out Mycroft’s Mimic Pronounce, as they call it, and to me it looks like this is NOT open source!? I could not find the source code, either on their GitHub repo or on the website. The “Recording Studio” is something else.
Hi again, of course I haven’t followed all of your discussions so far, so I would be grateful if you could point me in the right direction here, thanks…
What is the point of using Mimic Pronounce if you cannot use it for your own data (or can you?)? It is not even adaptable yet (because it is “closed-source”).
So, should I then follow my “plan” to extend the de_list from espeak-ng as described above? The web app wouldn’t be a top priority on my list for now, but rather a possible milestone if Mimic Pronounce is not available (or does anyone know of other tools?).
As neither @mrthorstenm nor I use Slack, and you seem to be interested in Mycroft anyway, would you mind joining us on the Mycroft-chat channel ~language-de?
Hello @mrthorstenm, I’ve been following your work for a while; a big thanks for your efforts! Unfortunately, the SoundCloud link seems to be broken. Also, can we hope for a new release of your raw data any time soon? I did some promising experiments with your January data and would like to repeat them with the current version.
Edit: the SoundCloud link works now; perhaps it was just a temporary outage.
We (@dkreutz, @repodiac, @baconator, gras64 (from the Mycroft community) and I) are currently trying different parameter settings for Mozilla-based Tacotron 1 and 2 training. Nothing is ready to show yet, but we’re on our way.
In parallel, @repodiac is working on improving German pronunciation with kind support and know-how from @nmstoker.
Thanks for all your amazing support. Hopefully we can provide a free-to-use German voice model of acceptable quality.
@dkreutz and @baconator are still trying out different training configurations, while @repodiac is becoming an expert on German phonemes and espeak(-ng), providing great support.
Since I am dissatisfied with the “text length vs STD” graph from the “AnalyzeDataset” notebook, I have returned to the microphone to record more phrases (around 3k) with a character length between 100 and 180. This will take some weeks.
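For anyone who wants to check their own dataset in a similar way, here is a rough Python sketch of what I understand that plot to show: for each transcript length in characters, the number of clips plus the mean and standard deviation of their audio durations. This is an assumption on my part and does not reproduce the notebook's code; `length_stats` and `sparse_lengths` are hypothetical helper names, and the input is assumed to be (transcript, duration in seconds) pairs.

```python
from collections import defaultdict
from statistics import mean, pstdev


def length_stats(samples):
    """samples: iterable of (transcript, audio_duration_seconds) pairs.

    Returns {text_length: (count, mean_duration, std_duration)},
    ordered by text length. Uses population std (0.0 for a single clip).
    """
    buckets = defaultdict(list)
    for text, duration in samples:
        buckets[len(text)].append(duration)
    return {
        length: (len(durations), mean(durations), pstdev(durations))
        for length, durations in sorted(buckets.items())
    }


def sparse_lengths(stats, lo, hi, min_count):
    """Text lengths in [lo, hi] that have fewer than min_count clips."""
    return [n for n in range(lo, hi + 1) if stats.get(n, (0,))[0] < min_count]
```

Lengths with very few clips give a noisy (or zero) std estimate, so something like `sparse_lengths(stats, 100, 180, 10)` would list the character lengths where more recordings are most needed; the threshold of 10 is an arbitrary example.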