I think we can create a GitHub page to host community-driven models (including models I trained) to enable better distribution.
I wonder who would like to share models and help with that?
Count me in @erogol.
I've already put some German models I know (based on my dataset) on my GitHub page here:
Thank you @erogol for all of the excellent work on MozillaTTS
Sure, I'd be happy to contribute! The only problem is that the models I've trained are not compatible with upstream MozillaTTS. The vocoders should be fine, however.
I made a few tweaks to have more control over the phonemes. Specifically:
- phoneme_backend: an option that lets me use gruut instead of phonemizer
- characters.sort_phonemes: a boolean that disables phoneme sorting in the text utils
- characters.eos_bos_phonemes: a boolean that disables the addition of EOS/BOS symbols
Mostly, these changes ensure that the characters.phonemes list is preserved in order, and that nothing (besides the pad symbol) is automatically added.
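In my fork's config.json this ends up looking roughly like the sketch below (illustrative only - the phoneme list is omitted and the exact field placement is from my setup):
{
  "phoneme_backend": "gruut",
  "characters": {
    "pad": "_",
    "phonemes": "...",
    "sort_phonemes": false,
    "eos_bos_phonemes": false
  }
}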
But the use of gruut over phonemizer is probably going to be a show stopper for most people. Phonemes in gruut come from pre-built dictionaries or pre-trained grapheme-to-phoneme models, which lets me do some neat things like apply accents to voices. It also does tokenization, text cleaning, and number/currency expansion with the help of Babel and num2words.
Let me know how I can help
Hi @synesthesiam - you suggested gruut might be a "show stopper for most people". How does it compare with using phonemizer with espeak-ng as a backend?
Is one of the concerns that it doesn't have such broad language coverage?
If the phoneme symbols are consistent then presumably people can switch back and forth between it and phonemizer to see how it compares - I'd be interested to give that a go; is there anything I should bear in mind when trying it?
Happy to move this discussion into a separate thread if that's better.
Both gruut and phonemizer produce IPA, but gruut uses pre-built lexicons and g2p models. I haven't tested how consistent the IPA is between the two, but I'd expect it to be pretty good for U.S. English (gruut's U.S. English phoneme inventory is here).
For me, an important feature of gruut is that a word can have multiple pronunciations. "read", for example, has both /ɹɛd/ (like "red") and /ɹiːd/ (like "reed"). You can get the second pronunciation in gruut with "read_2" in your text (with word indexes enabled).
Thanks! Gruut has two stages: (1) tokenization and (2) phonemization. The command-line tool takes text in the first stage and produces JSON for the second. You can skip the first stage if you know exactly what you want:
$ echo '{ "clean_words": ["this", "is", "a", "test"] }' | bin/gruut en-us phonemize | jq .pronunciation
[
[
"Ć°",
"ÉŖ",
"s"
],
[
"ÉŖ",
"z"
],
[
"É"
],
[
"t",
"É",
"s",
"t"
]
]
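If you want a rough side-by-side with phonemizer/espeak-ng on the same text, its command-line tool takes plain text on stdin (flags from memory, so double-check against phonemize --help):
$ echo 'this is a test' | phonemize --language en-us --backend espeak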
Might be a good idea. I'd be interested in feedback for non-English languages especially.
I guess to accommodate your models, first we need to enable Gruut in TTS.
But maybe gruut and phonemizer generate the same outputs, or at least use the same IPA characters. In that case, we can replace Gruut with phonemizer to kick-start your models in TTS.
Thanks, @erogol
As I get better at training models too, I'd also be happy to contribute some phonemizer-based models for the community.
I have a 2080 Ti and 3x 1060s (6 GB). Any tips on how I might get models trained as fast as possible?
two tricks for training faster
Two quick questions:
What is a reasonable way to host the model itself? Is something like a link to a shared Google Drive folder alright?
What are your thoughts on updating a model? I have a Tacotron2 model with an accompanying wavegrad vocoder - it's able to produce pretty good quality output (like this), but the current model has a few issues with the stopnet and sometimes with alignment, which I think I could improve upon. I'm wondering whether it's better to a) post sooner with the current one and then update it, or b) wait until I've got an improved version.
I'm leaning towards a), but happy to go with b) if that's easier - it may also avoid raising people's hopes only for them to run into cases where the model does poorly (which are reasonably common at the moment; a few issues can be heard at times in this sample; often small tweaks to the sentence can help, but I'm hoping to get it beyond the need for that).
I've got a TTS model and vocoder I'm happy to share here:
EK1: https://drive.google.com/drive/folders/1K8g9Lh23MBWSU2IqinrMEnWCT3ZlGV2W?usp=sharing
Produced / usable with this commit c802255
The dataset is from M-AILABS / LibriVox - the Queen's English set available here
The earlier issues I ran into seem largely absent in the latest r=2 checkpoint and the overall quality is good at that stage when used with the vocoder. They'd crept in when transitioning to r=1.
Here are two samples (one of which is the same text it was slightly struggling with in the earlier link I provided in my other comment)
Q-Learning example (260k / 200 iterations in vocoder)
Sherlock sample (260k / 200 iterations in vocoder)
As usual you can speed it up (at the cost of quality) by bringing the vocoder iterations down.
NB: To use it for inference with the server a small change is needed - see here (thanks to @sanjaesc)
Except for the background noise, which turns on and off during speech, the examples sound great and especially natural - I like it. M-AILABS / LibriVox seems to contain some data jewels.
Did you upsample the WAV files from M-AI Labs? I noticed your model is 22050 Hz, but the files (at least mine) are 16 kHz.
Yes, sorry forgot to mention that!
Hm, facing the same challenge to upsample. @nmstoker what tool did you use for that? ffmpeg, sox, ...?
To convert from the 16 kHz supplied to the 22,050 Hz desired (to match other models/vocoders I had) I used sox, with this line:
for f in wavs/*.wav; do sox "$f" -r 22050 "wavs_22050/${f%%}"; done
I think it puts the output in a wavs folder inside the wavs_22050 folder, so you would need to create both before running it; if I'd had time I'd have looked at a way to avoid that (it's probably easy for any bash masters!)
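Something along these lines should avoid the nested folder (an untested sketch - basename drops the wavs/ prefix so the files land directly in wavs_22050):
mkdir -p wavs_22050
for f in wavs/*.wav; do sox "$f" -r 22050 "wavs_22050/$(basename "$f")"; done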
The step before was simply combining all the speaker's wav files from the two books into a single folder (the naming convention means they don't collide). Then I merged the metadata CSV files and did the usual shuffle and split into training and test. In case anyone hasn't seen, I included the training and test metadata files I used with the model in the Google Drive folder, which might save a little time.
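For anyone repeating this, the merge/shuffle/split boils down to something like the lines below (the paths and the 500-line hold-out are placeholders; the metadata files are the usual pipe-delimited, header-less CSVs):
cat book1/metadata.csv book2/metadata.csv | shuf > metadata_all.csv
head -n -500 metadata_all.csv > metadata_train.csv   # GNU head: everything except the last 500 lines
tail -n 500 metadata_all.csv > metadata_val.csv      # hold out the last 500 shuffled lines for testing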
Lastly, one basic change I did apply to the metadata files was to remove the chapter introduction lines ("Chapter One…"), because the narrator reads them more in her native accent than the main narration. It might be a bit too manual, but I suspect a little more could be done to improve things by removing samples where a clearly distinct accent is used (I have a feeling some characters sound Irish or Scottish). However, with something near to 40+ hours of audio it didn't seem to be too bad, although there are some alignment issues which I suspect come from pauses and may also be contributed to by other factors like the accent issues (but that's just a guess!)
Hope that helps!
Thanks a lot, sox seems to be faster, but I thought I had read somewhere that it is sometimes not as "good" as ffmpeg. But that could have been for videos. Thanks again, will try that for some of my material.