I think we can create a GitHub page to host community-driven models (including models I trained) to enable a better distribution.
I wonder who would like to share models and help with that?
Count me in @erogol .
I’ve already put some German models I know (based on my dataset) on my GitHub page here:
Thank you @erogol for all of the excellent work on MozillaTTS
Sure, I’d be happy to contribute! …the only problem is that the models I’ve trained are not compatible with upstream MozillaTTS. The vocoders should be fine, however.
I made a few tweaks to have more control over the phonemes. Specifically:

- a phoneme_backend option that lets me use gruut instead of phonemizer
- a characters.sort_phonemes boolean that disables phoneme sorting in the text utils
- a characters.eos_bos_phonemes boolean that disables the addition of EOS/BOS symbols

Mostly, these changes ensure that the characters.phonemes list is preserved in order, and that nothing (besides the pad symbol) is automatically added.
But the use of gruut over phonemizer is probably going to be a show stopper for most people. Phonemes in gruut come from pre-built dictionaries or pre-trained grapheme-to-phoneme models, which lets me do some neat things like apply accents to voices. It also does tokenization, text cleaning, and number/currency expansion with the help of Babel and num2words.
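For example, chaining both stages on raw text looks roughly like this (a sketch only - the first-stage subcommand name and the fact that it reads text on stdin are assumptions here, so check the gruut README for the exact invocation):
# Sketch: stage 1 tokenizes/cleans the text (expanding the currency amount),
# stage 2 phonemizes the resulting clean words
$ echo 'The book costs $5.' | bin/gruut en-us tokenize | bin/gruut en-us phonemize | jq .pronunciation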
Let me know how I can help
Hi @synesthesiam - you suggested gruut might be a “show stopper for most people”. How does it compare with using phonemizer with espeak-ng as a backend?
Is one of the concerns that it doesn’t have such broad language coverage?
If the phoneme symbols are consistent then presumably people can switch back and forth between it and phonemizer to see how it compares. I would be interested to give that a go - is there anything I should bear in mind when trying it?
Happy to move this discussion into a separate thread if that’s better.
Both gruut and phonemizer produce IPA, but gruut uses pre-built lexicons and g2p models. I haven’t tested how consistent the IPA is between the two, but I’d expect it to be pretty good for U.S. English (gruut’s U.S. English phoneme inventory is here).
For me, an important feature of gruut is that a word can have multiple pronunciations. “read”, for example, has both /ɹɛd/ (like “red”) and /ɹiːd/ (like “reed”). You can get the second pronunciation in gruut with “read_2” in your text (with word indexes enabled).
Thanks! Gruut has two stages: (1) tokenization and (2) phonemization. The command-line tool takes text in the first stage and produces JSON for the second. You can skip the first stage if you know exactly what you want:
$ echo '{ "clean_words": ["this", "is", "a", "test"] }' | bin/gruut en-us phonemize | jq .pronunciation
[
  [
    "ð",
    "ɪ",
    "s"
  ],
  [
    "ɪ",
    "z"
  ],
  [
    "ə"
  ],
  [
    "t",
    "ɛ",
    "s",
    "t"
  ]
]
Might be a good idea. I’d be interested in feedback, especially for non-English languages.
I guess to accommodate your models we first need to enable gruut in TTS.
But maybe gruut and phonemizer generate the same outputs, or at least use the same IPA characters. In that case, we can replace gruut with phonemizer to kick-start your models in TTS.
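A quick way to spot-check that would be to flatten gruut’s JSON into a plain IPA string and put it next to phonemizer’s output for the same words (a rough sketch - it assumes phonemizer’s phonemize command-line tool with the espeak-ng backend is installed, and stress marks may still differ):
# Sketch: flatten gruut's per-word phonemes into a single IPA string...
$ echo '{ "clean_words": ["this", "is", "a", "test"] }' | bin/gruut en-us phonemize | jq -r '.pronunciation | flatten | join(" ")'
# ...and phonemize the same sentence with phonemizer/espeak-ng for comparison
# (-l is the language/voice, -b the backend)
$ echo 'this is a test' | phonemize -l en-us -b espeak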
Thanks, @erogol
As I get better at training models too, I’d also be happy to train some phonemizer-based models for the community.
I have a 2080 Ti and 3x1060’s (6GB). Any tips on how I might get models trained as fast as possible?
two tricks for training faster
Two quick questions:
What is a reasonable way to host the model itself? Is something like a link to a shared Google Drive folder alright?
What are your thoughts on updating a model? I have a Tacotron2 model with an accompanying wavegrad vocoder - it’s able to produce pretty good quality output (like this) but the current model has a few issues with the stopnet and sometimes with alignment, which I think I could improve upon. I’m wondering whether it’s better to a) post sooner with the current one and then update it, or b) wait until I’ve got an improved version.
I’m leaning towards a), but I’m happy to go with b) if that’s easier and it would avoid raising people’s hopes only for them to run into cases where the model does poorly (which are reasonably common at the moment; a few such issues can be heard at times in this sample). Often small tweaks to the sentence help, but I’m hoping to get it beyond the need for that.
I’ve got a TTS model and vocoder I’m happy to share here:
EK1: https://drive.google.com/drive/folders/1K8g9Lh23MBWSU2IqinrMEnWCT3ZlGV2W?usp=sharing
Produced / usable with this commit c802255
The dataset is from M-AILABS / LibriVox - the Queen’s English set available here
The earlier issues I ran into seem largely absent in the latest r=2 checkpoint and the overall quality is good at that stage when used with the vocoder. They’d crept in when transitioning to r=1.
Here are two samples (one of which is the same text it was slightly struggling with in the earlier link I provided in my other comment)
Q-Learning example (260k / 200 iterations in vocoder)
Sherlock sample (260k / 200 iterations in vocoder)
As usual, you can speed it up (at the cost of quality) by bringing the vocoder iterations down.
NB: To use it for inference with the server a small change is needed - see here (thanks to @sanjaesc)
Except for the background noise, which turns on and off during speech, the examples sound great and especially natural - I like it. M-AILABS / LibriVox seems to contain some data jewels.
Did you upsample the WAV files from M-AILABS? I noticed your model is 22050 Hz, but the files (at least mine) are 16 kHz.
Yes, sorry forgot to mention that!
Hm, I’m facing the same upsampling challenge. @nmstoker, what tool did you use for that? ffmpeg, sox, …?
To convert from the 16 kHz supplied to the 22050 Hz desired (to match other models/vocoders I had) I used sox, with this line:
for f in wavs/*.wav; do sox "$f" -r 22050 "wavs_22050/${f%%}"; done
I think it puts the output in a wavs folder inside the wavs_22050 folder, so you would need to create both before running it; if I’d had time I’d have looked at a way to avoid that (it’s probably easy for any bash masters!)
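Stripping the directory with basename would probably avoid the nested folder - an untested sketch:
# Untested sketch: resample straight into wavs_22050/ without the extra wavs/ level
mkdir -p wavs_22050
for f in wavs/*.wav; do sox "$f" -r 22050 "wavs_22050/$(basename "$f")"; done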
The step before was simply combining all the speaker’s wav files from the two books into a single folder (the naming convention means they don’t collide). Then I merged the metadata CSV files and did the usual shuffle and split into training and test. In case anyone hasn’t seen, I included the training and test metadata files I used with the model in the Google Drive folder, which might save a little time.
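The merge/shuffle/split step was nothing fancy - something along these lines (the filenames here are made up and the test-set size is arbitrary):
# Rough sketch: concatenate the two books' metadata, shuffle,
# then carve off a small test set (here the first 100 lines)
cat book1_metadata.csv book2_metadata.csv | shuf > metadata_all.csv
head -n 100 metadata_all.csv > metadata_test.csv
tail -n +101 metadata_all.csv > metadata_train.csv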
Lastly, one basic change I did apply to the metadata files was to remove the chapter introduction lines (“Chapter One…”), because the narrator reads them more in her native accent than the main narration. It might be a bit too manual, but I suspect a little more could be done to improve things by removing samples where a clearly distinct accent is used (I have a feeling some characters are Irish- or Scottish-sounding). However, with something near to 40+ hours of audio it didn’t seem to be too bad, although there are some alignment issues which I suspect come from pauses and may also be contributed to by other factors like the accent issues (but that’s just a guess!)
Hope that helps!
Thanks a lot, sox seems to be faster, but I thought I had read somewhere that it is sometimes not as “good” as ffmpeg. But that could have been for videos. Thanks again, will try that for some of my material.
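In case it helps with comparing the two, the ffmpeg equivalent of that sox resample should be roughly this (I haven’t compared the output quality myself):
# ffmpeg equivalent for one file; wrap it in the same for-loop to batch
ffmpeg -i input.wav -ar 22050 output.wav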