I am trying to prepare cleaners.py for Turkish, and I want to train on phonemes. However, I am confused by what config.json, the FAQ, and the portuguese_cleaners function in cleaners.py say.
My question:
The docstring of portuguese_cleaners in cleaners.py says that phonemizer already handles expanding abbreviations and numbers. If so, can I skip writing a turkish_cleaners function and just use “phoneme_cleaners” as “text_cleaner”? How can I check whether Phonemizer expands abbreviations and numbers for Turkish? How does phoneme_cleaners differ from the other cleaners, and why is it the one used in config.json?
The FAQ says:
- If you have a dataset with a different alphabet than English Latin, you need to add your alphabet in utils.text.symbols.
- If you use phonemes for training and your language is supported here, you don’t need to do that.
- Write your own text cleaner in utils.text.cleaners. It is not always necessary, except when you have a different alphabet or language-specific requirements.
- This step is used to expand numbers and abbreviations and to normalize the text.
In step 2 it says: “If you use phonemes for training and your language is supported [here](https://github.com/bootphon/phonemizer#supported-languages), you don’t need to do that.” I checked, and Turkish is supported, so I don’t need to change utils.text.symbols.
Also, when I examined cleaners.py, I came across this docstring in the Portuguese cleaner:

    def portuguese_cleaners(text):
        '''Basic pipeline for Portuguese text. There is no need to expand abbreviation and
        numbers, phonemizer already does that'''
        text = lowercase(text)
        text = replace_symbols(text, lang='pt')
        text = remove_aux_symbols(text)
        text = collapse_whitespace(text)
        return text
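For comparison, a Turkish counterpart might look roughly like the sketch below. This is only an assumption about what such a cleaner could contain, not code from the repository: turkish_cleaners, turkish_lowercase, and the local collapse_whitespace are hypothetical names written here so the example runs on its own. The one Turkish-specific detail it illustrates is that Python's plain str.lower() does not produce the dotless ı, so the two capital I letters are mapped by hand first.

```python
import re

# Hypothetical stand-in for the whitespace helper in cleaners.py,
# redefined here so the sketch is self-contained.
_whitespace_re = re.compile(r"\s+")


def turkish_lowercase(text):
    # str.lower() turns "I" into "i" and "\u0130" (dotted capital I) into
    # "i" plus a combining dot, so both are mapped explicitly beforehand:
    # "\u0130" -> "i" and "I" -> "\u0131" (dotless lowercase i).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def collapse_whitespace(text):
    # Replace any run of whitespace with one space and trim the ends.
    return _whitespace_re.sub(" ", text).strip()


def turkish_cleaners(text):
    """Sketch of a basic Turkish pipeline (assumed; no number or
    abbreviation expansion is attempted here)."""
    text = turkish_lowercase(text)
    text = collapse_whitespace(text)
    return text
```

Whether this is enough depends on whether the phonemizer really expands numbers and abbreviations for Turkish; if it does not, an expansion step would have to be added before the lowercase call.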
But config.json has a section like this:

    // DATA LOADING
    "text_cleaner": "phoneme_cleaners",
    "enable_eos_bos_chars": false,
    "num_loader_workers": 4,
    "num_val_loader_workers": 4,
    "batch_group_size": 0,
    "min_seq_len": 6,
    "max_seq_len": 153,

    // PHONEMES
    "phoneme_cache_path": "phoneme_cache/",
    "use_phonemes": true,
    "phoneme_language": "en-us",
Here text_cleaner is set to phoneme_cleaners. I want to train on phonemes, not on the normal alphabet. For that, should text_cleaner stay as it is:

    "text_cleaner": "phoneme_cleaners",

or should I change it to this?

    "text_cleaner": "turkish_cleaners",
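On the question of how to check whether Phonemizer expands numbers and abbreviations for Turkish, one possible approach is to phonemize a raw form and its spelled-out form and compare the results. The sketch below is an assumption about how such a check could be written: check_expansion, espeak_turkish, and the sample pairs are hypothetical names and data, and espeak_turkish assumes that the phonemizer package with the espeak backend is installed and that "tr" is the language code it expects for Turkish.

```python
def check_expansion(pairs, phonemize_fn):
    """Compare the phonemes of a raw form with those of its spelled-out form.

    pairs: list of (raw_text, expanded_text) tuples.
    phonemize_fn: any callable that maps a string to a phoneme string.
    Returns a list of (raw_text, matches) tuples.
    """
    results = []
    for raw, expanded in pairs:
        same = phonemize_fn(raw).strip() == phonemize_fn(expanded).strip()
        results.append((raw, same))
    return results


def espeak_turkish(text):
    # Assumption: phonemizer and espeak are installed, and "tr" selects
    # Turkish. The import is local so the rest runs without phonemizer.
    from phonemizer import phonemize
    return phonemize(text, language="tr", backend="espeak", strip=True)


# Hypothetical sample pairs: a number and an abbreviation next to the
# way they would be read aloud.
PAIRS = [
    ("3 kitap", "üç kitap"),
    ("Dr. Ahmet", "doktor Ahmet"),
]
```

Calling check_expansion(PAIRS, espeak_turkish) would then show, per sample, whether the raw and spelled-out forms give the same phonemes. Any sample that does not match would be something a turkish_cleaners function has to expand itself.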