Basic Cleaners or Phoneme Cleaners

I am trying to prepare cleaners.py for the Turkish language, and I want to use phonemes for training. But I am confused by what config.json, the FAQ, and the Portuguese function in cleaners.py say.

My question is:
The docstring of the Portuguese function in cleaners.py says that phonemizer already handles expanding abbreviations and numbers. If so, can I skip writing a turkish_cleaners function and just use “phoneme_cleaners” as “text_cleaner”? How can I check whether phonemizer expands abbreviations and numbers for Turkish? How does phoneme_cleaners differ from the other cleaners, and why is it the one used in config.json?

  2. If you have a dataset with a different alphabet than English Latin, you need to add your alphabet in utils.text.symbols.
  • If you use phonemes for training and your language is supported here, you don’t need to do that.
  3. Write your own text cleaner in utils.text.cleaners. It is not always necessary, except when you have a different alphabet or language-specific requirements.
  • This step is used to expand numbers and abbreviations and to normalize the text.

Step 2 says, “If you use phonemes for training and your language is supported [here](https://github.com/bootphon/phonemizer#supported-languages), you don’t need to do that.” I checked, and Turkish is supported, so I should not need to change utils.text.symbols.

Also, when I examined cleaners.py, I came across this docstring in the Portuguese cleaner:

def portuguese_cleaners(text):
    '''Basic pipeline for Portuguese text. There is no need to expand abbreviation and
    numbers, phonemizer already does that'''
    text = lowercase(text)
    text = replace_symbols(text, lang='pt')
    text = remove_aux_symbols(text)
    text = collapse_whitespace(text)
    return text

But there is a section in config.json like

// DATA LOADING
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 4,
"num_val_loader_workers": 4,
"batch_group_size": 0,
"min_seq_len": 6,
"max_seq_len": 153,

// PHONEMES
"phoneme_cache_path": "phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",

Here text_cleaner is set to phoneme_cleaners. I want to train on phonemes, not on the normal alphabet. For that, should the “text_cleaner” entry stay like this?

"text_cleaner": "phoneme_cleaners",

Or should I change it to this?

"text_cleaner": "turkish_cleaners",

@nana_nan, why don’t you join forces with @xox_oxo? You seem to be working on the same problem.

Check the phonemizer docs and test how it handles Turkish. You will usually need some sort of pre-processing. This is done by the cleaner function.

Install phonemizer separately and try it on the command line.
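If you prefer to test from Python instead, something like the following sketch could work. It assumes the phonemizer package and espeak-ng are installed and that the espeak language code for Turkish is "tr"; the helper names and the sample pairs are made up for illustration. The idea is to phonemize a raw form (digits, an abbreviation) and its spelled-out form and see whether the two outputs match.

```python
def phonemize_turkish(text):
    """Phonemize Turkish text with the espeak backend.

    Assumption: phonemizer and espeak-ng are installed, and "tr" is the
    language code for Turkish.
    """
    # Imported lazily so the helpers below work without phonemizer installed.
    from phonemizer import phonemize
    return phonemize(text, language="tr", backend="espeak", strip=True)


def expands_like(phonemize_fn, raw, spelled_out):
    """True if the raw form is phonemized exactly like its spelled-out form,
    which suggests the phonemizer expanded it on its own."""
    return phonemize_fn(raw).strip() == phonemize_fn(spelled_out).strip()


# Hypothetical test pairs: (raw text, how it should be read aloud).
PAIRS = [
    ("3 elma", "üç elma"),
    ("Dr. Ayşe", "doktor Ayşe"),
]


def report(phonemize_fn, pairs=PAIRS):
    """Map each raw sample to whether the phonemizer expanded it."""
    return {raw: expands_like(phonemize_fn, raw, spelled)
            for raw, spelled in pairs}
```

Calling `report(phonemize_turkish)` then gives one True/False per sample. Anything that comes back False is a candidate for handling in your own cleaner.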

The cleaner is set in config.json so that you can switch quickly. If you come up with a good Turkish cleaner, open a PR so others can benefit from it. As you saw in the script, different cleaners perform different pre-processing steps.
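As a starting point, a Turkish cleaner might look roughly like the sketch below. Everything in it is an assumption rather than code from the repository: the abbreviation table is a tiny made-up sample, the helpers are local stand-ins for the ones in cleaners.py, and number expansion is left out entirely. The small `CLEANERS` lookup at the end only illustrates how a "text_cleaner" name from config.json could select a function by name.

```python
import re

# Hypothetical abbreviation table; extend it for your own dataset.
_ABBREVIATIONS = [("dr.", "doktor"), ("prof.", "profesör")]


def turkish_lowercase(text):
    # str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so the Turkish dotted/dotless pairs are handled first.
    return text.replace("İ", "i").replace("I", "ı").lower()


def expand_abbreviations(text):
    # Expects already lowercased text.
    for abbr, full in _ABBREVIATIONS:
        text = re.sub(r"\b" + re.escape(abbr), full, text)
    return text


def collapse_whitespace(text):
    # Local stand-in: squeeze whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def turkish_cleaners(text):
    """Hypothetical Turkish pipeline: lowercase, expand abbreviations, tidy spaces."""
    text = turkish_lowercase(text)
    text = expand_abbreviations(text)
    return collapse_whitespace(text)


# Stand-in for selecting a cleaner by the name given in "text_cleaner".
CLEANERS = {"turkish_cleaners": turkish_cleaners}


def clean(text, cleaner_name):
    return CLEANERS[cleaner_name](text)
```

With this in place, `clean(sentence, "turkish_cleaners")` runs the pipeline, and the output can be fed to phonemizer to see what still needs pre-processing.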
