Persian/Farsi TTS

That’s awesome, I emailed you so that we can start cooperating.

@othiele How could this repo help me see the modifications that should be made to train TTS on another language?

Usually one learns by copying what others are doing and then adapting it to one’s own needs. So I suggested you copy how we did it, and then you can change things for Persian. If you already did that, what exactly is your question? And as @sanjaesc said, please don’t ask us to do it all for you. Btw, @synesthesiam offered his excellent repo as well. Same there: study it and then ask detailed questions.

Thank you, @othiele. If you do come across people who aren’t able to do it themselves for under-served languages, please send them my way. I’m willing to do a lot of the work as long as I have a native speaker to consult with :slight_smile:

@synesthesiam Great to hear that and thanks for all the work you put into new models. Will send people your way :slight_smile:

Are your models open? Would you mind sharing them so they can be linked on https://github.com/mozilla/TTS/wiki/Released-Models?

I’m interested in Farsi TTS. What is the status of your work, @i3130002 @synesthesiam? Do you need help?

I did nothing and was unable to communicate much with him. I had to put off this project because of some personal problems. Still, I imagine that with his GitHub repo you should be able to start working on Farsi/Persian, and if I can do anything, just email me (on Gmail).
Wish you luck :grinning:

Sorry I haven’t been very responsive, @i3130002. I should have some more time now during the holidays.

I’ve made some progress, but I need help now :slight_smile: (see below)

I was able to find some Farsi speech data: I contacted the author of the MirasVoice corpus and got the full set. Unfortunately, it doesn’t have enough data from a single speaker. I might be able to use it in the future for Farsi speech to text in Rhasspy, but it’ll need a lot of pre-processing.

So I will need to collect recordings from a volunteer. But first, I need to develop a set of sentences that have good phoneme coverage. I’ve already added Farsi phonemes to my gruut-ipa library, and I’ve located a large corpus of sentences in OSCAR. But here’s where I’m stuck: numbers.

I use the num2words library to convert digits into words (1 -> one), and it doesn’t support Farsi yet. Would either of you (@i3130002 or @hkalbasi) be able to help me add support?
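As a rough illustration of the digit-to-word step, here is a minimal sketch that replaces every run of digits with whatever a converter function returns. `replace_numbers` and `toy_to_words` are hypothetical names invented for this example; the toy converter only exists so the sketch runs without extra packages. With num2words installed and Farsi support available, one would presumably pass something like `lambda n: num2words(n, lang="fa")` instead (an assumption, not tested here).

```python
import re
from typing import Callable


def replace_numbers(text: str, to_words: Callable[[int], str]) -> str:
    """Replace each run of decimal digits in `text` with its spoken form."""
    # In Python 3, \d matches any Unicode decimal digit, so Persian digits
    # (U+06F0..U+06F9) are matched too, and int() accepts them as well.
    return re.sub(r"\d+", lambda m: to_words(int(m.group())), text)


def toy_to_words(n: int) -> str:
    """Hypothetical stand-in converter: spells out a number digit by digit."""
    names = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    return " ".join(names[int(d)] for d in str(n))


if __name__ == "__main__":
    print(replace_numbers("I have 2 books", toy_to_words))
    # "\u06F1\u06F5" is the number 15 written with Persian digits.
    print(replace_numbers("\u06F1\u06F5", toy_to_words))
```

Keeping the converter as a parameter means the same filter can be reused for any language once a real number-to-words function exists for it.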

Once I can convert numbers to words, I can filter the OSCAR Farsi sentences and find a small set of sentences (usually < 2000) that will provide good phoneme pair examples. After that, we’ll need to find a volunteer with a good microphone and a lot of patience. With this dataset, I’ll be happy to train models for both MozillaTTS and my Larynx fork.
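The sentence selection described above could be sketched as a greedy search over phoneme pairs. This is only one plausible way to do it, not necessarily what the gruut tooling does. The function names are hypothetical, and the corpus is assumed to already be phonemized (each sentence mapped to a list of phoneme labels); the labels in the toy corpus are made up.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def phoneme_pairs(phonemes: List[str]) -> Set[Pair]:
    """Return the set of adjacent phoneme pairs in one sentence."""
    return set(zip(phonemes, phonemes[1:]))


def select_sentences(corpus: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Greedily pick sentences that add the most not-yet-covered pairs."""
    remaining = {sent: phoneme_pairs(ph) for sent, ph in corpus.items()}
    covered: Set[Pair] = set()
    chosen: List[str] = []
    while remaining and len(chosen) < max_sentences:
        # Sentence contributing the largest number of new pairs.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


if __name__ == "__main__":
    # Toy corpus with made-up phoneme labels.
    toy_corpus = {
        "s1": ["k", "e", "t", "aa", "b"],
        "s2": ["m", "a", "n"],
        "s3": ["k", "e", "t"],  # adds nothing once s1 is chosen
    }
    print(select_sentences(toy_corpus, max_sentences=3))
```

The greedy choice is not guaranteed to give the smallest possible set, but it is simple and tends to shrink a large corpus to a small, phonetically rich subset.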

EDIT: Forgot to add one more step: filtering sentences. I usually start with a set of 2000-5000 phonetically rich sentences, and then ask volunteers to help filter out ones that don’t make sense, are offensive somehow, or are something that a real native speaker would never say. This can be done by multiple people in parallel at least, but it’s an important step :+1:

I just made a pull request to add Farsi to the num2words library.

Why don’t we use the Common Voice sentences? There are nearly 7000 sentences in Farsi, which have been reviewed and do not contain numbers.

My friend has volunteered to record the dataset, but I don’t think we have a good microphone. Is an iPhone’s microphone considered good? Can we fix this in software? Please message me in an instant messaging app for more details so I can send you samples of my friend’s voice. (Matrix: @hkalbasi:mozilla.org, Telegram: @hkalbasi)

And another concern: in Farsi, short vowels like e, a, o are optional in writing. For example, کِتاب, which means book, is ketaab, and کَتاب, which is an invalid word, is kataab (notice that the small mark moves from below to above; I hope your browser shows that, but I can also send an image), and everyone writes کتاب without any a or e. This can maybe be handled with a dictionary. But the problem becomes more difficult in the genitive case, which in Farsi is formed with a connecting e. For example, my book in Farsi is کِتابِ مَن = ketaabe man. But everyone writes it کتاب من, and espeak reads it as ketaab man, which is wrong. How can we handle that so our machine can read simple texts without ـَ ـِ ـُ explicitly marked?

(my reply got lost for some reason, so I’m re-typing it)

@hkalbasi, I’ve pulled your changes into my num2words fork. Thank you for such a quick response!

This is a great idea. I looked at the (now) 9000 validated Farsi sentences, and ran them through my phoneme coverage analysis. It looks like we could get excellent coverage even with half (~4000), but the more data the better.

How many sentences is your volunteer willing to read? I contacted you on Matrix; I’ll have to listen to some samples from the iPhone to see if it will be good enough.

I took some time this afternoon to look into this. In your example, is the “e” in “ketaabe man” pronounced? If so, none of the grapheme-to-phoneme systems I tried were able to produce the correct pronunciation.

A dictionary approach can help, but I may need to do some more research if this is a common problem.
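To make the dictionary idea concrete, here is a minimal sketch: strip the short-vowel marks to get the form people actually type, and look that form up in a lexicon of fully marked words. The lexicon, `strip_harakat`, and `vocalize` are hypothetical names for illustration, and the two-word lexicon is just the example from this thread.

```python
from typing import Dict

# Arabic-script short-vowel and related marks, U+064B..U+0652
# (fathatan through sukun). Only these are removed, so letters such as
# alef with madda are left untouched.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(word: str) -> str:
    """Remove short-vowel marks, giving the commonly written form."""
    return "".join(ch for ch in word if ch not in HARAKAT)


KETAAB = "\u06A9\u0650\u062A\u0627\u0628"  # "ketaab" written with a kasra
MAN = "\u0645\u064E\u0646"                 # "man" written with a fatha

# Hypothetical lexicon: unmarked spelling -> fully marked spelling.
LEXICON: Dict[str, str] = {strip_harakat(w): w for w in (KETAAB, MAN)}


def vocalize(text: str) -> str:
    """Restore vowel marks word by word; unknown words pass through."""
    return " ".join(LEXICON.get(strip_harakat(tok), tok) for tok in text.split())


if __name__ == "__main__":
    plain = "\u06A9\u062A\u0627\u0628 \u0645\u0646"  # unmarked "ketaab man"
    print(vocalize(plain))
```

A word-level lookup like this can restore the vowels inside each word, but it cannot add the connecting "e" in "ketaabe man", because that depends on how the words relate in the sentence rather than on either word alone.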

Hi, is there any progress? I’m really looking forward to it.
I’d be glad to help with the progress of Persian TTS.
Cheers!

Last I checked, @hkalbasi’s friend had recorded about 400 phrases out of the 2400.

If you’re interested in recording, please let me know :slight_smile:

Of course, I’d like to help :slight_smile:
BTW, I’m also looking for offline Persian speech recognition in Python, and it would be awesome if someone could help :pray:

I have about 400 hours of Persian speech data, which would be a good start for training a Kaldi speech recognition model.

Unfortunately, some of the audio is not aligned with the corresponding text. I have an audiobook, for example, whose chapters are PDFs. If you would be willing to help get this data split into (sentence audio, text) pairs, I will train the Kaldi model and add it to Rhasspy.

Umm, I’d like to, but it kinda depends on the effort, because I’m looking for this stuff for a personal project. BTW, there are two things: first, there is AlisterTA’s Persian TTS, which I’ve recently been working on, and second, why don’t you just use the Mozilla Common Voice dataset?

The 400 hours includes Common Voice, which is about 270 hours (last I checked). I could try with just Common Voice and see how good the model is.

I came across the AlisterTA Persian TTS project when searching for speech data. Do you have the 30-hour dataset mentioned there? If you’re willing to share it under a Creative Commons license, I will train a TTS model for you.

I’m sorry, but I don’t have that dataset. I did a little research: as he himself tweeted, he is not going to share the database, because he bought some audiobooks from Fidibo and other websites, so sharing it would break copyright law :man_shrugging:

That’s a real shame, so much data :frowning:

If you have a good microphone, and you’re willing to record data that will be shared, PM me :slight_smile:

Yeah :slightly_frowning_face:
I think I could do that, and the mic is my laptop’s, which is not so bad :man_shrugging:
I can email you if you want, so you can tell me what the data should look like and so on…
Thanks :pray: