How to train with other languages

I want to train DeepSpeech on data in other languages, for example Arabic, which has a different alphabet. I have read the wiki and now understand the importer and the CSV file format ("wav_filename,wav_filesize,transcript").
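As a minimal sketch of what such an importer CSV can look like, here is one way to generate it. All paths and transcripts below are placeholders, and the clips are synthesized one-second silences only so the snippet runs end to end; point it at your real Arabic recordings instead.

```python
import csv
import os
import tempfile
import wave

# Placeholder clip directory and transcripts -- replace with your real data.
clips_dir = tempfile.mkdtemp()
transcripts = {
    "sample_0001.wav": "السلام عليكم",
    "sample_0002.wav": "صباح الخير",
}

# Create tiny silent placeholder clips so the example is self-contained.
for name in transcripts:
    with wave.open(os.path.join(clips_dir, name), "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(16000)  # 16 kHz sample rate
        w.writeframes(b"\x00\x00" * 16000)  # one second of silence

# Write the three-column CSV the importer expects.
csv_path = os.path.join(clips_dir, "train.csv")
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for name, text in transcripts.items():
        wav_path = os.path.join(clips_dir, name)
        writer.writerow([wav_path, os.path.getsize(wav_path), text])

# Read it back to verify the layout.
with open(csv_path, encoding="utf-8") as f:
    rows = list(csv.reader(f))
```

Note that the transcript column holds the raw UTF-8 text, so Arabic strings can go in directly as long as the file is written with UTF-8 encoding.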

My question: given that English has the alphabet a, b, c and Arabic has alef, baa, taa, can I use an Arabic transcript and update the alphabet file at inference time to use the Arabic alphabet?

Thank you,

I’m not sure I understand the problem here; the alphabet is also used at training time. So just put the Arabic characters, as they appear in your transcriptions, into your alphabet file :-). People have been able to do this with various non-Latin languages, so it should work in your case as well.
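One practical way to do that step is to derive the alphabet file directly from your training transcripts, so every character in the labels (including the space) is covered. This is a hedged sketch with placeholder transcripts, not the official importer; the one-character-per-line layout matches DeepSpeech's alphabet.txt format.

```python
import os
import tempfile

# Placeholder Arabic transcripts -- in practice, read these from your CSVs.
transcripts = [
    "السلام عليكم",
    "صباح الخير",
]

# Collect every distinct character that appears in the labels.
chars = sorted(set("".join(transcripts)))

# One character per line, space included, as alphabet.txt expects.
alphabet = "\n".join(chars) + "\n"

# Placeholder output path -- point this wherever your training run looks.
out_path = os.path.join(tempfile.mkdtemp(), "alphabet.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(alphabet)
```

Generating the file this way also catches stray characters (Latin digits, punctuation) hiding in the transcripts before training starts.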

Actually, there is no problem! I was confirming that training other languages is doable. Now I should follow the training section here: , right?

Exactly. Make sure you use the proper versions of the binaries if you train from master, since that will depend on TensorFlow r1.6, and inference will fail if you try to use the binaries from v0.1.1. I’ll let you search; this is already extensively documented here and on GitHub :-).

Thank you! I will proceed with finding/generating an Arabic dataset, and will post the progress/results here.

You might also want to contribute that to Common Voice, so that data collection in Arabic can be done?

You mean here:

What I understand from the website is that if I am allowed to contribute, the Arabic language will be added; then I will upload/speak/get others to speak in Arabic and attach the transcripts, and this will become a new dataset. Is this right?

If so, I’m willing to contribute; just point me at the starting point.

Yep, you should have a look at and reach out for help there, but you’ve got the basics right: localize the website, build a text dataset, and then you’re good to go!

Hello sir, can you help me train DeepSpeech for the Arabic language on Colab?