Moroccan Arabic Localization Request

A follow-up to GitHub issue 3163

There are many regional varieties of spoken Arabic, and they are often considered separate languages; Moroccan Arabic is one of them. It should not be treated as equivalent to Standard Arabic, because it differs widely in vocabulary, pronunciation, word meanings, sentence construction, etc.

Our goal is an NLP project that needs a dataset of Moroccan Arabic speech. We can’t add Moroccan Arabic sentences to the Standard Arabic platform, nor can we use a Standard Arabic speech dataset for a Moroccan Arabic project.

Our request is to add Moroccan Arabic as a language, separate from Standard Arabic.


What are some of the reasons that you can’t add Moroccan Arabic to the ar platform? Are there no sentences that are equivalent in both languages? Nearly all languages exhibit variation, and for large languages there is often a substantial amount of variation in all of the ways you state.

This paper states that, in their dataset, the vocabulary overlap is 86%. But vocabulary overlap is not the most important factor. It would be interesting to read more quantitative work on the subject.
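For anyone who wants to try a comparison like this on their own sentence lists, here is a minimal sketch of one possible way to measure vocabulary overlap. The definition used (the share of distinct whitespace-separated tokens in one corpus that also appear in the other) is an assumption and may differ from the one in the paper; the function names and the toy tokens are hypothetical placeholders, not real language data.

```python
def vocabulary(sentences):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for sentence in sentences for token in sentence.split()}


def vocabulary_overlap(source_sentences, reference_sentences):
    """Fraction of the source vocabulary that also occurs in the reference.

    Assumed definition for illustration only; published figures may be
    computed differently (e.g. after normalisation or lemmatisation).
    """
    source_vocab = vocabulary(source_sentences)
    reference_vocab = vocabulary(reference_sentences)
    if not source_vocab:
        return 0.0
    return len(source_vocab & reference_vocab) / len(source_vocab)


# Toy placeholder tokens standing in for two sentence collections.
dialect = ["w1 w2 w3", "w2 w4"]
standard = ["w1 w2 w5", "w4 w6"]
overlap = vocabulary_overlap(dialect, standard)
print(f"vocabulary overlap: {overlap:.0%}")
```

A measure like this says nothing about pronunciation, word meaning, or sentence structure, which is why overlap alone probably can't settle the question.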

Note that this is not just a question about Moroccan; it will also define how Arabic is treated in general within the platform, so Egyptian, Algerian, Tunisian, Levantine, etc. are also relevant. It would also be good to have contributors to the Arabic (ar) code contribute their opinions.

Here are some relevant posts:

It might be worth contacting some of the participants in those discussions to come up with a plan.

In principle, it is possible to create a localization for a variant of a language. Here are all the necessary steps to add a new language to Common Voice: 📖 Readme: How to see my language on Common Voice

In my opinion, the hardest part for Moroccan Arabic will be collecting enough sentences. There is a Moroccan Arabic version of Wikipedia, but it only has around 4000 articles:

You can use the Sentence Extractor, which will give you a maximum of 12 000 sentences; this could get you started. But if you really want to build a speech recognition system, you will need hundreds of thousands of sentences. So this means a lot of manual work in the Sentence Collector.
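As a rough back-of-the-envelope sketch of that gap: the per-article cap of 3 sentences below is an assumption inferred from the two numbers above (12 000 / 4000), and the target of 200 000 is a hypothetical stand-in for "hundreds of thousands", not an official requirement.

```python
# Approximate article count mentioned above.
ARTICLES = 4000
# Assumption: implied by 12 000 extractable sentences / 4000 articles.
MAX_SENTENCES_PER_ARTICLE = 3
# Hypothetical target standing in for "hundreds of thousands".
TARGET_SENTENCES = 200_000

extractable = ARTICLES * MAX_SENTENCES_PER_ARTICLE
remaining = max(TARGET_SENTENCES - extractable, 0)

print(f"Extractable from Wikipedia: {extractable}")
print(f"Still to collect by other means: {remaining}")
print(f"Share covered by extraction: {extractable / TARGET_SENTENCES:.0%}")
```

Under these assumptions, extraction would cover only a small part of what is needed, so most sentences would have to come from other sources.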


Thanks for creating the topic discussion and everyone’s input.

Similar to what has been suggested, you might want to reach out to the contributors who set up Arabic on Common Voice to share learnings. You can see their contact details on Pontoon.

Regarding Language and Accent overall on Common Voice

We want to design a holistic approach to languages and accents that can work across communities. Following community feedback about the current challenges, this is a priority for the 2021/22 roadmap (see the post on the August open sessions to engage with this!). The team is starting to gather input and insights from research scientists, ML engineers, linguistic experts, and community members to map out new language workflows and accent capture mechanisms. These will be opened up to the community for discussion and user testing, so keep an eye out for those posts!
