Arabic dataset and variants

Hey guys,
I’m not a native Arabic speaker, but I can read, write and speak Arabic at a pretty basic level. I tried contributing to the Arabic dataset on Common Voice, and since I’m not sure my accent is good enough, I tried verifying other people’s recordings instead. Apparently many speakers believe they are reading a formal text, so they use a formal variation of the words (adding Tanween/un at the end), which does not reflect how Arabic speakers communicate with those around them.

Now, since Arabic is currently not separated into variants, speakers stick to the formal form in recordings because they think that if they speak their own colloquial dialect, it won’t be understood by others.

So in order for this dataset to be more accurate, some steps should be taken:

  1. Adding variants to Arabic (just like English (Philippines), for example).
  2. Maybe adding a banner explaining that people should use their everyday pronunciation instead of trying to mimic news anchors.
  3. Adding a banner for the validators explaining that they should only verify whatever sounds natural (even if it’s not their own colloquial, they should be able to tell Fus’ha from an everyday colloquial) and potentially reject all formal Arabic because it’s unnatural.

I’d love to hear some comments from native speakers, because I’m still not sure this is the best way to go.



In 2020 we plan to change how we capture the different sounds for the same language, see:

We also want to provide more context on the recording and listen screens so people have more clarity. This is definitely something on our backlog for 2020.

Yeah, that’s a major part of it, but I also think it needs further explanation due to the unnecessary formality.

Yes, that’s definitely something we can include for other languages too.

So if I understand correctly:
You support separating the colloquial dialects, plus adjusting the texts accordingly.

Let me try to work through the following questions with you so we can get a clearer view of the requirements:

  1. When people talk to Alexa or Google Home for example, are they used to speaking to their home technology in Fus’ha?
  2. Do you think Fus’ha should be an additional variant or non-existent?
  3. Why do you think people won’t contribute to dialects? The mission is to make the technology around me speak and understand the way I speak, regardless of whether I’m Iraqi, Lebanese or Israeli.
  4. Do you think the current sentences are too formal? Maybe we should recommend more approachable sentences?

I understand the rules of the language and how it should be pronounced, but this project is about real people and how they interact with IoT devices in the most natural way they know, not about talking to your devices as if they were formal government employees or as if you were reading the news.