I’m not a native Arabic speaker, but I can read, write and speak Arabic at a pretty basic level. I tried contributing to the Arabic dataset on Common Voice, and since I’m not sure my accent is good enough, I tried validating other people’s recordings instead. Apparently many speakers believe they are reading a formal text, so they use a formal variant of the words (adding Tanween/un at the end), which does not reflect how Arabic speakers actually talk to the people around them.
Now, since Arabic is currently not separated into variants, speakers use only the formal form in their recordings, because they think that if they speak their own colloquial dialect it won’t be understood by others.
So in order for this dataset to be more accurate, some steps should be taken:
- Adding variants to Arabic (just like English (Philippines), for example).
- Maybe adding a banner explaining that people should use their everyday pronunciation instead of trying to mimic news anchors.
- Adding a banner for validators explaining that they should only approve recordings that sound natural (even if a recording is not in their own dialect, they should be able to tell Fus’ha from an everyday colloquial), and potentially reject all formal Arabic because it’s unnatural.
I’d love to have some more comments from native speakers because I’m still not sure this is the best way to go.