Arabic dataset and variants

Yaron_Shahrabani · December 19, 2019, 10:02am

Hey guys,
I’m not a native Arabic speaker, but I can read, write and speak Arabic at a pretty basic level. I tried contributing to the Arabic dataset on Common Voice, and since I’m not sure my accent is good enough, I tried verifying other people’s recordings instead. Apparently many speakers believe they are reading a formal text, so they use a formal variation of the words (adding Tanween/un at the end), which does not reflect how Arabic speakers communicate with the people around them.

Now, since Arabic is currently not separated into variants, speakers can only use the formal form in recordings, because they think that if they speak their own colloquial it won’t be understood by others.

So in order for this dataset to be more accurate, some steps should be taken:

Adding variants to Arabic (just like English (Philippines), for example).
Maybe adding a banner explaining that people should use their everyday pronunciation instead of trying to mimic news anchors.
Adding a banner for the validators explaining that they should only verify whatever sounds natural (even if it’s not their own colloquial, they should be able to tell Fus’ha from an everyday colloquial) and potentially reject all formal Arabic because it’s unnatural.

I’d love to have some more comments from native speakers because I’m still not sure this is the best way to go.

Thanks!

nukeador · December 19, 2019, 12:04pm

Hi,

In 2020 we plan to change how we capture the different sounds for the same language, see:

https://discourse.mozilla.org/t/feedback-needed-languages-and-accents-strategy/

We also want to provide more context on the recording and listen screens so people have more clarity. This is definitely something on our backlog for 2020.

Yaron_Shahrabani · December 19, 2019, 12:52pm

Yeah, that’s a major part of it, but I also think it needs further explanation due to the unnecessary formality.

nukeador · December 19, 2019, 12:54pm

Yes, that’s definitely something we can include for other languages too.

Yaron_Shahrabani · December 22, 2019, 10:04am

So if I understand correctly:
You support separating the colloquials, plus adjusting the texts accordingly.

Let me try to work through the following questions with you so we can get a clearer view of the requirements:

When people talk to Alexa or Google Home for example, are they used to speaking to their home technology in Fus’ha?
Do you think Fus’ha should be an additional variant or non-existent?
Why do you think people won’t contribute to dialects? The mission is to make my surrounding technology speak and understand the way I speak, regardless of whether I’m Iraqi, Lebanese or Israeli.
Do you think the current sentences are too formal? Maybe we should recommend more approachable sentences?

I understand the rules of the language and how it should be pronounced, but this project is about real people and how they interact with IoT devices in the most natural way they know, not about talking to your devices as if they were formal government employees or as if you were reading the news.
