Building an Arabic dataset for Common Voice

tinok · January 30, 2019, 4:24pm

Hi all, we are looking for collaborators to build out the various Arabic datasets, but we need to address some major questions first. So far we have 92% of the interface localized and exactly 0 sentences submitted to the Sentence Collector.

If I’m not mistaken, we can now start adding sentences even though the UI localization is not complete (only 33 strings are missing).

But a major question needs to be decided at this point: there is only one “Arabic” in Common Voice right now. Does this refer to Modern Standard Arabic (MSA)? If the goal of this platform is to “help teach machines how real people speak”, then we need the different colloquial Arabic variants kept separate from MSA, e.g. ar-SA and ar-LB. Google’s STT engine claims to support 15 different types of Arabic. As anyone with some knowledge of Arabic knows, both vocabulary and pronunciation vary greatly between countries (and yes, sometimes within them). These locally spoken forms of Arabic are often considered separate languages and should not be treated like Irish, American, or Australian English, which are variants of a single language.

So, I suggest renaming “Arabic” in the sentence collector to “Arabic (MSA)” and adding relevant locally spoken Arabic as separate languages.

For the purpose of language collection, it may make sense to start with MSA and then “translate” sentences to reflect local vocabulary, as needed.

While it may be tempting to focus on MSA at first, it is not a language that most Arabic speakers use naturally with one another, so for the purpose of creating a useful STT engine, MSA may not have much value.

I’d love for other Arabic speakers, especially people with linguistics and translation backgrounds, to weigh in.

