Hi all, we are looking for collaborators to build out the various Arabic datasets, but we need to address some major questions first. So far we have 92% of the interface localized and exactly 0 sentences submitted to the Sentence Collector.
If I’m not mistaken, we can now start the process of adding sentences even while the UI localization is not complete (though only 33 strings are missing).
But a major question needs to be decided at this point: there is only one “Arabic” in Common Voice right now. Does this refer to Modern Standard Arabic (MSA)? If the goal of this platform is to “help teach machines how real people speak”, then we need the different colloquial Arabic variants to be separate from MSA, e.g. ar-SA, ar-LB, etc. Google’s STT engine claims to support 15 different types of Arabic. As anyone with some knowledge of Arabic knows, both vocabulary and pronunciation vary greatly between countries (and yes, sometimes within them). These locally spoken types of Arabic are often considered separate languages, and the differences between them should not be seen as equivalent to those between Irish, American, and Australian English.
So, I suggest renaming “Arabic” in the Sentence Collector to “Arabic (MSA)” and adding the relevant locally spoken varieties of Arabic as separate languages.
For the purpose of sentence collection, it may make sense to start with MSA and then “translate” the sentences to reflect local vocabulary, as needed.
But while it may be tempting to focus on MSA at first, it is not a language that most Arabic speakers use naturally with one another, so for the purpose of creating a useful STT engine, MSA may not have much value.
I’d love for other Arabic speakers, especially people with linguistics and translation backgrounds, to weigh in.