Building an Arabic dataset for Common Voice

Hello, I want to help start the Arabic dataset on Common Voice. I’m a nerd with 18 years of software engineering mess behind me. How can we do that :slight_smile:?

Thank you,

The first step would be localizing the website into Arabic. If you are interested, send an email to me (mikey [at] mozilla [dot] com) and Peiying (pmo [at] mozilla [dot] com) and we will get you set up.

Done, Email sent

Dear Michael,

Do you have a plan to launch Arabic support soon?

I am also interested in this project. Could you tell me when the website will start collecting speech?

Best regards,
Tony

Hi all, we are looking for collaborators to build out the various Arabic datasets, but we need to address some major questions first. So far we have 92% of the interface localized and exactly 0 sentences submitted to the Sentence Collector.

If I’m not mistaken, we can now start the process of adding sentences even while the UI localization is not complete (though only 33 strings are missing).

But a major question needs to be decided at this point: there is only one “Arabic” in Common Voice right now. Does this refer to Modern Standard Arabic (MSA)? If the goal of this platform is to “help teach machines how real people speak”, then we need the different colloquial Arabic variants to be separate from MSA, i.e. ar-SA, ar-LB, etc. Google’s STT engine claims to support 15 different types of Arabic. As anyone with some knowledge of Arabic knows, both vocabulary and pronunciation vary greatly between different countries (and yes, sometimes within them). These different types of locally spoken Arabic are often considered separate languages and should not be seen as equivalent to Irish, American, or Australian English.

So, I suggest renaming “Arabic” in the sentence collector to “Arabic (MSA)” and adding relevant locally spoken Arabic as separate languages.

For the purpose of language collection, it may make sense to start with MSA and then “translate” sentences to reflect local vocabulary, as needed.

But whereas it may be tempting to focus on MSA at first, it is not a language spoken naturally between most Arabic speakers, so for the purpose of creating a useful STT engine, MSA may not have much value.

I’d love for other Arabic speakers, especially people with linguistics and translation backgrounds, to weigh in.

I’m curious about this: is the difference only in how people pronounce the words, while the words themselves are the same?

Both: different pronunciations of the same words and very different vocabulary overall, especially for everyday items. See here and here for examples. The difference between MSA and the local variants is sometimes compared to that between Latin and the various Romance languages, though Latin is no longer used by anyone as a lingua franca. Students of MSA can read the news and watch TV, but need to learn a local Arabic variant separately in order to follow everyday conversations.

Hello everyone,
@tinok You explained it very well :blush: thanks.
I’m Ruba from Palestine, a new contributor on the Localization team, and I’m interested in this project too.
I’m working in Pontoon and yes, the same 33 strings are still missing. I will try to localize them and will ask the Arabic managers to have a look and approve them :slight_smile:
In the meantime, what can I do?

Great to have you on board @ruba.awayes! I would suggest that we start by collecting some sentences in MSA. Whether or not there will be a separate sentence collection, e.g. for Palestinian or Jordanian Arabic, still seems to be up for discussion. But having a dataset in MSA would be good as a template either way. You could add the sentences directly to the Sentence Collector platform.

I haven’t found an existing source for public domain sentences in Arabic that could be imported. NYU’s Arabic Collection Online has lots of books in the public domain that could be used, but they are scanned and would require manual transcription.

Edit: A possible starting point would be to use the large number of English sentences available in various files on GitHub and translate them into Arabic (minus typical English idioms, obviously).

Thank you so much @tinok for your helpful reply.
I will spread the word also to others so that they can contribute :slight_smile:

A quick update: We just uploaded 170 new Arabic (MSA) sentences to the sentence collector to be verified. These sentences were machine translated from the verified English corpus and verified for accuracy by a native speaker. So far 76% of the translations were accurate.

Please help review them here.

We have another 3000 sentences ready to go but need more volunteers: If you can, please open this spreadsheet and mark any sentence that is correct as ‘1’. We will upload verified sentences every few days.

I hope with this method we can get to 5,000 more quickly and start recording audio.

We should still collect sentences from other sources, especially colloquial / conversational speech, and phrases with non-MSA Arabic words.
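
The spreadsheet review step described above (marking correct sentences as ‘1’) could be turned into an upload list with a few lines of Python. This is only a sketch, assuming the spreadsheet is exported as CSV; the column names `sentence` and `verified` are hypothetical and would need to match the real sheet.

```python
import csv
import io


def verified_sentences(csv_text, sentence_col="sentence", flag_col="verified"):
    """Return sentences whose flag column is exactly '1', without duplicates.

    The column names are assumptions for illustration, not the layout
    of the actual review spreadsheet.
    """
    seen = set()
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        sentence = (row.get(sentence_col) or "").strip()
        flag = (row.get(flag_col) or "").strip()
        # Keep only rows a reviewer explicitly marked as correct.
        if flag == "1" and sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result
```

Unmarked rows and rows marked with anything other than ‘1’ are left out, so a half-reviewed sheet can be exported at any time without uploading unreviewed sentences.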

UPDATE: I have now uploaded over 14,000 new Arabic (MSA) sentences to the sentence collector to be verified.

The community’s support in verifying them is greatly appreciated!

UN documents are in the public domain (PD). We used them for Russian; you may find them useful for Arabic as well.

There is a parallel corpus available: https://cms.unov.org/UNCorpus/

I wrote some one-time JS scripts for this task. They are poor quality and not very optimal, but they should work after some modification. I provide them under CC0 1.0, but take note that some functions in the uploader were taken from sentence-collector and are distributed under MPL 2.0, if you care. Here is the parser, which checks, normalizes, and extracts sentences from UNCorpus (only PV records are used): https://hastebin.com/ekayeroyuz.js and here is the uploader: https://hastebin.com/esoqupepil.js
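
For anyone who wants a feel for what a check, normalize, and extract pass over corpus lines might look like for Arabic, here is a minimal sketch in Python. It is not the JS parser linked above and does not read the UNCorpus format; the allowed character set, the `MAX_WORDS` limit, and the function names are all assumptions chosen for illustration, not the Sentence Collector’s actual rules.

```python
import re
import unicodedata

# Assumption: accept only Arabic letters and harakat (U+0621 to U+0652),
# spaces, and a few punctuation marks. Digits and Latin text are rejected.
ALLOWED = re.compile(r"^[\u0621-\u0652 \u060C\u061B\u061F.!]+$")
MAX_WORDS = 14  # hypothetical length limit


def normalize(line):
    """Apply NFC, drop tatweel (U+0640), and collapse runs of whitespace."""
    line = unicodedata.normalize("NFC", line)
    line = line.replace("\u0640", "")
    return " ".join(line.split())


def extract_sentences(lines):
    """Return normalized, de-duplicated lines that pass the checks above."""
    seen = set()
    kept = []
    for raw in lines:
        sentence = normalize(raw)
        if not sentence or sentence in seen:
            continue
        if not ALLOWED.match(sentence):
            continue  # contains digits, Latin letters, or other symbols
        if len(sentence.split()) > MAX_WORDS:
            continue  # too long to read aloud comfortably
        seen.add(sentence)
        kept.append(sentence)
    return kept
```

Rejecting whole lines that contain digits or Latin characters is a deliberately blunt choice: it throws away usable text, but it avoids sentences that readers would pronounce inconsistently.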

Hello,
I am also interested. I have sent an email to pmo@mozilla.com (the other email address is unreachable!).
Thanks for your work.

Hi there,

You can contribute directly from Mozilla’s Pontoon:

https://pontoon.mozilla.org/ar/common-voice/

Also, please check this topic for reference:

Thanks for your contributions!

As a native Arabic speaker, I would say that focusing on MSA is the right thing to do and the more practical thing as well. Here are the reasons:

1- As you said, there are 15 different Arabic dialects. If you focus on MSA, your work will be relevant to all of the 20 countries and ~350 million speakers, whereas if you choose a local dialect you’ll be relevant to roughly 1/15 of the population.

2- All of those countries teach MSA in their schools, so MSA is the common standard language that everyone understands.

3- Adding to the complexity, dialects are not written languages, so there are basically no standards for how words should be spelled. Different people spell words differently. Even the same person may spell the same word differently on different occasions. If you are writing in a dialect, it usually means you are writing to friends or relatives, so you basically do what you want.

4- While dialects are commonly used in conversations, the written language is usually MSA. So, if someone is using your system to generate text it’s more likely that they want to generate MSA.

5- If you can’t support all the different dialects (at least the more common ones) you might put yourself in the middle of political debates. I’ve seen debates and swearing in online forums of people discussing why such and such game was dubbed into Egyptian rather than Saudi or MSA, etc.

In short, supporting MSA is just way easier and less problematic, and as an Arabic speaker myself I would rather see a perfected product, even if it only understands MSA, than see my local dialect half done because the other half of the effort is spent on some other dialect.

Regards,
Sarmad

Hello everybody,
I am a native Arabic speaker. I noticed while contributing that the sentences to read are sometimes ambiguous.
The reason for this ambiguity is that short vowels are usually not written in Arabic, so we Arabic speakers know how to pronounce a word or sentence from the context and the grammar!
Indeed, quite a few words in Arabic are written the same way but have completely different pronunciations, and therefore completely different meanings!
To remedy this, Arabic has Al-Harakat (Arabic diacritics), which, like vowels, show us how to pronounce a word or sentence (https://en.wikipedia.org/wiki/Arabic_diacritics).
For example, أنس can be read as Anas, Ouns, Anis, Annassa, etc.
With the harakat we can distinguish the pronunciations and therefore the meanings: أَنَسْ for Anas, أُنْسْ for Ouns, أَنِسْ for Anis, أَنَّسْ for Annassa, etc.
The Arabic sentences in the CV project are written without the harakat, so some sentences are ambiguous and hard to read, especially since most of them are lines of poetry, some dating from the Middle Ages!
I wanted to let you know about this issue so that we can contribute and maintain the quality!
Thank you!
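
To make the harakat point above concrete, here is a small Python sketch that strips the diacritics from a word and groups vocalized spellings by their bare form. It assumes that “harakat” means the Unicode code points U+064B to U+0652 (fathatan through sukun); the function names are made up for this example and are not part of any Common Voice tool.

```python
from collections import defaultdict

# Assumption: treat U+064B..U+0652 (tanween, fatha, damma, kasra,
# shadda, sukun) as the harakat to detect and strip.
HARAKAT = {chr(code) for code in range(0x064B, 0x0653)}


def strip_harakat(text):
    """Return the text with all harakat removed."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def has_harakat(text):
    """Return True if the text contains at least one haraka."""
    return any(ch in HARAKAT for ch in text)


def group_by_bare_form(words):
    """Map each bare spelling to the distinct vocalized spellings seen for it."""
    groups = defaultdict(list)
    for word in words:
        bare = strip_harakat(word)
        if word not in groups[bare]:
            groups[bare].append(word)
    return dict(groups)
```

Passing the four vocalized spellings from the post to `group_by_bare_form` should put them all under the single bare spelling, which is exactly the ambiguity a reader faces when the harakat are missing. `has_harakat` could likewise be used to count how many collected sentences carry any diacritics at all.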