Building an Arabic dataset for Common Voice

Hello, I want to help start the Arabic dataset on Common Voice. I’m a nerd with 18 years of software engineering mess behind me. How can we do that :slight_smile:?

Thank you,

The first step would be localizing the website into Arabic. If you are interested, send an email to me (mikey [at] mozilla [dot] com) and Peiying (pmo [at] mozilla [dot] com) and we will get you set up.

Done, email sent.

Dear Michael,

Do you have a plan to launch Arabic support soon?

I am also interested in this project. Could you tell me when the website will start collecting speech?

Best regards,
Tony

Hi all, we are looking for collaborators to build out the various Arabic datasets, but we need to address some major questions first. So far we have 92% of the interface localized and exactly 0 sentences submitted to the Sentence Collector.

If I’m not mistaken, we can now start the process of adding sentences even while the UI localization is not complete (though only 33 strings are missing).

But, a major question needs to be decided on at this point: There is only one “Arabic” in Common Voice right now. Does this refer to Modern Standard Arabic (MSA)? If the goal of this platform is to “help teach machines how real people speak” then we need the different colloquial Arabic variants separate from MSA, i.e. ar-SA, ar-LB, etc. Google’s STT engine claims to support 15 different types of Arabic. As anyone with some knowledge of Arabic knows, both vocabulary and pronunciation vary greatly between different countries (and yes, sometimes within them). These different types of locally spoken Arabic are often considered as separate languages and should not be seen as equivalent to Irish, American, or Australian English.

So, I suggest renaming “Arabic” in the sentence collector to “Arabic (MSA)” and adding relevant locally spoken Arabic as separate languages.

For the purpose of language collection, it may make sense to start with MSA and then “translate” sentences to reflect local vocabulary, as needed.

But whereas it may be tempting to focus on MSA at first, it is not a language spoken naturally between most Arabic speakers, so for the purpose of creating a useful STT engine, MSA may not have much value.

I’d love other Arabic speakers, especially people with linguistics and translation backgrounds to weigh in.

I’m curious about this: are the differences only in how people pronounce the words, while the words themselves are the same?

Both: different pronunciations of the same words and very different vocabulary overall, especially for everyday items. See here and here for examples. Sometimes the difference between MSA and local variants is compared to that between Latin and the various Romance languages, though Latin is no longer used by anyone as a lingua franca. Students of MSA can read news and watch TV, but need to learn a local Arabic variant separately in order to follow conversations with normal people.

Hello everyone,
@tinok You explained it very well :blush: thanks.
I’m Ruba from Palestine, a new contributor on the Localization team, and I’m interested in this project too.
I’m working in Pontoon, and yes, the 33 missing strings are still the same. I will try to localize them and will ask the Arabic managers to have a look and approve them :slight_smile:
In the meantime, what can I do?

Great to have you on board @ruba.awayes! I would suggest that we start collecting some sentences in MSA as a starting point. Whether or not there will be a separate sentence collection, e.g. for Palestinian or Jordanian Arabic, still seems up for discussion. But having a dataset in MSA would be good as a template either way. You could add them directly to the Sentence Collector platform.

I haven’t found an existing source for public domain sentences in Arabic that could be imported. NYU’s Arabic Collection Online has lots of books in the public domain that could be used, but they are scanned and would require manual transcription.

Edit: A possible starting point would be to use the large number of English sentences available in various files on GitHub and translate them into Arabic (minus typical English idioms, obviously).

Thank you so much @tinok for your helpful reply.
I will also spread the word to others so that they can contribute :slight_smile:

A quick update: We just uploaded 170 new Arabic (MSA) sentences to the Sentence Collector to be verified. These sentences were machine translated from the verified English corpus and checked for accuracy by a native speaker. So far 76% of the translations have been accurate.

Please help review them here.

We have another 3000 sentences ready to go but need more volunteers: If you can, please open this spreadsheet and mark any sentence that is correct as ‘1’. We will upload verified sentences every few days.

I hope with this method we can get to 5,000 more quickly and start recording audio.

We should still collect sentences from other sources, especially colloquial / conversational speech, and phrases with non-MSA Arabic words.
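
The spreadsheet step above (mark a correct sentence as ‘1’, then upload only the verified ones) could be automated along these lines. This is only a minimal sketch, assuming the spreadsheet is exported as CSV; the column names `sentence` and `correct` are hypothetical and would need to match the real sheet.

```python
import csv
import io


def verified_sentences(csv_text, sentence_col="sentence", flag_col="correct"):
    """Return the unique sentences whose review flag is exactly '1'.

    The column names are assumptions for illustration; adjust them to
    whatever headers the exported spreadsheet actually uses.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    result = []
    for row in reader:
        sentence = (row.get(sentence_col) or "").strip()
        flag = (row.get(flag_col) or "").strip()
        # Keep only reviewed-as-correct, non-empty, not-yet-seen sentences.
        if flag == "1" and sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


# Tiny made-up export: two distinct verified rows, one unreviewed row,
# and one duplicate of a verified row.
sample = (
    "sentence,correct\n"
    "مرحبا بالعالم,1\n"
    "جملة غير صحيحة,\n"
    "كيف حالك اليوم؟,1\n"
    "مرحبا بالعالم,1\n"
)

for line in verified_sentences(sample):
    print(line)
```

The printed lines could then be pasted into the Sentence Collector in batches; unreviewed rows stay in the sheet for the next volunteer.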

UPDATE: I have now uploaded over 14,000 new Arabic (MSA) sentences to the Sentence Collector to be verified.

The community’s support in verifying them is greatly appreciated!

UN documents are in the public domain (PD). We used them for Russian, and you may find them useful for Arabic as well.

There is a parallel corpus available: https://cms.unov.org/UNCorpus/

I wrote some one-time JS scripts for this task. They are poor quality and not very optimized, but they should work after some modification. I provide them under CC0 1.0, but note that some functions in the uploader were taken from sentence-collector and are distributed under MPL 2.0, if you care. Here is the parser, which checks, normalizes, and extracts sentences from UNCorpus (only PV records are used): https://hastebin.com/ekayeroyuz.js and here is the uploader: https://hastebin.com/esoqupepil.js
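
For anyone who cannot open the linked scripts, the check / normalize / extract idea can be sketched in Python. This is not the code behind those links, only an illustration under stated assumptions: the 14-word limit, the ban on digits and Latin letters, and the tatweel stripping are all guesses at reasonable rules, and the function names are hypothetical.

```python
import re
import unicodedata

MAX_WORDS = 14  # assumed upper bound on sentence length, not an official rule

# Basic Arabic letter range (hamza through yeh).
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
# Western digits, Arabic-Indic digits, and Latin letters are rejected (assumption).
DISALLOWED = re.compile(r"[0-9\u0660-\u0669A-Za-z]")


def normalize(sentence):
    """Apply NFC, drop the tatweel (kashida) character, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = text.replace("\u0640", "")
    return " ".join(text.split())


def is_usable(sentence):
    """Check an already-normalized sentence against the assumed rules."""
    words = sentence.split()
    if not words or len(words) > MAX_WORDS:
        return False
    if DISALLOWED.search(sentence):
        return False
    return bool(ARABIC_LETTER.search(sentence))


def extract_sentences(lines):
    """Normalize each line and keep the usable ones, without duplicates."""
    seen = set()
    kept = []
    for line in lines:
        sentence = normalize(line)
        if is_usable(sentence) and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
    return kept
```

Something like `extract_sentences(open("corpus.ar.txt", encoding="utf-8"))` would then yield candidate sentences for review; the file name here is a placeholder.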

Hello,
I am also interested. I have sent an email to pmo@mozilla.com (the other email address is unreachable!).
Thanks for your work.

Hi there,

You can contribute directly through Mozilla’s Pontoon:

https://pontoon.mozilla.org/ar/common-voice/

Also, please check this topic for reference:

Thanks for your contributions!
