Building an Arabic dataset for Common Voice

sentence-collection

(Ahmedadly) #1

Hello, I want to help start the Arabic dataset on Common Voice. I’m a nerd with 18 years of software engineering mess. How can we do that? :slight_smile:

Thank you,


(Michael Henretty) #2

The first step would be localizing the website into Arabic. If you are interested, send an email to me (mikey [at] mozilla [dot] com) and Peiying (pmo [at] mozilla [dot] com) and we will get you set up.


(Ahmedadly) #3

Done, Email sent


(Tonytonyissaissa) #4

Dear Michael,

Do you have a plan to launch Arabic support soon?

I am also interested in this project. Could you tell me when the website will start collecting speech?

Best regards,
Tony


(Tino Kreutzer) #5

Hi all, we are looking for collaborators to build out the various Arabic datasets, but we need to address some major questions first. So far we have 92% of the interface localized and exactly 0 sentences submitted to the Sentence Collector.

If I’m not mistaken, we can now start the process of adding sentences even while the UI localization is not complete (though only 33 strings are missing).

But a major question needs to be decided at this point: there is only one “Arabic” in Common Voice right now. Does this refer to Modern Standard Arabic (MSA)? If the goal of this platform is to “help teach machines how real people speak”, then we need the different colloquial Arabic variants kept separate from MSA, i.e. ar-SA, ar-LB, etc. Google’s STT engine claims to support 15 different types of Arabic. As anyone with some knowledge of Arabic knows, both vocabulary and pronunciation vary greatly between different countries (and yes, sometimes within them). These different types of locally spoken Arabic are often considered separate languages and should not be seen as equivalent to Irish, American, or Australian English.

So, I suggest renaming “Arabic” in the sentence collector to “Arabic (MSA)” and adding relevant locally spoken Arabic as separate languages.
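
As a rough sketch of what separate collections keyed by locale tag could look like: the function names and labels below are hypothetical and are not Common Voice's or the Sentence Collector's actual configuration, and the list of variants is only the two examples mentioned above.

```python
from typing import Optional, Tuple

# Assumed labels for illustration only; which variants get their own
# collection has not been decided in this thread.
ARABIC_COLLECTIONS = {
    "ar": "Arabic (MSA)",
    "ar-SA": "Arabic (Saudi Arabia)",
    "ar-LB": "Arabic (Lebanon)",
}


def split_locale(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag such as 'ar-SA' into (language, region).

    The region is None when the tag has no region part.
    """
    language, _, region = tag.partition("-")
    return language.lower(), (region.upper() if region else None)


def collection_for(tag: str) -> Optional[str]:
    """Return the collection label for a tag, or None if none exists.

    There is deliberately no fallback from a regional variant to MSA,
    since the point is to keep the variants separate.
    """
    language, region = split_locale(tag)
    key = f"{language}-{region}" if region else language
    return ARABIC_COLLECTIONS.get(key)
```

With this layout, a sentence tagged `ar-LB` would go to its own collection, while a tag with no collection yet (say `ar-EG`) returns `None` instead of silently landing in MSA.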

For the purpose of language collection, it may make sense to start with MSA and then “translate” sentences to reflect local vocabulary, as needed.

But whereas it may be tempting to focus on MSA at first, it is not a language spoken naturally between most Arabic speakers, so for the purpose of creating a useful STT engine, MSA may not have much value.

I’d love other Arabic speakers, especially people with linguistics and translation backgrounds to weigh in.


(Rubén Martín) #6

I’m curious about this: is the difference only in how people pronounce the words, while the words themselves are the same?


(Tino Kreutzer) #7

Both: different pronunciations of the same words and very different vocabulary overall, especially for everyday items. See here and here for examples. Sometimes the difference between MSA and local variants is compared to Latin and the various Romance languages, though Latin is no longer used by anyone as a lingua franca. Students of MSA can read news and watch TV, but need to re-learn a local Arabic variant in order to follow conversations with normal people.


(Ruba Awayes) #8

Hello everyone,
@tinok You explained it very well :blush: thanks.
I’m Ruba from Palestine, a new contributor on the localization team, and I’m interested in this project too.
I’m working in Pontoon, and yes, the same 33 strings are still missing. I will try to localize them and will ask the Arabic managers to have a look and approve them :slight_smile:
In the meantime, what can I do?


(Tino Kreutzer) #9

Great to have you on board @ruba.awayes! I would suggest that we start collecting some sentences in MSA as a starting point. Whether or not there will be a separate sentence collection, e.g. for Palestinian or Jordanian Arabic, still seems up for discussion. But having a dataset in MSA would be good as a template either way. You could add them directly to the Sentence Collector platform.

I haven’t found an existing source for public domain sentences in Arabic that could be imported. NYU’s Arabic Collection Online has lots of books in the public domain that could be used, but they are scanned and would require manual transcription.

Edit: A possible starting point would be to use the large number of English sentences available in various files on GitHub and translate them into Arabic (minus typical English idioms, obviously).


(Ruba Awayes) #10

Thank you so much @tinok for your helpful reply.
I will also spread the word to others so that they can contribute :slight_smile: