Request for an Arabic dataset

abdelrahmanka96 · February 6, 2022, 3:48pm

Hello.

I am searching for an Arabic dataset to train a DeepSpeech model. This model will be compared with other models trained on datasets in different languages. To keep the comparisons fair, we need to train on a few thousand hours of data. The biggest publicly available dataset I could find was MGB-2 (1500 hours), but unfortunately it is not properly segmented (long stretches of silence, music playing for extended periods, etc.) and is therefore not the best option. I found a few other datasets (e.g., Common Voice, Arabic Speech Corpus), but these are too small (even put together they amount to less than 100 hours of data).

If anyone can provide resources for more Arabic datasets to help in my search, I would be very grateful.

I apologize if this is not the place to post this.

omh · March 3, 2022, 10:24pm

Hello, I am also trying to train a DeepSpeech model on Arabic using the Common Voice dataset. Did you try training on Common Voice? If yes, what WER did you achieve?
