Hello.
I am searching for an Arabic dataset to train a DeepSpeech model. The model will be compared with other models trained on datasets in other languages, so to keep the comparison fair we need a few thousand hours of training data. The biggest publicly available dataset I could find is MGB-2 (1500 hours). Unfortunately, it is not properly segmented (long stretches of silence, music playing for extended periods, etc.), so it is not the best option. I found a few other datasets (e.g., Common Voice, Arabic Speech Corpus), but these are too small: even combined, they amount to less than 100 hours of data.
If anyone can point me to additional Arabic speech datasets, I would be very grateful.
I apologize if this is not the place to post this.