I really love this idea of collecting the datasets, and I’m wondering whether there is a way to have the text include diacritics (“short vowels”) for languages like Arabic. I think other languages, like Chinese, have something similar. I can help with that if it is needed.
If the sentences are supplied with diacritics, then it is possible. I do not believe it is possible to add diacritics automatically with 100% precision. You can add new sentences using the sentence collector (https://common-voice.github.io/sentence-collector/), but it might be worth engaging with the existing community around Arabic to find out whether this would be desirable.
Yeah, you are correct: we can’t add diacritics automatically with 100% precision. I’ll try to start adding diacritized sentences and see how it works.
Thank you for the suggestion. Is there a specific way to reach the community around the Arabic language?
I don’t know the exact people who are involved, but you could try getting in contact with the people who did the localisation: https://pontoon.mozilla.org/ar/common-voice/contributors/
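For anyone preparing diacritized Arabic sentences as discussed above, here is a minimal sketch of how one might check whether a sentence already carries short-vowel marks before submitting it. It is not part of Common Voice or the sentence collector; the function names `has_diacritics` and `strip_diacritics` are hypothetical, and the choice of Unicode range (U+064B to U+0652 plus U+0670) is an assumption about which marks count as diacritics for this purpose.

```python
import re

# Assumed set of Arabic diacritics (tashkeel): fathatan through sukun
# (U+064B..U+0652) plus the superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def has_diacritics(sentence: str) -> bool:
    """Return True if the sentence contains at least one assumed diacritic mark."""
    return TASHKEEL.search(sentence) is not None


def strip_diacritics(sentence: str) -> str:
    """Return the sentence with all assumed diacritic marks removed."""
    return TASHKEEL.sub("", sentence)


if __name__ == "__main__":
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba" written with fatha marks
    print(has_diacritics(sample))
    print(has_diacritics(strip_diacritics(sample)))
```

This only detects or removes marks that are already present. It does not add diacritics, which, as noted in the thread, cannot be done automatically with 100% precision.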