Support needed to get more sentences in Persian

Hello everyone,

I’ve just noticed that Persian (fa) now has 276 hours of validated voice recordings.

But I’ve just checked, and Persian only has 12K sentences on the system, which means people are recording the same sentences again and again, something we know is not ideal for the quality of the dataset.

This is a call to action for Persian speakers with technical knowledge to help with the Persian Wikipedia extraction:

Important: Please do not use the Sentence Collector to send Wikipedia sentences; we must use the process described in the link above.

This would allow the project to have way more sentences without repetitions, increasing the quality of the Persian dataset.

Thanks!

I shared it with a Persian-speaking computer scientist friend.

I’ve followed this principle and submitted a lot of sentences to the Common Voice Sentence Collector:

Extending our sentence collection capabilities: We are able to use sentences from Wikipedia as long as we don’t extract more than 3 random sentences per article.

But most of them are rejected without any reason.
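
For readers unfamiliar with the rule quoted above (no more than 3 random sentences per article), here is a minimal Python sketch of what that constraint means in practice. It is only an illustration and is not the official extraction process referred to in this thread; the function name `sample_sentences`, the constant `MAX_PER_ARTICLE`, and the input shape (a dict of article title to sentence list) are all hypothetical.

```python
import random

# Limit quoted in the thread: at most 3 random sentences per article.
MAX_PER_ARTICLE = 3


def sample_sentences(articles, max_per_article=MAX_PER_ARTICLE, seed=None):
    """Pick at most `max_per_article` random sentences from each article.

    `articles` is a hypothetical input shape: a dict mapping an article
    title to the list of sentences already split out of that article.
    """
    rng = random.Random(seed)
    picked = []
    for title, sentences in articles.items():
        # Drop duplicate sentences within an article, keeping first-seen order.
        unique = list(dict.fromkeys(sentences))
        # Never ask for more sentences than the article actually has.
        k = min(max_per_article, len(unique))
        picked.extend(rng.sample(unique, k))
    return picked
```

Sampling per article, rather than from one pooled list, is what keeps the per-article cap intact. Sentences selected this way would still need to go through the dedicated Wikipedia process, not the Sentence Collector.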

vox, do you mean you have sent Wikipedia sentences to the Sentence Collector?

If that’s the case, we will have to delete them. As you can see in the linked topic, Wikipedia sentences come with legal requirements that we must meet by using the special process described there.
