I don't think this is a problem right now. If we look at what the project needs in order to grow, thinking in an agile way, there are other goals that come first:
- "October community campaign": promote Common Voice better so that we get more recordings, otherwise whatever we do is useless anyway. In a lot of languages the corpus has still not reached the goal defined 2 years ago when the project started (English being the first example, with 2000 hours)
- https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/ shows that the error rate is already very low, so as a community we can focus on this once all the languages, or at least the majority, have reached the recording hours goal
- Add support for accents (see "🗣 Feedback needed: Languages and accents strategy"), because we are excluding some areas and also losing a lot of information in the dataset (for example, the DS French model uses a dataset of African French accents)
A reminder for anyone who wants to join the discussion: without promotion and awareness of the project, what we are doing is useless. As an example, I have been investing a lot of time in Italian with the community for a year now, and I would prefer to see more people using the project and understanding why they should contribute.
The Italian case is emblematic: we have a lot of contributors, 799 with 47 hours, compared to German with ~7300 and 396 hours or French with ~7000 and 267 hours (numbers from https://voice.mozilla.org/it/languages).
We have more people compared to the recording hours of those languages, but we are not able to keep contributors involved in the project, and I think we need to solve that before moving the project on to other discussions.
It is true that a bad words list is one way to get more people, but right now we are not able to keep the people we already have contributing, and that is the big problem, at least from my point of view.
So, getting this thread back to the discussion: maybe we can add support in the scraper for a second list of sentences with profanities etc., and CV will show them based on the age of the user.
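
To make the idea concrete, here is a rough sketch of how it could work. Everything in it is an assumption for illustration: the function names, the tiny word list, the age threshold of 18, and the user's age being available as a plain number are all hypothetical and not taken from the existing scraper or from CV.

```python
import re

# Hypothetical word list; a real one would be per language and much longer.
PROFANITIES = {"damn", "hell"}

# Assumed threshold, not something decided in this thread.
ADULT_AGE = 18


def contains_profanity(sentence, bad_words=PROFANITIES):
    """Return True if any whole word of the sentence is in bad_words."""
    words = re.findall(r"\w+", sentence.lower())
    return any(word in bad_words for word in words)


def split_sentences(sentences, bad_words=PROFANITIES):
    """Split scraped sentences into a clean list and a profanity list."""
    clean, explicit = [], []
    for sentence in sentences:
        target = explicit if contains_profanity(sentence, bad_words) else clean
        target.append(sentence)
    return clean, explicit


def sentences_for_user(age, clean, explicit):
    """Pick the sentences a user may see; unknown age gets the clean list only."""
    if age is not None and age >= ADULT_AGE:
        return clean + explicit
    return list(clean)
```

Matching on whole words keeps "Hello" or "Shell" out of the second list, and a user with no declared age only gets the clean list, which seems the safer default.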