We regularly get pinged to enable new languages in Common Voice, both to localize the site and to gather new sentences.
I want to bring this conversation to the community because, as we know, for a dataset to start being effective, Deep Speech needs at least 2,000 hours of validated voice and a minimum of 1,000 different speakers.
What should we do with languages where, because of their size, it is not realistic that they will manage to get 1,000 speakers?
Is a smaller dataset still useful for other work not related to Deep Speech?
Smaller datasets can be more than useful for smaller domains.
I am developing a simple Welsh-language voice assistant app for Android using CommonVoice data, DeepSpeech and a simple language model ("What is the weather?", "What is the news?", "Play me some music", etc.). CommonVoice data (and DeepSpeech) is invaluable for us to begin developing such software. In time, I hope the app can stimulate more people to contribute, and thus widen the number of commands and domains.
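To give a rough idea of what the recognition step of such an app can look like, here is a minimal sketch using the deepspeech Python package with an external scorer standing in for the small command-domain language model. The model and scorer file names, and the 16 kHz mono WAV input, are placeholder assumptions, not the actual app's files.

```python
# Minimal sketch: small-domain command recognition with DeepSpeech.
# Assumes a trained acoustic model (.pbmm) and an external scorer built from a
# tiny command-only corpus; all file names here are placeholders.
import wave
import numpy as np
from deepspeech import Model

MODEL_PATH = "welsh_model.pbmm"   # hypothetical acoustic model
SCORER_PATH = "commands.scorer"   # hypothetical command-domain scorer

model = Model(MODEL_PATH)
model.enableExternalScorer(SCORER_PATH)  # biases decoding toward the command domain

def transcribe(wav_path):
    """Read a 16 kHz, 16-bit mono WAV file and return DeepSpeech's transcript."""
    with wave.open(wav_path, "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    return model.stt(audio)

print(transcribe("what_is_the_weather.wav"))
```

A scorer built only from the supported commands keeps the decoder inside the small domain, which is what makes a modest amount of Common Voice data usable here.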
CommonVoice gives a powerful message to users and developers in smaller language communities - that they are not excluded, at least by one "tech giant", by virtue of their size, from the new speech-based web paradigm, and that everyone can make a contribution. I hope Mozilla engages more with these communities so that challenges and successes are shared.
It's great to see Mozilla continue its support for minority languages with its invitation to all languages to contribute to Common Voice. Not all languages are in a similar situation, due to differences in population size and commercial and governmental support, but I believe that all language communities deserve the ability to contribute towards ensuring their language is empowered by voice-based technologies.
It may be that Deep Speech needs the data and speaker amounts you mention, but we're in a fast-developing technological situation where technical and linguistic advances can make a big difference.
I hope we're also in a situation where we would not wish to close the door on minoritized languages due to technical considerations.
One way Mozilla could possibly assist would be to collect experiences from successful data collection campaigns, such as Kabyle and Catalan, and share them so that we can all learn from their successes.
The public campaign in Wales to encourage contributors to Common Voice has raised the profile of Mozilla to its highest level in many years, which has been great for technology in Wales, the Welsh language and Mozilla. Long may it continue!
To clarify: I think Common Voice should not only serve Deep Speech's needs but also other applications we haven't thought about; that's why having your experiences shared here is so important.
My goal is for the Common Voice community to be self-sustaining in a way that provides value to different players, especially small players that are ignored by the tech giants.
I think if people are willing to contribute, why not let them?
The data might also be useful in situations where what is actually being said doesn't matter too much. For instance, I am considering starting a project that would require noise print samples, so the language being spoken makes no difference to me.
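For what it's worth, a "noise print" in this sense can be approximated by averaging the magnitude spectrum of a clip over time. Here is a minimal sketch using librosa; the file name and the 16 kHz sample rate are placeholder assumptions, not details of the project above.

```python
# Minimal sketch: estimate a rough spectral "noise print" from a clip,
# ignoring what is actually being said. File name and sample rate are
# placeholder assumptions.
import librosa
import numpy as np

y, sr = librosa.load("clip.mp3", sr=16000)  # Common Voice clips are MP3
spectrum = np.abs(librosa.stft(y))          # magnitude spectrogram (freq x time)
noise_print = spectrum.mean(axis=1)         # average energy per frequency bin

print(noise_print.shape)  # one value per STFT frequency bin
```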
Under the Common Voice umbrella there are actually two datasets being collected - text and voice.
I am not an expert on voice data, but the textual dataset, even raw text without any annotations, could be very helpful for applications like search, indexing, spellcheckers, dictionaries, etc.
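As a small illustration, even a raw word-frequency list derived from the collected sentences can seed a spellchecker or dictionary. The sketch below assumes the tab-separated validated.tsv layout from a Common Voice release, with a "sentence" column; the file path is a placeholder.

```python
# Minimal sketch: build a word-frequency list from Common Voice sentence text.
# Assumes the release's validated.tsv with a "sentence" column; path is a placeholder.
import csv
import re
from collections import Counter

counts = Counter()
with open("cv-corpus/cy/validated.tsv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        words = re.findall(r"\w+", row["sentence"].lower())
        counts.update(words)

# The most frequent words are a starting point for a dictionary or spellchecker.
for word, n in counts.most_common(20):
    print(word, n)
```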
Many datasets on the market don't have big data either; actually, most of the datasets I have seen only have a few dozen hours.
I believe that although we may not have a thousand hours for most languages, Common Voice data can still be very valuable. It's better to have some than none at all.
Many approaches are available for what the literature describes as "low-resource languages", including transfer learning from models trained on high-resource languages. Every bit of data helps, and 1 hour is much better than 0. Also read about zero-resource learning, where there is actually no training on the target language until it's time to do recognition! In this challenging case you might start with just a target-language word list and a recognizer trained on another language (so not quite "zero"). As soon as you have even a small dataset with transcribed segments, as in Common Voice, you should be able to do much better.
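To make the transfer-learning idea concrete, DeepSpeech itself has flags for initializing from a pretrained checkpoint and re-learning the output layer for a new alphabet. The sketch below wraps that invocation in Python; the flag names follow the DeepSpeech 0.7-era transfer-learning documentation and may differ between versions, and all paths are placeholders.

```python
# Minimal sketch: fine-tune a pretrained English DeepSpeech checkpoint on a
# small Common Voice corpus (transfer learning). Flag names follow the
# 0.7-era docs and may differ between DeepSpeech versions; paths are placeholders.
import subprocess

subprocess.run([
    "python3", "DeepSpeech.py",
    "--train_files", "cv/train.csv",
    "--dev_files", "cv/dev.csv",
    "--test_files", "cv/test.csv",
    "--alphabet_config_path", "alphabet.txt",        # target-language alphabet
    "--load_checkpoint_dir", "english_checkpoint/",  # pretrained English model
    "--save_checkpoint_dir", "finetuned_checkpoint/",
    "--drop_source_layers", "1",                     # re-learn the output layer
    "--epochs", "30",
], check=True)
```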
I believe the Deep Speech model, unmodified, is simply one of the highest-performing architectures when you have lots and lots of data available, but there are hundreds of other architectures around.
In my opinion, what needs the most attention now is prompt design (i.e. the sentence collector). With big recurrent neural networks, I think it's really best to have very little repetition of prompts and of prompt wording. We also need to make sure we're getting speaker IDs right, so that model developers can strictly partition speakers into training and validation sets.
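For anyone wondering what a strict speaker partition looks like in practice, here is a minimal sketch that splits a Common Voice validated.tsv by its "client_id" column so that no speaker appears in both sets. The 90/10 ratio and the file path are placeholder assumptions.

```python
# Minimal sketch: speaker-disjoint train/validation split of a Common Voice TSV.
# Assumes the release's validated.tsv with a "client_id" per clip; the path and
# the 90/10 ratio are placeholder assumptions.
import csv
import random

with open("cv-corpus/cy/validated.tsv", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

speakers = sorted({row["client_id"] for row in rows})
random.seed(0)
random.shuffle(speakers)

cut = int(0.9 * len(speakers))
train_speakers = set(speakers[:cut])

train = [r for r in rows if r["client_id"] in train_speakers]
dev = [r for r in rows if r["client_id"] not in train_speakers]

# No client_id appears in both sets, so validation scores reflect unseen voices.
print(len(train), "training clips,", len(dev), "validation clips")
```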
" * Is a smaller dataset still useful for other work not related with Deep Speech?"
Very much so, if it is collected carefully, speaker metadata is recorded accurately, etc. Here's an example of an interesting (copyrighted) dataset: it's collected from audio bibles in 700 different languages, and has been used to train the Festvox TTS (text-to-speech) system (I believe for all 700 languages): https://github.com/festvox/datasets-CMU_Wilderness. There's no way you would've seen TTS systems for so many languages without the data.