How the Italian DeepSpeech model helped our Mozilla Italia community

I wrote a long article about how my community got the Italian DeepSpeech model through active community organization, along with all the learnings and issues we found.
I hope it can be inspirational for others and help move this project forward; take it as a Christmas present.

Just add your comments below :smiley:


Thanks for sharing @Mte90

Do you have more details on the reaction from press and communities before and after you showcased the Italian model? I’m really interested in reading the feedback you got from them so far.

I see you mention a few existing problems around sentences. Currently we have a process to remove any number of problematic sentences from the Wikipedia import at once, and the plan for 2020 is to evolve the tool to also extract sentences from other open sources (like the European Parliament dataset). I think this is the way to go in order to get a lot of quality sentences.

Considering we have had enough sentences for Italian for a long time, are there other big issues/blockers in the rest of the workflow? (voice recording, voice validation, dataset release, model training)


Reaction from the press is zero because we didn’t do any promotion for this, but we will probably have to in January, when the next dataset is released.
As for communities, right now there is only more interest within our own community, but again, with the next dataset the model will probably be better, so it will be easier to gather more interest.


Our main blocker now is getting in touch with the people who contribute to CV, to better understand what to do. Right now we are working on promotional videos in Italian to explain the project.

I see. To ensure privacy, we don’t expose users’ email addresses to other people on the site, but we are actively working on being able to email people who opted in, in their own language. This is something we want to see in 2020.

Ideally we will be able to email these Italian users and point them to the Italian Discourse, or make a custom CTA to re-engage them.


Thanks a lot. I really appreciate the Italian dataset and I’m going to use it for my experiments.
After browsing the dataset for a while, I would also suggest a different use for it: analyzing illiteracy, updating Tullio De Mauro’s studies.
It is surprising that so many people are unable to correctly read a phrase on the screen.
I suggest modifying the instruction on the Speak form to say, in bold, “read EXACTLY what is on the screen”.

Ciao @Filippo_Davalli,
Can you reach our community on Telegram so we can add this to our queue of tasks? You can find us with @mozitabot :slight_smile:

I know that there are plans for a new review of the audio recordings, which should also fix these issues.