We’re going to stop all feature releases to work on the Common Voice infrastructure for the next couple of months. The full scope of this is still being discussed.
What does this mean and what things are we prioritizing?
Limiting site downtime!
Automating dataset releases!
Site accessibility!
And more
We are excited to give our engineers the time they need to make the site better for everyone and provide the community with the information they need to get the best datasets possible.
Email Newsletters
Email implementation
We would like to be able to offer localized emails to our contributors and are working to make that happen.
Open Voice Data Challenge Pilot partner launch
The Common Voice team has launched a partner pilot to look at how competition and incentives help increase the quantity and quality of data received. To do this, we worked with three partner companies (SAP, IBM, and Lenovo) and a small number of new contributors from those companies. These contributors are currently in week two of a three-week challenge. Once we receive and analyze the results from the challenge, we will decide if it makes sense to roll out this initiative to a larger group.
Do you have more information on the plan for this? I’m concerned about file sizes as the dataset grows - English is already 30 GB at around 700 hours, so a 10k hour dataset will be over 400 GB. If you’ve already downloaded the dataset before, it’s annoying to redownload 400 GB just to get the few GBs that are new in the update.
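For anyone who wants to check the estimate, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes the size grows linearly with hours of audio, which is a simplification; the function name `project_size_gb` is made up for illustration.

```python
def project_size_gb(current_size_gb, current_hours, target_hours):
    """Linearly extrapolate dataset size from hours of audio.

    Assumes the average clip size per hour stays constant as the
    dataset grows, which may not hold in practice.
    """
    gb_per_hour = current_size_gb / current_hours
    return gb_per_hour * target_hours


# Figures quoted in the post: English is about 30 GB at about 700 hours.
projected = project_size_gb(current_size_gb=30, current_hours=700, target_hours=10_000)
print(f"Projected size at 10k hours: about {projected:.0f} GB")
```

With those inputs the projection lands a little under 430 GB, consistent with the "over 400 GB" figure.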
I think the best option is to have regularly scheduled large dataset releases (quarterly?), then once someone has the dataset they can run a script to download only files that are new. That way we could have nightly/weekly updates without it being a big strain on both users’ connections and Mozilla’s bandwidth bill.
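To make the idea concrete, here is a rough sketch of what such a script could look like. Everything about the server side is an assumption on my part: it supposes a manifest that maps each clip's relative path to its size in bytes, and a `base_url` where clips can be fetched one by one. Neither exists today as far as I know, and the names `files_to_fetch` and `sync` are hypothetical.

```python
import urllib.request
from pathlib import Path


def files_to_fetch(manifest, local_dir):
    """Return manifest paths that are missing locally or differ in size.

    `manifest` is a hypothetical {relative_path: size_in_bytes} mapping
    published alongside each dataset update.
    """
    local_dir = Path(local_dir)
    needed = []
    for rel_path, size in manifest.items():
        target = local_dir / rel_path
        if not target.is_file() or target.stat().st_size != size:
            needed.append(rel_path)
    return sorted(needed)


def sync(manifest, base_url, local_dir):
    """Download only the clips that files_to_fetch reports as needed.

    `base_url` is a placeholder for wherever per-file downloads would live.
    """
    local_dir = Path(local_dir)
    for rel_path in files_to_fetch(manifest, local_dir):
        target = local_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(f"{base_url}/{rel_path}", str(target))
```

Comparing sizes is cheap but will not catch a file that changed without changing length; a manifest with checksums would be more robust at the cost of hashing every local file on each run.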
Currently we do not. The engineers are looking into the best way to format this for our systems. We have just started looking at the infrastructure, and realistically we’re looking at early 2020 for a release. We’ll keep updating everyone with progress as we have more information.
the sum of the acoustic model logit values for each timestep/character that contributed to the creation of this transcription.
Instead of using the sum or confidence values without context, let’s train authentic context-aware models of listener intelligibility. Is providing language learners with contextual intelligibility awareness worth ~120 FTE hours of effort?