I am trying to fine-tune the 0.3.0 release using the Common Voice dataset. The dataset totals about 500 hours, but I cannot find a breakdown of how many hours are available for each of the English accents it includes. This matters for us because we are trying to make the model robust across English accents (native and non-native), and the breakdown would help us decide whether we need more training data. Does anybody have this information?
I guess the people from Common Voice would be best suited to answer that.
There is an open issue here, but no sign of progress on automatic statistics.
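In the meantime, you could tally a rough per-accent breakdown yourself from the release metadata. A minimal sketch, assuming the Common Voice CSV (e.g. `cv-valid-train.csv`) carries `accent` and `duration` columns with duration in seconds; the column names may differ in your copy, and many clips leave the accent field blank, so treat the numbers as a lower bound per accent:

```python
# Rough per-accent hour tally from a Common Voice metadata CSV.
# Assumes "accent" and "duration" (seconds) columns exist; if your
# release lacks "duration", you would have to measure the audio files.
import pandas as pd

df = pd.read_csv("cv-valid-train.csv")
hours = (
    df.groupby(df["accent"].fillna("unlabeled"))["duration"]
      .sum()
      .div(3600)  # seconds -> hours
      .sort_values(ascending=False)
)
print(hours.round(1))
```

Running this over each split (train/dev/test) and summing would give you the full picture, including how much audio has no accent label at all.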
I thought 0.3.0 was trained on Common Voice data already. How do we fine-tune it using the same data? I'm a bit confused here.
@qtran, it's not clear whether they used the data for all of the English accents in the Common Voice corpus or just the American accent.
@rajpuneet.sandhu @qtran We trained the 0.3.0 model on all validated English accents.
However, if you want to specialize to a particular accent, you could fine-tune the release model on the Common Voice subset tagged with that accent.
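A minimal sketch of selecting such a subset, assuming the metadata CSV has an `accent` column; `"indian"` here is just an example tag, so check which values actually appear in your copy with `df["accent"].unique()`:

```python
# Filter the Common Voice metadata down to clips tagged with one accent.
# "indian" is an example value, not necessarily the tag in your release.
import pandas as pd

df = pd.read_csv("cv-valid-train.csv")
subset = df[df["accent"] == "indian"]
subset.to_csv("cv-indian-train.csv", index=False)
print(f"kept {len(subset)} of {len(df)} clips")
```

You would still need to convert those clips into the CSV format the training script expects (wav filename, size, transcript) before pointing the fine-tuning run at the 0.3.0 checkpoint.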