Non-English language stats

Hello,

Is there a page where we can see the recording/validation progress for languages other than English? For example, German?

Cheers,
Z.

Not yet, but from discussions with @mhenretty I recall it should happen rather soonish, maybe during July.

We are working on adding this page now:
https://projects.invisionapp.com/share/K4FV4BPY6TX#/screens/296514715

But here are some early figures:

+-------+--------+---------+
| lang  | clips  | votes   |
+-------+--------+---------+
| en    | 541774 | 1137149 |
| fr    |  20759 |   43643 |
| de    |  20516 |   42894 |
| tr    |   1908 |    1396 |
| cy    |   1554 |     509 |
| tt    |   1006 |    1093 |
| cv    |    957 |     103 |
| br    |    636 |     716 |
| ga-IE |     42 |      40 |
| ky    |     12 |      12 |
| kab   |      0 |       0 |
+-------+--------+---------+

Can you update such stats, please?

Catalan was launched a few days ago and we are promoting collaboration among Catalan speakers. Any figure helps :grinning:

Thanks

We hope to have the new stats experience in place by August.

Would it be possible to get manually-updated figures like the ones you provided above in a week or two? As you know, Catalan support was launched recently, and we expect some traction for Catalan after promoting CV on social networks and in the media. Having updated figures in a week or two would help us see the impact.

Thanks in advance.

> Would it be possible to get manually-updated figures like the ones you provided above in a week or two?

Yup, just ping us when you would like that.

Please, @mhenretty, update Common Voice stats :smiley:

Same for Kabyle: 0 :smile:

Here are the new totals (queried yesterday):

+-------+--------+---------+
| name  | clips  | votes   |
+-------+--------+---------+
| en    | 565522 | 1189775 |
| fr    |  29818 |   60636 |
| de    |  27994 |   59656 |
| zh-TW |  18326 |   20396 |
| ca    |   7338 |   15891 |
| kab   |   5475 |   10505 |
| tr    |   2524 |    2719 |
| tt    |   2175 |    2543 |
| cy    |   1741 |     638 |
| cv    |   1314 |    1393 |
| it    |    902 |    1357 |
| br    |    683 |    1257 |
| ky    |    396 |     635 |
| sl    |    388 |     195 |
| ga-IE |    205 |     342 |
+-------+--------+---------+
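
For anyone who wants to compare the two snapshots, here is a minimal sketch that parses tables in the format posted in this thread and computes the growth in clips per language. The names `parse_table` and `clip_growth` are made up for illustration, only a few rows are copied in to keep it short, and only languages present in both snapshots are compared.

```python
def parse_table(text):
    """Parse an ASCII table like the ones above into {lang: (clips, votes)}."""
    rows = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if not cells[1].isdigit():
            continue  # skip the header row
        rows[cells[0]] = (int(cells[1]), int(cells[2]))
    return rows


def clip_growth(before, after):
    """Clips added between two snapshots, for languages present in both."""
    return {
        lang: after[lang][0] - before[lang][0]
        for lang in after
        if lang in before
    }


# A few rows copied from the two tables in this thread.
EARLIER = """
| lang  | clips  | votes   |
+-------+--------+---------+
| en    | 541774 | 1137149 |
| fr    |  20759 |   43643 |
| de    |  20516 |   42894 |
| kab   |      0 |       0 |
"""

LATER = """
| name  | clips  | votes   |
+-------+--------+---------+
| en    | 565522 | 1189775 |
| fr    |  29818 |   60636 |
| de    |  27994 |   59656 |
| ca    |   7338 |   15891 |
| kab   |   5475 |   10505 |
"""

growth = clip_growth(parse_table(EARLIER), parse_table(LATER))
for lang, added in sorted(growth.items(), key=lambda item: -item[1]):
    print(f"{lang}: +{added} clips")
```

Catalan is skipped here because it has no row in the first table; treating it as zero would be an assumption.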

Thank you for the updated stats!

Both German and English have around 5-6K different sentences, if I understand the contents of /server/data/ correctly.

So, on average, we have roughly 6 different recordings of each sentence for German.
For English, we have around 95.
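
The averages above can be rechecked with a rough sketch like the following. It assumes the 5-6K estimate means between 5,000 and 6,000 sentences per language and uses the clip counts from the latest table; the real sentence counts in /server/data/ may differ.

```python
# Clip counts taken from the latest table in this thread.
clips = {"de": 27994, "en": 565522}

# Assumption: "5-6K different sentences" read as a 5,000 to 6,000 range.
sentence_estimates = (5000, 6000)


def recordings_per_sentence(clip_count, sentence_count):
    """Average number of recordings per distinct sentence."""
    return clip_count / sentence_count


for lang, clip_count in clips.items():
    low = recordings_per_sentence(clip_count, max(sentence_estimates))
    high = recordings_per_sentence(clip_count, min(sentence_estimates))
    print(f"{lang}: {low:.1f} to {high:.1f} recordings per sentence")
```

Under those assumptions this gives about 4.7 to 5.6 for German and about 94.3 to 113.1 for English, so the gap between the two languages is roughly a factor of 20.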

What is more important – recording diversity or sentence diversity?
Do we know how the 10,000-hour corpus reported by the DeepSpeech(2) authors is composed? Should we make sure to have more diversity in the English sentences?

Thank you, @mhenretty and @gregor! These figures really help us.

Hey all, since everyone loves stats, we put some more stats onto the Language page.

More stats to come!

@gregor Maybe the copy could say “Total Validated Hrs”?

@kdavis yeah, I agree that it might be confusing. The German translation is actually like that. Wdyt @mbranson? It makes the label take up 2 lines; you can see what that looks like in French or in German.

Yep, sounds good to me @gregor. Let’s just ensure that if the label breaks to two lines, the count breaks to two lines as well. This is more consistent in French than it is in German at the moment.