Non-English language stats


(Zeno Gantner) #1

Hello,

Is there a page where we can see the recording/validation progress for languages other than English, for example German?

Cheers,
Z.


(Lissyx) #2

Not yet, but from discussions with @mhenretty I recall it should happen rather soonish, maybe during July.


(Michael Henretty) #3

We are working on adding this page now:
https://projects.invisionapp.com/share/K4FV4BPY6TX#/screens/296514715

But here are some early figures:

+-------+--------+---------+
| lang  | clips  | votes   |
+-------+--------+---------+
| en    | 541774 | 1137149 |
| fr    |  20759 |   43643 |
| de    |  20516 |   42894 |
| tr    |   1908 |    1396 |
| cy    |   1554 |     509 |
| tt    |   1006 |    1093 |
| cv    |    957 |     103 |
| br    |    636 |     716 |
| ga-IE |     42 |      40 |
| ky    |     12 |      12 |
| kab   |      0 |       0 |
+-------+--------+---------+

(Joan Montané) #4

Can you update such stats, please?

Catalan started a few days ago and we are promoting collaboration among Catalan speakers. Any figure helps :grinning:

Thanks


(Michael Henretty) #5

We hope to have the new stats experience in by August.


(Joan Montané) #6

Would it be possible to get manually-updated figures like the ones you provided above in a week or two? As you know, Catalan support was recently launched, and we expect some traction for Catalan after promoting CV on social networks and media. Updated figures in a week or two would help us see the impact.

Thanks in advance.


(Michael Henretty) #7

> Would it be possible to get manually-updated figures like the ones you provided above in a week or two?

Yup, just ping us when you would like that.


(Joan Montané) #8

Please, @mhenretty, update Common Voice stats :smiley:


(Muḥend Belqasem) #9

Same for Kabyle: 0 :smile:


(Gregor) #10

Here are the new totals (queried yesterday):

+-------+--------+---------+
| lang  | clips  | votes   |
+-------+--------+---------+
| en    | 565522 | 1189775 |
| fr    |  29818 |   60636 |
| de    |  27994 |   59656 |
| zh-TW |  18326 |   20396 |
| ca    |   7338 |   15891 |
| kab   |   5475 |   10505 |
| tr    |   2524 |    2719 |
| tt    |   2175 |    2543 |
| cy    |   1741 |     638 |
| cv    |   1314 |    1393 |
| it    |    902 |    1357 |
| br    |    683 |    1257 |
| ky    |    396 |     635 |
| sl    |    388 |     195 |
| ga-IE |    205 |     342 |
+-------+--------+---------+
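
For anyone who wants to compare the two snapshots in this thread, here is a minimal sketch that parses rows in the table format above and prints the change per language. The helper names `parse_table` and `growth` are made up for illustration, only a few rows from the two tables are included, and treating a language missing from the earlier snapshot as starting from zero is an assumption.

```python
# A few rows from the two tables posted in this thread.
EARLIER = """
| lang  | clips  | votes   |
| en    | 541774 | 1137149 |
| de    |  20516 |   42894 |
| kab   |      0 |       0 |
"""

LATER = """
| lang  | clips  | votes   |
| en    | 565522 | 1189775 |
| de    |  27994 |   59656 |
| ca    |   7338 |   15891 |
| kab   |   5475 |   10505 |
"""


def parse_table(text):
    """Parse 'lang | clips | votes' rows into {lang: (clips, votes)}."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip border lines and the header row (their cells are not numeric).
        if len(cells) == 3 and cells[1].isdigit() and cells[2].isdigit():
            rows[cells[0]] = (int(cells[1]), int(cells[2]))
    return rows


def growth(earlier, later):
    """Return {lang: (new clips, new votes)} between two snapshots."""
    out = {}
    for lang, (clips, votes) in later.items():
        # Assumption: a language absent from the earlier snapshot started at 0.
        old_clips, old_votes = earlier.get(lang, (0, 0))
        out[lang] = (clips - old_clips, votes - old_votes)
    return out


for lang, (d_clips, d_votes) in growth(parse_table(EARLIER), parse_table(LATER)).items():
    print(f"{lang}: +{d_clips} clips, +{d_votes} votes")
```

The same two functions should work on the full tables if they are pasted in unchanged, since border lines and header rows are skipped.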

(Zeno Gantner) #11

Thank you for the updated stats!

Both German and English have around 5-6K different sentences, if I understand the contents of /server/data/ correctly.

So, on average, we have roughly 6 different recordings of each sentence for German.
For English, we have around 95.
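
The averages above can be reproduced with a short sketch. The clip counts come from the latest totals in this thread; the sentence counts are only the rough "5-6K" estimate, so the script prints a range rather than a single number. The function name `recordings_per_sentence` is made up for illustration.

```python
# Clip counts from the latest totals posted in this thread.
clips = {"en": 565522, "de": 27994}

# Assumption: between 5,000 and 6,000 distinct sentences per language
# (the rough "5-6K" estimate, not an exact count).
SENTENCES_LOW, SENTENCES_HIGH = 5_000, 6_000


def recordings_per_sentence(clip_count, sentence_count):
    """Average number of recordings per distinct sentence."""
    return clip_count / sentence_count


for lang, n in clips.items():
    low = recordings_per_sentence(n, SENTENCES_HIGH)
    high = recordings_per_sentence(n, SENTENCES_LOW)
    print(f"{lang}: {low:.1f} to {high:.1f} recordings per sentence")
```

With these inputs, German comes out at roughly 5 to 6 recordings per sentence and English at roughly 94 to 113, so the figures quoted above sit at the ends of that range.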

What is more important – recording diversity or sentence diversity?
Do we know how the 10,000-hour corpus reported by the DeepSpeech(2) authors is composed? Should we make sure to have more diversity in the English sentences?


(Joan Montané) #12

Thank you, @mhenretty and @gweber! These figures really help us.


(Gregor) #13

Hey all, since everyone loves stats, we added some more stats to the Language page:

More stats to come!


(kdavis) #14

@gweber Maybe the copy could say “Total Validated Hrs”?


(Gregor) #15

@kdavis yeah, I agree that it might be confusing. The German translation is actually like that. Wdyt @mbranson? It makes the label take up 2 lines; you can see what that looks like in French or in German.


(Megan Branson) #16

Yep, sounds good to me @gweber. Let’s just ensure that if the label breaks onto two lines, the count breaks onto two lines as well. This is more consistent in French than it is in German at the moment.