Common Voice Toolbox: Beta testers and feature requests needed

I’m building an open-source toolbox for Common Voice and have nearly finished the first two modules, which are mainly oriented toward statistics and visualization.

The projects cover every language in Common Voice. I hope they will be useful to each community and save you some of the time you spend searching for relevant information or processing your data.

I have implemented most of the ideas I had in mind, and I need your feedback on them:

  • What else would you like to see?
  • Any problems?
  • Are the measures correct? (You can compare them with what you already have.)
  • Any other ideas?

Of course, you can comment below and/or use GitHub for bug reports, feature requests, and discussions…

Below, I summarize the two completed tools. They are best viewed on large screens because they use data tables, which are hard to fit on mobile screens. More information is available in the GitHub repos.

Common Voice Metadata Viewer

A serverless web app that visualizes all datasets in the CV metadata from the cv-dataset repo. You can see the total Common Voice sums and compare languages and/or a single language across versions, with tables and graphs (to see the graphs, you need to select a single language). It depends entirely on the data provided by CV and cannot provide a detailed analysis per dataset (which the next tool does).

This helps language communities see their status over time (e.g. changes in the female/male ratio, validated percentages, etc.) and plan for the future.

Repo: https://github.com/HarikalarKutusu/cv-tbox-metadata-viewer
Actual site: https://metadata.cv-toolbox.web.tr/
Beta test site: https://cv-metadata-viewer.netlify.app/

Common Voice Dataset Analyzer

A serverless web app that visualizes detailed information and (offline, pre-calculated) statistics on splits, alternative splitting algorithms, recording durations, and the current text corpus of a single dataset (i.e. language and version). Here, you can also compare different splitting algorithms with tables and graphs. It includes all final versions. [Old: Currently we can analyze all datasets between v8.0 and v11.0; if we can get .tsv files from earlier versions, we can also extend the calculations back toward the beginning, which would also enable analysis of user retention and similar time-dependent measures.]

This tool is mainly for language communities and for AI experts who will train models on a dataset. It lets you see detailed measures and their distributions so that you can check the health of your dataset: whether you have enough sentences or they are recorded repeatedly, whether they are short or long, your up- and down-vote rates, gender/age bias in the dataset, and much more. All of this comes with values, distributions, and graphs… You can also download tables as data and graphs as png files to include in your reports, papers, or theses if needed.
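As a rough illustration of the kind of measures mentioned above, here is a minimal sketch that derives a few of them from a Common Voice-style validated.tsv. The sample rows are made up, only a handful of columns are included, and the helper name `dataset_health` is hypothetical; it is not the Analyzer's actual code.

```python
import csv
import io
from collections import Counter

# Made-up sample in the layout of a validated.tsv (only a few columns).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\n"
    "u1\ta.mp3\tHello world\t2\t0\ttwenties\tfemale\n"
    "u2\tb.mp3\tHello world\t2\t1\tthirties\tmale\n"
    "u1\tc.mp3\tGood morning\t3\t0\ttwenties\tfemale\n"
    "u3\td.mp3\tSee you later\t2\t0\t\t\n"
)


def dataset_health(tsv_text: str) -> dict:
    """Compute a few simple health measures (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    # How many recordings exist for each sentence.
    per_sentence = Counter(row["sentence"] for row in rows)
    up = sum(int(row["up_votes"]) for row in rows)
    down = sum(int(row["down_votes"]) for row in rows)
    # An empty gender field means the speaker did not declare it.
    genders = Counter(row["gender"] or "undeclared" for row in rows)
    return {
        "recordings": len(rows),
        "unique_sentences": len(per_sentence),
        "repeated_sentences": sum(1 for n in per_sentence.values() if n > 1),
        "down_vote_rate": down / (up + down) if up + down else 0.0,
        "gender_counts": dict(genders),
    }


report = dataset_health(SAMPLE_TSV)
print(report)
```

In a real dataset the same counts would be broken down further (per split, per age group, as distributions), which is what the tables and graphs in the app present.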

Repo: https://github.com/HarikalarKutusu/cv-tbox-dataset-analyzer
Actual site: https://analyzer.cv-toolbox.web.tr/
Beta test site: https://cv-dataset-analyzer.netlify.app/


What will the final Toolbox look like?

I started the Toolbox as separate systems, to be merged later.

The tooling as a whole aims to improve dataset health and quality, and it will also help in creating de-biased splits and models. For this, however, we need a system where language communities can work as teams and annotate the datasets.

In the end, it will have:

  • Offline Python tooling (the “core”) for pre-calculations/preparations
  • A secure Node/Express server to serve the data
  • A secure and privacy-focused React frontend. To create projects, form teams, moderate data, etc., we must have secure logins like Common Voice does.

I’m currently halfway through implementing the “Moderator” shown below, which will also form the base for the client/server structure above. Then I’ll integrate the finalized Metadata Viewer and Dataset Analyzer into this structure.


(read here about the whole project)


Great stuff! The green area in the toolbox chart is of most interest to me.

What I would love to have is an API that I can use to query TTS models, which I could then integrate into our website (https://bagrat.space)

I’m glad you liked it.

Unfortunately, there is no way I can provide what you ask: this is related to the CV dataset and thus STT only, so there are no trained models, and it is currently serverless, so there is no API either :slight_smile: Unless somebody gives me a couple of million dollars for another Kaggle/HuggingFace-alike :smiley:

Having a public API for compiled results on the final structure is a good idea though…

BTW, I could not use Google Translate on your website to see what it does :confused: If you are after sentences and/or mp3 recordings from an API, I will not be able to provide them; they should be provided by CV. CV is against keeping the data on any servers other than theirs.


The metadata-viewer has been updated with the recently released v12.0 metadata.


The dataset-analyzer has been updated with detailed v12.0 data analysis.
In the meantime, I also added historical data from v1 to v7.0 so you can see the changes (except v2, which was superseded by v3 shortly after its release).


Apparently, I missed informing you about v13.0 updates…

Now, the v14.0 updates are online. This took a bit longer than expected because (1) I implemented upgrades from delta versions, (2) I used the durations from the distributions, and (3) the pandas v2.x upgrade needed some adjustments. So I spent a good amount of time on coding.

Here are the resources for your review:

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)


Updated tools to include Common Voice v15.0 datasets.

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)

As some of you might already know, the CV v16.0 datasets came out with some problematic data (zero-sized recordings and thus faulty duration data, which also affected the metadata), so v16.1 was released recently. Before I could inform you about v16.0, I had to update the web apps again with the Common Voice v16.1 datasets. Here they are:

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)


I updated the Metadata Viewer for v17.0. Here:

  • I had to revert values in the detailed gender info back to the originals to be comparable with prior versions (of course “other” became all zeros)
  • I added a basic text corpus page, which also gives basic domain info (no details for now).
    In the coming releases, when more data becomes available, I’ll add more detail.

Common Voice Metadata Viewer (Beta Mirror).

The Dataset Analyzer needs much more work because of the excellent changes that came in v17.0, especially to the text corpora. I need to rewrite the related code. In the previous version, a complete analysis took about 24 hours on my computer; I hope I can make it faster with the current release.

So, bear with me…


This is just such a massive undertaking and thank you so much for your continued work on this!


I finished rewriting the code for the Dataset Analyzer and published it on the beta site. I’ll add more tables/graphs (domains etc.) and continue to post to the beta site for now.

With the new sentence_id field (which is used to index sentences), the time for a complete analysis dropped from about 24 hours to under 4 hours. In addition, we will have a correct and full text-corpus analysis in upcoming releases (not yet).

There are a few downsides/changes though:

  1. There are bugs in the new validated_sentences.tsv, and I opened several issues on GitHub (see 1, 2 and 3 - the first one is critical). I tried to work around them in code to some extent, but could not cover all of them.
  2. For the earlier releases (<v17.0), we can only look up sentence_id values by matching the sentence text, but the sentences were pre-processed by CorporaCreator, so they may differ from the originals. I therefore could not recover the whole text corpus for these releases yet; I need to re-implement this in the code.
  3. And of course anything between v14.0 and v16.1 will be incomplete (as anything entered through the web interface/write is not there).
  4. One major change in my text-corpus analysis: until now I counted every sentence occurrence in the recordings of the buckets/splits, so I was effectively analyzing the voice corpus. Now I use unique sentences, regardless of how many times they were recorded, so I analyze the text corpus. This seems more logical.

I think we will get a better result with v18.0 when these are fixed.
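To make the last point of the list concrete, here is a minimal sketch of the difference between the two views, using made-up data. The tables, the `mean_words` helper, and the choice of "words per sentence" as the measure are all illustrative assumptions, not the Analyzer's code.

```python
from collections import Counter

# Made-up recordings: (clip path, sentence_id); "s1" is recorded three times.
recordings = [
    ("a.mp3", "s1"),
    ("b.mp3", "s1"),
    ("c.mp3", "s1"),
    ("d.mp3", "s2"),
]
# Made-up sentence table keyed by sentence_id.
sentences = {"s1": "Hello world", "s2": "Good morning everyone"}


def mean_words(texts):
    """Average number of whitespace-separated words per sentence."""
    texts = list(texts)
    return sum(len(t.split()) for t in texts) / len(texts)


# Voice-corpus view: a sentence counts once per recording of it.
voice_view = mean_words(sentences[sid] for _, sid in recordings)

# Text-corpus view: each sentence counts once, however often it was recorded.
unique_ids = {sid for _, sid in recordings}
text_view = mean_words(sentences[sid] for sid in unique_ids)

times_recorded = Counter(sid for _, sid in recordings)
print(voice_view, text_view, dict(times_recorded))
```

In the voice-corpus view, heavily recorded sentences pull every statistic toward themselves; the text-corpus view removes that weighting.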


Addition (April 9th):

I think I have now handled all errors in validated_sentences.tsv. Unfortunately, the same kind of malformed rows also exists in reported.tsv. I handled most of them with another custom parser, but not all.
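The post does not detail the malformations, so the following is only a general illustration of the idea of a tolerant reader: split each line on tabs by hand and set aside rows whose field count does not match the header, instead of letting a strict parser abort or silently merge lines. The sample data and the function name `read_tsv_tolerant` are hypothetical; this is not the parser used in the toolbox.

```python
def read_tsv_tolerant(text: str):
    """Split a TSV by hand and return (good_rows, bad_rows).

    A row is "bad" when its field count differs from the header's.
    Quote characters are treated as ordinary text, so an unbalanced
    quote cannot swallow the following lines.
    """
    lines = [ln for ln in text.split("\n") if ln.strip()]
    header = lines[0].split("\t")
    good, bad = [], []
    # line_no counts non-empty lines, with the header as line 1.
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) == len(header):
            good.append(dict(zip(header, fields)))
        else:
            bad.append((line_no, line))
    return good, bad


# Made-up sample with a reduced set of columns.
SAMPLE = (
    "sentence_id\tsentence\tsource\n"
    "id1\tA normal sentence.\tsome source\n"
    "id2\tA sentence with a stray\ttab inside.\tsome source\n"
    "id3\t\"An unbalanced quote.\tsome source\n"
)

good, bad = read_tsv_tolerant(SAMPLE)
print(len(good), "good rows;", len(bad), "set aside")
```

Rows that are set aside can then be inspected or repaired case by case, which matches the "most, but not all" outcome described above.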

I updated the main site with the new version. I also added results for the CorporaCreator -s 5 case (the s5 algorithm), where up to 5 recordings of the same sentence are allowed.
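As a minimal sketch of what "up to 5 recordings of the same sentence" means for the resulting data, the snippet below caps the number of recordings kept per sentence. The function name, the sample data, and the "keep the first ones in input order" rule are assumptions for illustration; how CorporaCreator actually selects recordings is not covered here.

```python
from collections import defaultdict


def cap_recordings_per_sentence(recordings, max_per_sentence=5):
    """Keep at most `max_per_sentence` recordings of each sentence, in input order."""
    kept = []
    seen = defaultdict(int)
    for clip, sentence in recordings:
        if seen[sentence] < max_per_sentence:
            kept.append((clip, sentence))
            seen[sentence] += 1
    return kept


# Made-up data: one sentence recorded 7 times, another recorded twice.
recordings = [(f"a{i}.mp3", "Hello world") for i in range(7)]
recordings += [(f"b{i}.mp3", "Good morning") for i in range(2)]

s1 = cap_recordings_per_sentence(recordings, max_per_sentence=1)
s5 = cap_recordings_per_sentence(recordings, max_per_sentence=5)
print(len(s1), len(s5))
```

A higher cap keeps more of the recorded audio in the splits at the cost of more repeated text, which is the trade-off the comparison tables are meant to show.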

Common Voice v18.0 update(s)

As of this posting, the following are missing:

  • The cv-dataset repo has not been updated, so I cannot process it for the Metadata Viewer.
  • Georgian v18.0 dataset is missing, so I could not include it.

A related change between v17.0 and v18.0 was the naming and structure of sentence domains. The names (contextual grouping) changed, and a sentence can now have up to 3 different domains, so I had to handle both changes.
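A minimal sketch of handling a multi-valued domain field follows. It assumes the domains of a sentence arrive in one field separated by commas; that separator, the placeholder domain names, and the `parse_domains` helper are assumptions for illustration, not a description of the actual file format or of the Analyzer's code.

```python
from collections import Counter

MAX_DOMAINS = 3  # per the post: up to 3 domains per sentence


def parse_domains(field: str, sep: str = ","):
    """Split a multi-valued domain field; the separator is an assumption."""
    domains = [d.strip() for d in field.split(sep) if d.strip()]
    if len(domains) > MAX_DOMAINS:
        raise ValueError(f"more than {MAX_DOMAINS} domains: {field!r}")
    return domains


# Made-up field values; the domain names are placeholders.
fields = ["general", "health,science", "", "science,general,finance"]

# A sentence with several domains is counted once under each of them.
counts = Counter(d for f in fields for d in parse_domains(f))
unlabeled = sum(1 for f in fields if not parse_domains(f))
print(counts, unlabeled)
```

Because a sentence can appear under several domains, per-domain counts no longer add up to the number of sentences, which any table or graph of domains has to account for.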

2024-06-26: I only updated the Dataset Analyzer beta site. I’ll update the main site once the Georgian dataset is released.

2024-07-04: Georgian v18.0 is out and cv-dataset repo is updated. So all tooling is updated for v18.0.
