Common Voice related toolbox: Beta testers and feature requests needed

I’m implementing an open-source toolbox for Common Voice, and I have nearly finished the first two modules, which are mainly statistics and visualization oriented.

The projects cover every language Common Voice has. I hope they will be helpful and usable for each community, and save you some of the time you spend searching for relevant information or processing your data.

I have implemented most of the ideas I had in mind, and I need your feedback on them:

  • What else would you like to see?
  • Are there any problems?
  • Are the measures correct? (You can compare them with what you already have.)
  • Any other ideas?

Of course, you can comment below and/or use GitHub for bug reports, feature requests, and discussions.

Below, I summarize the two completed tools. They are best viewed on large screens because they use data tables, which are hard to fit on mobile screens. More information is in the GitHub repos.

Common Voice Metadata Viewer

A serverless web app that visualizes all datasets in the CV metadata from the commonvoice-dataset repo. You can see the Common Voice totals and compare languages and/or a single language across versions, with tables and graphs (to see the graphs, you need to select a single language). It depends entirely on the data provided by CV and cannot provide detailed per-dataset analysis (which the next tool does).

This is useful for language communities to see how their status changes over time (e.g. the female/male ratio, validated percentages, etc.) and to plan for the future.

Repo: https://github.com/HarikalarKutusu/cv-tbox-metadata-viewer
Beta preview: https://cv-metadata-viewer.netlify.app/

Common Voice Dataset Analyzer

A serverless web app that visualizes detailed information and (offline pre-calculated) statistics on splits, alternative splitting algorithms, recording durations, and the current text corpus of a single dataset (i.e. language & version). Here, you can also compare different splitting algorithms with tables and graphs. Currently we can analyze all datasets between v8.0 and v11.0. If we can get the *.tsv files from earlier versions, we can extend the calculations back towards the first releases, which would also let us analyze user retention and similar time-dependent measures.

This tool is mainly for language communities and AI experts who will train models on a dataset. It lets you see detailed measures and their distributions so that you can check the health of your dataset, e.g. whether you have enough sentences or the same ones are recorded repeatedly, whether they are short or long, your up/down-vote rates, gender/age bias in the dataset, and much more, all with values, distributions, and graphs. You can also download the tables as data and the graphs as PNG files to include in your reports, papers, or theses if needed.
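To give a rough idea of the kind of measures meant here, the sketch below computes a few of them (repeated sentences, gender counts, down-vote rate) from tab-separated clip metadata. It is only an illustration, not the analyzer's actual code: the column names are assumed to follow the layout of a Common Voice `validated.tsv` file, the sample rows are made up, and `dataset_health` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column names are an assumption based on the
# layout of a Common Voice validated.tsv file.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\n"
    "a1\tclip1.mp3\tHello world\t2\t0\ttwenties\tfemale\n"
    "a1\tclip2.mp3\tGood morning\t2\t1\ttwenties\tfemale\n"
    "b2\tclip3.mp3\tHello world\t3\t0\tthirties\tmale\n"
    "c3\tclip4.mp3\tHello world\t2\t0\t\t\n"
)


def dataset_health(tsv_text):
    """Return a few simple health measures for one dataset (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)

    # How often each sentence was recorded, and who recorded (by gender).
    sentence_counts = Counter(row["sentence"] for row in rows)
    gender_counts = Counter(row["gender"] or "unspecified" for row in rows)

    # Share of down-votes among all votes cast.
    up = sum(int(row["up_votes"]) for row in rows)
    down = sum(int(row["down_votes"]) for row in rows)

    return {
        "clips": len(rows),
        "unique_sentences": len(sentence_counts),
        "repeated_sentences": {s: n for s, n in sentence_counts.items() if n > 1},
        "gender_counts": dict(gender_counts),
        "down_vote_rate": down / (up + down) if (up + down) else 0.0,
    }


if __name__ == "__main__":
    for name, value in dataset_health(SAMPLE_TSV).items():
        print(f"{name}: {value}")
```

For a real dataset you would read the text of the downloaded `validated.tsv` instead of `SAMPLE_TSV`; empty gender fields are counted as "unspecified" here, which is a choice made for this sketch.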

Repo: https://github.com/HarikalarKutusu/cv-tbox-dataset-analyzer
Beta preview: https://cv-dataset-analyzer.netlify.app/


What will the final Toolbox look like?

I started the Toolbox as separate systems that will be merged later.

The whole tooling aims to improve dataset health and quality, but it will also help in creating de-biased splits and models. For this, however, we need a system where language communities can work as teams and annotate the datasets.

In the end, it will have:

  • Offline Python tooling (the “core”) for pre-calculations/preparations
  • A secure Node/Express server to serve the data
  • A secure, privacy-focused React frontend. To create projects, form teams, moderate data, etc., it must have secure logins like Common Voice does.

I’m currently halfway through implementing the “Moderator” shown below, which will also form the base for the above client/server structure. Then I’ll integrate the finalized Metadata Viewer and Dataset Analyzer into this structure.


(read here about the whole project)


Great stuff! The green area in the toolbox chart is of most interest to me.

What I would love to have is an API that I can use to query TTS models, so that I could integrate it into our website (https://bagrat.space).

I’m glad you liked it.

Unfortunately, there is no way I can provide what you ask: this is related to the CV dataset, thus STT only, so there are no trained models, and it is currently serverless, so there is no API, etc. :slight_smile: Unless somebody provides me with a couple of $M for another Kaggle/HuggingFace-alike :smiley:

Having a public API for compiled results in the final structure is a good idea, though…

BTW, I cannot Google-translate your website to see what it does :confused: If you are after sentences and/or mp3 recordings from an API, I will not be able to provide them; they would have to be provided by CV. CV is against keeping the data on any servers except theirs.


The metadata-viewer has been updated with the recently released v12.0 metadata.


The dataset-analyzer has been updated with detailed v12.0 data analysis.
In the meantime, I also added historical data from v1 to v7.0 so you can see the changes (except v2, which was quickly superseded by v3).
