Common Voice Toolbox: Beta testers and feature requests needed

bozden · November 4, 2024, 8:20am

Direct Links:
Common Voice Metadata Viewer (Beta test site)
Common Voice Dataset Analyzer (Beta test site)

I’m implementing an open-source toolbox for Common Voice, and have nearly finished the first two modules, which are mainly statistics and visualization oriented.

The project(s) include each and every language Common Voice has. I hope they will be helpful and usable for each community, and save you some of the time you spend searching for relevant info or processing your data.

I finished most of the ideas I have in mind, and I need your feedback on them:

What more do you want?
Any problems?
Are the measures correct? (You can compare them with what you already have.)
Any other ideas?

Of course, you can comment below and/or use the GitHub repos for bug reports, feature requests, and discussions…

Below, I summarize the two completed tools. They are best viewed on large screens as they use data tables, which are hard to fit into mobile screens. More info is on github repos.

Common Voice Metadata Viewer

A serverless WebApp to visualize all datasets in CV metadata from commonvoice-dataset repo. You can see the total Common Voice sums, compare languages and/or a single language across versions, with tables and graphs (to see the graphs, you need to select a single language). It fully depends on the data provided by CV and cannot provide detailed analysis per dataset (which the next tool does).

This is useful for language communities to see how their status changes over time (e.g. change in female/male ratio, validated percentages etc.) and to plan for the future.

Repo: https://github.com/HarikalarKutusu/cv-tbox-metadata-viewer
Actual site: https://metadata.cv-toolbox.web.tr/
Beta test site: https://cv-metadata-viewer.netlify.app/

Common Voice Dataset Analyzer

A serverless WebApp to visualize detailed information and (offline pre-calculated) statistics on splits, alternative splitting algorithms, recording durations and the current text-corpus of a single dataset (i.e. language & version). Here, you can also compare different splitting algorithms with tables and graphs. Includes all final versions. [Old: Currently we can analyze all datasets between v8.0 and v11.0, but if we can get .tsv files from earlier versions, we can also calculate back towards the beginning, which would also enable us to analyze user retention and similar time-dependent measures.]

This tool is mainly for language communities and AI experts who will train models on a dataset. It lets you see detailed measures and their distributions so that you can check the health of your dataset, e.g. whether you have enough sentences or they are recorded repeatedly, whether they are short or long, your up- and down-vote rates, gender/age bias in the dataset, and many more. All with values, distributions and graphs… You can also download tables as data and graphs as png files to include in your reports/papers/theses if needed.

Repo: https://github.com/HarikalarKutusu/cv-tbox-dataset-analyzer
Actual site: https://analyzer.cv-toolbox.web.tr/
Beta test site: https://cv-dataset-analyzer.netlify.app/

What will the final Toolbox look like?

I started the Toolbox as separate systems, to be merged later.

The whole tooling is for improving dataset health and quality, but it will also help create de-biased splits and models. For this, though, we need a system where language communities can work as teams and annotate the datasets.

At the end it will have:

An offline python tooling (the “core”) for pre-calculations/preparations
A secure node/express server to serve the data
A secure and privacy focused react frontend, but to create projects, form teams, moderate data etc we must have secure logins like Common Voice does.

I’m currently halfway through implementing the “Moderator” shown below, which will also form the base for the above client/server structure. Then I’ll integrate the finalized Metadata Viewer and Dataset Analyzer into this structure.

(read here about the whole project)

daniel.abzakh · December 6, 2022, 6:33pm

Great stuff! The green area in the toolbox chart is of most interest for me.

What I would love to have is an api that I can use to query on TTS models, then I could integrate that in our website (https://bagrat.space)

bozden · December 6, 2022, 7:20pm

I’m glad you liked it.

But unfortunately there is no way I can provide what you ask, as this is CV dataset related, thus STT only - so no trained models, and it is currently serverless, so no API etc. Unless somebody provides me with a couple of $M for another Kaggle/HuggingFace-alike.

Having a public API for compiled results on the final structure is a good idea though…

BTW, I cannot G-translate your website to see what it is doing. If you are after sentences and/or mp3 recordings from an API, I will not be able to provide them; they should be provided by CV. CV is against keeping the data on servers other than theirs.

bozden · December 19, 2022, 10:39am

The metadata-viewer has been updated with the recently released v12.0 metadata.

bozden · December 28, 2022, 3:11pm

The dataset-analyzer has been updated with detailed v12.0 data analysis.
In the meantime, I also added historic data from v1 to v7.0 for you to see the changes (except v2, which was superseded by v3 shortly after its release).

bozden · July 7, 2023, 2:12pm

Apparently, I missed informing you about v13.0 updates…

Now, v14.0 updates are online. This took a bit longer than expected because (1) I implemented upgrades from delta versions, (2) I used the durations from distributions, and (3) the pandas v2.x upgrade needed some adjustments. So I spent a good amount of time on coding.

Here are the resources for your review:

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)

bozden · September 15, 2023, 12:54pm

Updated tools to include Common Voice v15.0 datasets.

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)

bozden · January 8, 2024, 9:47pm

As some of you might already know, CV v16.0 datasets came out with some problem data (zero-sized recordings and thus missing duration data, also affecting the metadata), so v16.1 was released recently. Before I could inform you about v16.0, I had to re-update the webapps with Common Voice v16.1 datasets. Here they are:

Common Voice Metadata Viewer (Beta Mirror).

Common Voice Dataset Analyzer (Beta Mirror)

bozden · March 22, 2024, 10:12pm

I updated the Metadata Viewer for v17.0. Here:

I had to revert values in the detailed gender info back to the originals to be comparable with prior versions (of course “other” became all zeros)
I added a basic text corpus page, which also gives basic domain info (no details for now)
In the coming releases, when more data becomes available, I’ll detail them.

Common Voice Metadata Viewer (Beta Mirror).

The Dataset Analyzer needs much more work because of the excellent changes which came in v17.0, especially in the text corpora. I need to re-write the related code. In the previous version, a complete analysis took about 24 hours on my computer; I hope I can make it faster with the current release.

So, bear with me…

jesslynnrose · March 26, 2024, 3:58pm

This is just such a massive undertaking and thank you so much for your continued work on this!

bozden · April 9, 2024, 11:33am

I finished re-writing the code for the Dataset Analyzer and published it on the BETA site. I’ll add more tables/graphs (domains etc) and continue to post to the beta site for now.

With the sentence_id field (which is used to index sentences), my total calculation time dropped from 24 hours to under 4 hours. In addition, we will have a correct/full text-corpora analysis in the upcoming releases (not yet).

There are a few downsides/changes though:

There are bugs in the new validated_sentences.tsv and I opened several issues on GitHub (See 1, 2 and 3 - the first one is critical). I tried to remedy them in code to some extent, but could not cover all of them.
For the former releases (<v17.0), we can only get sentence_id’s using sentences, but the sentences got pre-processed in CorporaCreator, so they can have changes. So I could not get the whole text-corpus for these for now, I need to re-implement these in the code.
And of course anything between v14.0 - v16.1 will be incomplete (as anything entered through the web interface/write is not there).
One major change in my text-corpus analysis: I had been using every sentence behind the recordings in the buckets/splits, so what I got was the voice corpus. Now I use unique sentences, regardless of how many times they are recorded - so I now analyze the text-corpus. This seems more logical.
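To make the voice-corpus vs. text-corpus distinction concrete, here is a minimal sketch. It is not the analyzer's actual code: the rows, the `voice_corpus` and `text_corpus` helpers, and the sample sentences are all hypothetical, and the only thing taken from the post is the idea of de-duplicating by `sentence_id`.

```python
from collections import Counter

# Hypothetical rows, one per recording in a split: (sentence_id, sentence).
recordings = [
    ("s1", "Hello world."),
    ("s2", "Good morning."),
    ("s1", "Hello world."),
    ("s3", "How are you?"),
    ("s1", "Hello world."),
]


def voice_corpus(rows):
    """Old approach: one sentence per recording, repeats included."""
    return [sentence for _, sentence in rows]


def text_corpus(rows):
    """New approach: each sentence once, keyed by its sentence_id."""
    unique = {}
    for sentence_id, sentence in rows:
        # Keep the first text seen for each sentence_id.
        unique.setdefault(sentence_id, sentence)
    return list(unique.values())


voice = voice_corpus(recordings)
text = text_corpus(recordings)

# How often each sentence was recorded (useful for "repeatedly recorded" checks).
recording_counts = Counter(sentence_id for sentence_id, _ in recordings)
```

Any statistic computed on `voice` is weighted by how often a sentence was recorded, while the same statistic on `text` counts every sentence once, which is why the two analyses can give different numbers for the same split.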

I think we will get a better result with v18.0 when these are fixed.

Addition (April 9th):

I think I managed to handle all errors in validated_sentences.tsv. Unfortunately, the same malformed rows also exist in reported.tsv. I could handle most of them with another custom parser, but not all.

I updated the main site with the new version. I also added results for the CorporaCreator -s 5 case (s5 algorithm), where up to 5 recordings of the same sentence are allowed.
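As a rough illustration of what "up to N recordings of the same sentence" means, the sketch below caps the number of clips kept per sentence. It does not reproduce CorporaCreator's actual selection or its train/dev/test splitting; the function name, the `(clip_path, sentence_id)` row layout and the "first come, first kept" rule are assumptions made only for this example.

```python
from collections import defaultdict


def cap_recordings_per_sentence(rows, max_per_sentence):
    """Keep at most max_per_sentence clips per sentence_id, in input order.

    rows: iterable of (clip_path, sentence_id) tuples (hypothetical layout).
    """
    seen = defaultdict(int)
    kept = []
    for clip_path, sentence_id in rows:
        if seen[sentence_id] < max_per_sentence:
            seen[sentence_id] += 1
            kept.append((clip_path, sentence_id))
    return kept


# Hypothetical data: sentence s1 was recorded 7 times, s2 twice.
clips = [(f"clip_{i}.mp3", "s1") for i in range(7)]
clips += [(f"clip_{i}.mp3", "s2") for i in range(7, 9)]

s1_like = cap_recordings_per_sentence(clips, 1)  # one clip per sentence
s5_like = cap_recordings_per_sentence(clips, 5)  # up to five clips per sentence
```

With a cap of 1 every sentence contributes a single clip; with a cap of 5 heavily repeated sentences still contribute more data, at the cost of more repetition in the resulting set.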

bozden · July 4, 2024, 8:18pm

Common Voice v18.0 update(s)

These are missing as of posting this:

cv-dataset repo is not updated, so I cannot process it for Metadata Viewer
Georgian v18.0 dataset is missing, so I could not include it.

Between v17.0 and v18.0 a related change was the naming and structure of sentence domains. Namings (contextual grouping) changed, and we can now add up to 3 different domains, so I had to deal with these.

2024-06-26: I only updated the Dataset Analyzer beta site, I’ll update the main site with the release of Georgian dataset.

2024-07-04: Georgian v18.0 is out and cv-dataset repo is updated. So all tooling is updated for v18.0.

bozden · November 4, 2024, 8:16am

Common Voice v19.0 Update(s)

2024-09-20: Updates to Metadata Viewer

Updated data. Also, the localized language names are back now that the bug is fixed.
Added a “Version Delta” page for you to see what changed between versions. I find it useful in my line of work.

2024-10-03: Updated Dataset Analyzer beta site with v19.0. There are still some discrepancies (some new datasets with an empty validated set show up as if they have splits).

This time I used delta release workflow and implemented a delta_merge.py script (followed by CorporaCreator -s 1 algorithm) to create default scripts in the cv-tbox-split-maker.
Finished the below mentioned audio analysis, but I only have raw data now, I have to add statistics, visualization, check results etc. More work is needed there…

PS: Dataset Analyzer will take some time, as I’m working on large amount of audio for errors, VAD, and SNR calculations to report as statistics.

2024-11-01: Updated Dataset Analyzer beta site with Audio Analysis statistics. I’ll be checking the results and will extend that page a bit more. Feel free to comment and/or request new measures/statistics. In addition to these, I have the following to share:

Full and/or dataset based list of clip errors (tarfile, PyAV & torchaudio - tagged in this order)
List of audio specs for “problem” recordings (no VAD/speech detected and recordings with low SNR - which means very noisy). I checked some of them, and the detection seems to be OK, but it will depend on your application. Most of them can be intelligible for humans, but can be bad for machines.

Here is the zip to download. It has a single file for all errors tagged with lc & version, along with reason (at what stage the error came out). Under ver/lc directories, you can find audio_bad_*.tsv files (one per validated, invalidated, other), so that you can see which of these “bad recordings” got downvoted and moved to “invalidated”.

The statistics are given on the Dataset Analyzer, Audio Analysis tab. I mainly used torchaudio & silero-VAD (with default values) in the process.

If we look at ca v19.0 for example:

We have no clip reading errors (errors cease to exist after v16.1)
Many “No VAD” recordings (silero-vad did not detect speech - with default settings) in all “clips”. Most of them got downvoted by the community and now reside in “invalidated”, but there are many more in “other”, waiting to be “invalidated”.
There are some “Low Power” recordings (power of speech part, mainly whispering - I just counted all with power 1e-6 or less), again many of them are invalidated, but some passed into validated.
There are “Low SNR” recordings (SNR is simply a relative estimate, calculated as log10(vad_power/silence_power), where power is sum-squares of individual samples’ amplitudes). As many devices today have noise removal hardware and/or software (maybe at driver level), we just cannot use “more mathematically correct methods/measures” because it is not a controlled environment. We just counted all recordings with SNR value below 0 (negatives), which would mean the noise is higher than the speech. We did NOT subtract the noise floor from speech part, as most devices cut background or technical noises -like hissing- when they detect speech.
The “Duration” measured here might be slightly different than the duration given by Common Voice in clip_durations.tsv file, because each library can use different algorithms for measuring. You can compare it with the Duration tab.
"VAD Duration" is just the length of speech part (silero-vad puts 30ms pre/post leaders to detected speech).
"VAD %" is speech duration divided by total duration in percent, and it is usually between 50-70%. This excludes silences at the beginning and at the end of the recordings, plus any silence between words which is larger than 100ms (e.g. commas, breathing in etc).
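The measures in this list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind the Dataset Analyzer: `audio_measures` and its return keys are hypothetical names, the speech segments are assumed to come from some VAD as non-overlapping `(start, end)` sample indices, and "power" is taken literally as the sum of squared samples as written above (whether the real pipeline normalizes by length is not stated). The 1e-6 and SNR < 0 thresholds are the ones quoted in the list.

```python
import math

LOW_POWER_THRESHOLD = 1e-6  # "power 1e-6 or less", as quoted above


def sum_squares(samples):
    """Power as described above: sum of squared sample amplitudes."""
    return sum(s * s for s in samples)


def audio_measures(samples, speech_segments, sample_rate=16000):
    """Compute the measures for one clip.

    samples: list of floats in [-1.0, 1.0].
    speech_segments: non-overlapping (start, end) sample indices, end
        exclusive, as a VAD might return them (assumed input format).
    """
    is_speech = [False] * len(samples)
    for start, end in speech_segments:
        for i in range(max(start, 0), min(end, len(samples))):
            is_speech[i] = True

    speech = [s for s, flag in zip(samples, is_speech) if flag]
    silence = [s for s, flag in zip(samples, is_speech) if not flag]

    vad_power = sum_squares(speech)
    silence_power = sum_squares(silence)

    # Relative estimate only: log10(vad_power / silence_power).
    # Undefined when either part is empty or has zero power.
    snr = None
    if vad_power > 0 and silence_power > 0:
        snr = math.log10(vad_power / silence_power)

    no_vad = len(speech) == 0
    return {
        "duration": len(samples) / sample_rate,
        "vad_duration": len(speech) / sample_rate,
        "vad_percent": 100 * len(speech) / len(samples) if samples else 0.0,
        "no_vad": no_vad,
        # Assumption: a clip with no speech counts as "No VAD", not "Low Power".
        "low_power": (not no_vad) and vad_power <= LOW_POWER_THRESHOLD,
        "snr": snr,
        "low_snr": snr is not None and snr < 0,
    }


# Synthetic clip: quiet noise, a louder "speech" part, quiet noise again.
clip = [0.01] * 100 + [0.5] * 200 + [0.01] * 100
measures = audio_measures(clip, [(100, 300)])
```

Because the noise floor is not subtracted, a negative `snr` here simply means the non-speech part carries more energy than the detected speech part, which matches the "noise is higher than the speech" reading in the list.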

Many distributions for those values can be seen when you expand a bucket/split.

It took me about two months to get these, so I hope it has some use.

2024-11-04: Updated the main Dataset Analyzer site (see top post).