I’m implementing an open-source toolbox for Common Voice and have nearly finished the first two modules, which are mainly oriented toward statistics and visualization.
The projects cover each and every language Common Voice has. I hope they will be helpful and usable for every community, and save you some of the time you spend searching for the relevant information or processing your data.
I have implemented most of the ideas I had in mind, and I need your feedback on them:
- What else would you like to see?
- Any problems?
- Are the measures correct? (You can compare them with what you already have.)
- Any other ideas?
Of course, you can comment below and/or use GitHub for bug reports, feature requests, and the discussions area.
Below, I summarize the two completed tools. They are best viewed on large screens because they use data tables, which are hard to fit on mobile screens. More information is available in the GitHub repos.
Common Voice Metadata Viewer
A serverless WebApp to visualize all datasets in the CV metadata from the commonvoice-dataset repo. You can see the Common Voice totals and compare languages and/or a single language across versions, with tables and graphs (to see the graphs, you need to select a single language). It depends entirely on the data provided by CV and cannot provide a detailed analysis per dataset (which the next tool does).
This is useful for language communities that want to see how their status changes over time (e.g. the female/male ratio, validated percentages, etc.) and plan for the future.
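To make the kind of comparison concrete, here is a minimal Python sketch of tracking two such measures across versions. It is only an illustration, not the viewer's actual code: the field names (`total_hours`, `valid_hours`, `female`, `male`) and all the numbers are made up for the example and do not come from the real CV metadata.

```python
# Hypothetical per-version statistics for one language.
# Field names and values are invented for illustration only.
versions = {
    "v10.0": {"total_hours": 80.0, "valid_hours": 60.0, "female": 0.20, "male": 0.60},
    "v11.0": {"total_hours": 100.0, "valid_hours": 80.0, "female": 0.30, "male": 0.60},
}


def trend(stats):
    """Return validated percentage and female/male ratio for each version."""
    result = {}
    for version, s in stats.items():
        result[version] = {
            "validated_pct": round(100 * s["valid_hours"] / s["total_hours"], 1),
            "female_male_ratio": round(s["female"] / s["male"], 2),
        }
    return result


summary = trend(versions)
for version, measures in summary.items():
    print(version, measures)
```

Reading the two rows side by side is the whole point: a community can see whether validation is keeping up with new recordings and whether the gender balance is moving in the direction they want.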
Common Voice Dataset Analyzer
A serverless WebApp to visualize detailed information and (offline, pre-calculated) statistics on splits, alternative splitting algorithms, recording durations, and the current text corpus of a single dataset (i.e. language and version). Here, you can also compare different splitting algorithms with tables and graphs. Currently we can analyze all datasets between v8.0 and v11.0, but if we can get the *.tsv files from earlier versions, we can extend the calculations back toward the first release, which would also enable us to analyze user retention and similar time-dependent measures.
This tool is mainly for language communities and AI experts who will train models on a dataset. It lets you see detailed measures and their distributions so that you can check the health of your dataset, e.g. whether you have enough sentences or the same ones are recorded repeatedly, whether they are short or long, your up- and down-vote rates, gender/age bias in the dataset, and much more, all with values, distributions, and graphs. You can also download tables as data and graphs as PNG files to include in your reports, papers, or theses if needed.
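For readers who want a feel for what such health measures look like, here is a minimal Python sketch that computes a few of them from tab-separated data. It is not the analyzer's actual code. It assumes columns named `client_id`, `sentence`, `up_votes`, `down_votes`, and `gender`, as I would expect in a Common Voice `validated.tsv`, and it runs on a tiny invented sample rather than on a real dataset.

```python
import csv
import io
from collections import Counter

# Tiny invented sample in the tab-separated layout assumed above.
SAMPLE_TSV = (
    "client_id\tsentence\tup_votes\tdown_votes\tgender\n"
    "a1\tHello world\t2\t0\tfemale\n"
    "b2\tHello world\t2\t1\tmale\n"
    "a1\tGood morning\t3\t0\tfemale\n"
    "c3\tHow are you\t2\t0\t\n"
)


def dataset_health(tsv_file):
    """Compute a few simple health measures from an open TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    per_sentence = Counter(row["sentence"] for row in rows)
    # An empty gender field is counted as "unspecified".
    genders = Counter(row["gender"] or "unspecified" for row in rows)
    up = sum(int(row["up_votes"]) for row in rows)
    down = sum(int(row["down_votes"]) for row in rows)
    return {
        "recordings": len(rows),
        "unique_sentences": len(per_sentence),
        "repeated_sentences": sum(1 for n in per_sentence.values() if n > 1),
        "gender_counts": dict(genders),
        "down_vote_rate": down / (up + down) if (up + down) else 0.0,
    }


report = dataset_health(io.StringIO(SAMPLE_TSV))
for name, value in report.items():
    print(name, value)
```

To try it on a real file, you could pass `open("validated.tsv", encoding="utf-8", newline="")` instead of the in-memory sample, provided the column names match the assumption above.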
What will the final Toolbox look like?
I started the Toolbox as separate systems that will be merged later.
The whole toolset is meant to improve dataset health and quality, but it will also help in creating de-biased splits and models. For that, however, we need a system where language communities can work as teams and annotate the datasets.
At the end it will have:
- Offline Python tooling (the “core”) for pre-calculations and preparations
- A secure Node/Express server to serve the data
- A secure and privacy-focused React frontend. To create projects, form teams, moderate data, etc., we must have secure logins like Common Voice does.
I’m currently about halfway through implementing the “Moderator” shown below, which will also form the base for the above client/server structure. Then I’ll integrate the finalized Metadata Viewer and Dataset Analyzer into this structure.
(read here about the whole project)