[Community Feedback Request] Introducing Quality Tags for Common Voice Spontaneous Speech - We want to hear from you!

If you’ve already played around with Common Voice Spontaneous Speech, we would love to hear your thoughts on the brand‑new :sparkles: quality tags :sparkles: we’re rolling out with each language dataset! Our goal is to ensure the datasets the community builds are high-quality and easy to work with, so we’re including additional metadata to simplify data cleaning and surface potential issues.

What you can find on Spontaneous Speech version 1.0

For each row in the .tsv file we now include:

char_per_sec - how many characters of transcription per second of audio
quality_tags - automated quality checks of the audio/transcript pair, separated by |
    transcription-length - the audio/transcript pair is under 3 characters per second
    speech-rate - the audio/transcript pair is over 30 characters per second
    short-audio - audio length under 2 seconds
    long-audio - audio length over 30 seconds
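If you want to experiment with these fields, filtering on quality_tags takes only a few lines with the standard csv module. A minimal sketch — the .tsv path and the choice of which tags to exclude are placeholders for illustration, not part of the dataset spec:

```python
import csv

def load_clean_rows(path, exclude_tags=frozenset({"transcription-length", "speech-rate"})):
    """Yield rows whose quality_tags field contains none of exclude_tags.

    Assumes the tab-separated layout described above, with tags joined by '|'.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # An empty quality_tags field splits to [""], so drop empty strings.
            tags = set(filter(None, row.get("quality_tags", "").split("|")))
            if not tags & exclude_tags:
                yield row
```

Swapping the exclude set lets you keep, say, long-audio clips while dropping mismatched transcriptions.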

The transcription-length and speech-rate tags flag a mismatch between the audio clip’s length and the transcription’s length. In the first case, the transcription is too short for the matching audio clip (too few characters per second of audio); in the second (opposite) case, the transcription is too long for the audio clip (the implied speech rate is implausibly fast).
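In code, both checks reduce to a single characters-per-second ratio compared against the thresholds above. A sketch — the real pipeline’s text normalization (whitespace, punctuation, etc.) isn’t specified in this post, so plain character counts are assumed here:

```python
def rate_tags(transcript, audio_seconds):
    """Return (char_per_sec, tags) using the thresholds from the post:
    under 3 chars/sec -> transcription-length, over 30 -> speech-rate."""
    cps = len(transcript) / audio_seconds
    tags = []
    if cps < 3:
        tags.append("transcription-length")   # transcript too short for the audio
    elif cps > 30:
        tags.append("speech-rate")            # transcript too long / speech too fast
    return cps, tags
```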

Planned future tags for later versions (Preview)

non-allowed-script - Tag for transcriptions containing a writing system not associated with the language.
mixed-script-words - Tag for transcriptions containing multiple writing systems at the word/token level.
mixed-script-transcription - Tag for transcriptions containing multiple writing systems, but each word/token consistently uses a single script.
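As a rough illustration of how such script checks could work, here is a stdlib-only sketch that approximates each character’s script from its Unicode name. This is my own approximation, not the team’s actual implementation — a production check would use ICU or the Unicode Script property, and the per-language allowlist needed for non-allowed-script is omitted:

```python
import unicodedata

def scripts_in(word):
    """Approximate the set of scripts in a word via the first word of each
    character's Unicode name (e.g. LATIN, CYRILLIC). Crude but dependency-free."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def script_tags(transcript):
    """Return the first matching planned tag, or [] for a single-script transcript."""
    word_scripts = [s for s in (scripts_in(w) for w in transcript.split()) if s]
    # Multiple scripts inside one word/token:
    if any(len(s) > 1 for s in word_scripts):
        return ["mixed-script-words"]
    # Each word is single-script, but the transcript mixes scripts overall:
    all_scripts = set().union(*word_scripts) if word_scripts else set()
    if len(all_scripts) > 1:
        return ["mixed-script-transcription"]
    return []
```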

These tags also let us generate quick overview tables, so we can get a feel for the overall quality of each language dataset at a glance.
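One way to build such an overview yourself — assuming a hypothetical per-language .tsv path — is to count clips per tag:

```python
import csv
from collections import Counter

def tag_overview(path):
    """Return (total clip count, Counter of quality tags) for one language .tsv.

    A clip with multiple '|'-separated tags is counted once per tag.
    """
    counts, total = Counter(), 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            total += 1
            for tag in filter(None, row.get("quality_tags", "").split("|")):
                counts[tag] += 1
    return total, counts
```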

Your input matters!

  1. Usefulness – Do you find these tags helpful for cleaning, analysis, or model training?

  2. Clarity – Anything confusing or ambiguous in the definitions?

  3. Missing pieces – Are there other quality signals you’d like to see?

Feel free to reply below, or send me a personal message!

Thanks for shaping the next iteration of Common Voice Spontaneous Speech :speaking_head: :heart:


Hello! Thank you for this new tool to play with. I haven’t encountered mixed-script issues in my languages in Spontaneous Speech. There is, however, another issue that I would love to be able to tag: people who misunderstand the assignment and read the question aloud instead of answering it. Those recordings still have some value, so I don’t want to report them and see them removed, but I would like some way to flag them so the number of “different” entries in the dataset is not misleading. Could those entries even be moved into the Scripted Speech datasets?

Thank you for the great work as always, team.
