If you’ve already played around with Common Voice Spontaneous Speech, we would love to hear your thoughts on the brand-new quality tags we’re rolling out with each language dataset! Our goal is to ensure the datasets the community builds are high-quality and easy to work with, so we’re including additional metadata to simplify data cleaning and surface potential issues.
What you can find in Spontaneous Speech version 1.0
For each row in the .tsv file we now include:
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |. The possible tags are:
transcription-length - the audio/transcript pair is under 3 characters per second
speech-rate - the audio/transcript pair is over 30 characters per second
short-audio - audio length under 2 seconds
long-audio - audio length over 30 seconds
The tags transcription-length and speech-rate flag a mismatch between the length of an audio clip and the length of its transcription. transcription-length means the transcription is too short for the clip (too few characters per second of audio). speech-rate means the opposite: the transcription is too long for the clip (the implied speech rate is too fast).
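To make the thresholds above concrete, here is a minimal Python sketch of how the four tags could be derived for one clip. It is not the actual Common Voice pipeline: the function names are made up for illustration, and it assumes that every character of the transcription (including spaces) is counted, that the comparisons are strict, and that tags are emitted in the order listed above.

```python
def char_per_sec(transcription: str, duration_s: float) -> float:
    """Characters of transcription per second of audio.

    Assumption: all characters, including spaces, are counted.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcription) / duration_s


def quality_tags(transcription: str, duration_s: float) -> str:
    """Return the tags for one clip as a '|'-separated string."""
    cps = char_per_sec(transcription, duration_s)
    tags = []
    if cps < 3:            # too little text for the audio length
        tags.append("transcription-length")
    if cps > 30:           # too much text for the audio length
        tags.append("speech-rate")
    if duration_s < 2:     # audio shorter than 2 seconds
        tags.append("short-audio")
    if duration_s > 30:    # audio longer than 30 seconds
        tags.append("long-audio")
    return "|".join(tags)
```

A clip with no issues gets an empty string, so filtering on an empty quality_tags value would keep only the untagged rows under these assumptions.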
Planned tags for later versions (Preview)
non-allowed-script - Tag for transcriptions containing a writing system not associated with the language.
mixed-script-words - Tag for transcriptions containing multiple writing systems at the word/token level.
mixed-script-transcription - Tag for transcriptions containing multiple writing systems, but each word/token consistently uses a single script.
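Since these tags are still planned, the following is only a rough sketch of how the three definitions could fit together, not how they will be implemented. It assumes whitespace tokenization and uses the first word of a character's Unicode name (for example LATIN or CYRILLIC) as a crude stand-in for its script, which misclassifies some characters such as fullwidth letters. The function names and the allowed-scripts set are hypothetical.

```python
import unicodedata


def char_script(ch: str):
    """Rough script guess: first word of the Unicode character name.

    This is a heuristic, not a real Unicode script lookup.
    Non-letters (digits, punctuation) return None and are ignored.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_tags(transcription: str, allowed: set) -> list:
    """Return the script-related tags for one transcription."""
    scripts_seen = set()
    mixed_word = False
    for word in transcription.split():  # assumption: whitespace tokens
        word_scripts = {s for s in map(char_script, word) if s}
        if len(word_scripts) > 1:
            mixed_word = True
        scripts_seen |= word_scripts

    tags = []
    if scripts_seen - allowed:
        tags.append("non-allowed-script")
    if mixed_word:
        tags.append("mixed-script-words")
    elif len(scripts_seen) > 1:
        # several scripts overall, but each word sticks to one
        tags.append("mixed-script-transcription")
    return tags
```

In this sketch mixed-script-words and mixed-script-transcription are treated as mutually exclusive, which matches the wording of the definitions but is a guess about the intended behavior.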
These tags also let us generate quick overview tables, so we can get a “feel” for the overall quality of each language dataset at a glance. For example, it could look something like this:
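If you want to build a similar overview for your own copy of a dataset, a minimal Python sketch could count how often each tag appears in the quality_tags column. The column name comes from the description above; the clip column, the "(no tags)" label, and the sample rows are made up for illustration, and the real .tsv layout may differ.

```python
import csv
from collections import Counter


def tag_overview(tsv_lines):
    """Count tag occurrences across the rows of a .tsv file.

    tsv_lines: an open file or any iterable of lines, header first.
    Returns (total_rows, Counter mapping tag -> number of rows).
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    total = 0
    for row in reader:
        total += 1
        tags = [t for t in (row.get("quality_tags") or "").split("|") if t]
        # Rows without any tag are counted under a made-up label.
        counts.update(tags or ["(no tags)"])
    return total, counts


# Made-up sample rows, only to show the call.
sample = [
    "clip\tchar_per_sec\tquality_tags",
    "a.mp3\t12.5\t",
    "b.mp3\t2.1\ttranscription-length|long-audio",
    "c.mp3\t35.0\tspeech-rate",
]
total, counts = tag_overview(sample)
for tag, n in counts.most_common():
    print(f"{tag}\t{n}\t{n / total:.0%}")
```

For a real file, passing an open file handle (opened with newline="" and UTF-8 encoding) in place of the sample list should work the same way.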
Your input matters!
- Usefulness – Do you find these tags helpful for cleaning, analysis, or model training?
- Clarity – Anything confusing or ambiguous in the definitions?
- Missing pieces – Are there other quality signals you’d like to see?
Feel free to reply below, or send me a personal message!
Thanks for shaping the next iteration of Common Voice Spontaneous Speech!
