Disfluencies

Libra · March 10, 2026, 10:47am

According to the guidelines:

Transcribe → General guidance → Writing down disfluencies, including hesitations and repetitions

At the same time:

Transcribe → Special Tags → [disfluency] A filler word or sound used as a placeholder whilst a speaker decides what to say. In English, some common hesitation sounds are “err”, “um”, “huh”, etc.

Does this mean I should just use the [disfluency] tag every time I encounter a disfluency, without trying to recognize exactly which word was used?

[disfluency], I don’t [disfluency] know [disfluency] what I can add here.

Or, if I can recognize them, should I write them out as words and use this tag only when I can’t recognize the sound or it is non-standard?

U-um, I don’t, huh, know, [disfluency], what I can add here.

Or even something like that?

[disfluency|U-um], I don’t, [disfluency|huh], know, [disfluency|um], what I can add here.

What should be used/preferred?

bozden · March 10, 2026, 11:37am

Hi @Libra, as you are aware, the tagging part in SPS was not very well defined, partially intentionally, to see what people would do. As you might know, natural speech datasets are rather new, and MCV was also experimenting to find a methodology that would be valid for hundreds of languages and diverse users.

From the dataset consumer’s point of view, one needs to know which parts of the “speech” do not really correspond to real speech (i.e. meaningful words). There was also a need for consistency, which is why there are 4 tags, and we asked for English wording (knowing the problems with non-Latin keyboards)…

After the initial release came out, the data was analyzed and a QA (Quality Assurance) script was written to pre-process the data before bundling it. Part of this pipeline replaced transcriptions like “err” and “ummm” with the corresponding tags, i.e. `[disfluency]`.
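As a rough illustration of that kind of replacement step, here is a minimal Python sketch. It is not the actual QA script: the filler pattern and the name `tag_fillers` are assumptions made for this example, and the pattern only covers a few unambiguous English sounds mentioned in this thread.

```python
import re

# Hypothetical filler pattern (an assumption, not the real QA mapping):
# "um"/"ummm", "er"/"err" and "huh", matched as whole words only.
FILLER_RE = re.compile(r"\b(?:u+m+|e+r+|huh)\b", re.IGNORECASE)


def tag_fillers(text: str) -> str:
    """Replace whole-word filler sounds with the [disfluency] tag."""
    return FILLER_RE.sub("[disfluency]", text)


if __name__ == "__main__":
    print(tag_fillers("Ummm, I don't, err, know what I can add here."))
```

Because the pattern is anchored on word boundaries, ordinary words that merely contain these letters (for example "summer") are left alone. Context-dependent words are deliberately not handled here.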

In the last few weeks, we embedded that pipeline into the actual release process as a pipeline step, so it is no longer a separate post/pre-processing pass.

We also upgraded the editors in the transcription interface, so when somebody presses the bracket character, a popover appears and the user can select a tag. The tag is then inserted in English.

So, we expect any disfluency to be tagged as disfluency.

I guess this would be the correct usage, but I’m sure users will still be creative

Libra · March 11, 2026, 9:07pm

Hi, do you mean contributors or dataset users?

Does it mean that in the final dataset version you automatically replace these words with the [disfluency] tag in the transcriptions that include them? If so, I have a new batch of questions and topics:

How do you choose which words should be replaced? For instance, the words ‘well’ or ‘so’ can be used either as disfluencies or as ordinary meaningful words depending on context, while ‘um’ or ‘er’ usually can’t.
Does it work only with English, or will it be used with other languages as well?
I think disfluencies and their word transcriptions are essential as well, but maybe less important for most dataset users. This is why I started to use the [disfluency|disfluency transcription] syntax in Russian. If someone is interested in disfluency transcriptions, they can use them. If they aren’t, they can fully ignore/delete the part after the | symbol and use just the [disfluency] tag. But if my practice is undesirable, then of course I will stop it.
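To make the idea concrete, here is a small Python sketch of how a dataset user could handle that annotated form. The `[disfluency|...]` syntax is only my own convention from this thread, not an official format, and the function names are made up for the example.

```python
import re

# Matches the unofficial "[disfluency|transcription]" form and captures
# the part after the | symbol.
ANNOTATED_TAG_RE = re.compile(r"\[disfluency\|([^\]]*)\]")


def strip_annotations(text: str) -> str:
    """Drop the part after | and keep only the plain [disfluency] tag."""
    return ANNOTATED_TAG_RE.sub("[disfluency]", text)


def extract_annotations(text: str) -> list[str]:
    """Return the transcribed disfluencies in order of appearance."""
    return ANNOTATED_TAG_RE.findall(text)


if __name__ == "__main__":
    line = "[disfluency|U-um], I don't, [disfluency|huh], know"
    print(strip_annotations(line))
    print(extract_annotations(line))
```

Plain `[disfluency]` tags without a | part pass through both functions unchanged.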

How does it influence the dataset and the project?

Of course! Especially when it is not clearly specified in the guidelines XD

bozden · March 11, 2026, 10:17pm

I meant contributors. In the first phase of the data collection, most of the transcribers were paid or volunteering professionals.

The QA team scanned all 58 languages for such occurrences and created a mapping, which also included typos (e.g. disfluency is not an easy word to type)…
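The real mapping was built by scanning the data. As a loose illustration only, fuzzy matching from Python's standard library can catch simple tag typos. In this sketch `KNOWN_TAGS` lists just the one tag named in this thread, and `normalize_tags` and the 0.8 cutoff are assumptions, not part of the actual pipeline.

```python
import difflib
import re

# Only the tag named in this thread; the other tags are omitted on purpose.
KNOWN_TAGS = ["disfluency"]

# A bracketed tag without a | part, e.g. "[disluency]".
TAG_RE = re.compile(r"\[([^\]|]+)\]")


def normalize_tags(text: str, cutoff: float = 0.8) -> str:
    """Rewrite near-miss spellings of known tags; leave other brackets alone."""

    def fix(match: re.Match) -> str:
        name = match.group(1).strip().lower()
        close = difflib.get_close_matches(name, KNOWN_TAGS, n=1, cutoff=cutoff)
        return f"[{close[0]}]" if close else match.group(0)

    return TAG_RE.sub(fix, text)


if __name__ == "__main__":
    print(normalize_tags("I don't know, [disluency], what I can add here."))
```

Bracketed text that is not close to any known tag is returned as typed, so a reviewer can still inspect it by hand.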

I think nowadays they should just type “[” and select from the popover menu (which will be translated if it has been translated in Pontoon)…

I meant that very few people read the guidelines; Homo sapiens has become a species with a 4-second attention span.

Actually, I guess it would not. It will behave the same; only the place changed, which made it more efficient for us.

Why don’t you join the monthly community meetings to talk about this stuff? We all want to make it better.

Libra · March 23, 2026, 4:21pm

Maybe, but it could be helpful to show something like “Thanks for your interest! Please read the guidelines before you start contributing” when someone opens spontaneous speech for the first time, or the transcribe/check tabs specifically, because it is more critical there than while recording.

First of all, I didn’t know they existed.

After that, I’m not sure I will have the time, but at least I can try. Where and when do you have them?

bozden · March 23, 2026, 4:30pm

Yes, we will be redesigning the workflow to make sure they get read.

The meetings are announced in these channels (Discourse & Matrix).

For immediate talks, come to the Matrix chat.
