Special tags in spontaneous speech mode

Transcribe → General guidance → Labeling noise events like coughing or laughing

How should other noises be labeled, given that only 4 special tags are listed right now:
[laugh], [disfluency], [unclear], [noise]? Does that mean I should add a new special tag such as [coughing]?

If yes, how should I do that? Should I just add the tag, as in a folksonomy, or do I need to suggest it here or somewhere else and wait until it is approved and implemented (you have frontend translation for special tags, which won’t be possible with random user tags)? If I can define a new tag myself, can it be in my native language, or should it be in English only?

If the answer is no, how can I label other noises? Should I modify the [noise] special tag in some way, for example [noise-coughing], [noise_coughing] or [noise|coughing]?

Believe it or not, when SPS first came out as an alpha, I asked for a “[cough]” tag, as a person who coughs a lot.

There are no guidelines for this, but here is what I can suggest:

  • Do not overcomplicate it: a dataset user would just skip those parts most of the time.
  • The [xyz] will be the main detector
  • I’d use a `[cough]` label even if it is not defined.
  • Open an issue as a feature request for new tags if needed.

There are some papers on arXiv (and elsewhere) that detail these. We all need to read those for future plans, I guess…

:slight_smile:

Of course I will not. In >95% of cases the tags that are already defined are enough for me. I just want to know what to do if I need others one day. In addition, it doesn’t make much sense to add many new tags while there is no full list of used tags with explanations, like the one in the guidelines. So I don’t see any reason to do that right now.

Sorry, but what do you mean by that?

They definitely exist. The main problem, as far as I know, is that the tag list always depends on the project goals, which I’m not sure are well defined yet.

If you start to read or collect them, I think it would be good to have a separate topic here for collecting and sharing articles and papers that are relevant to CommonVoice :upside_down_face:

Exactly.

I mean, anything between […] would be regarded as a tag (i.e. not representing real speech, but something else). The important thing here is that tags are systematic, whereas e.g. “errr” and “ummm” are not (how many “m”s are enough?).
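To make the "anything in brackets is a tag" idea concrete, here is a minimal Python sketch. It assumes a tag is any bracket-free text between `[` and `]`, and it uses the four tags listed earlier in this thread as the known set. The names `KNOWN_TAGS`, `TAG_PATTERN` and `find_tags` are made up for illustration and are not part of any Common Voice tooling.

```python
import re

# The four special tags mentioned in this thread; anything else counts as undefined.
KNOWN_TAGS = {"laugh", "disfluency", "unclear", "noise"}

# Assumption: a tag is any run of non-bracket characters between [ and ].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(transcript):
    """Return (known, undefined) tag names found in a transcript, in order."""
    known, undefined = [], []
    for name in TAG_PATTERN.findall(transcript):
        if name in KNOWN_TAGS:
            known.append(name)
        else:
            undefined.append(name)
    return known, undefined


known, undefined = find_tags("so [laugh] I said [cough] ummm no")
print(known)      # tags from the known set
print(undefined)  # tags a dataset user would still detect, but that are not defined
```

A free-form filler like "ummm" is invisible to this kind of detector, which is why a systematic bracketed form is easier to work with.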

In many cases (e.g. ASR training), these parts will be cut out, for instance during forced alignment. But what if a researcher is working on how people laugh?
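Both uses can be sketched in a few lines of Python. This is only an illustration at the text level (real pipelines would also cut the audio, e.g. via forced alignment), and `strip_tags` and `has_tag` are hypothetical helper names, not functions from any existing toolkit.

```python
import re

# Assumption: a tag is any run of non-bracket characters between [ and ].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def strip_tags(transcript):
    """Remove all bracketed tags and collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", transcript)
    return " ".join(without_tags.split())


def has_tag(transcript, tag):
    """Return True if the transcript contains the given tag, e.g. 'laugh'."""
    return f"[{tag}]" in transcript


# ASR-style use: drop the tags and keep only the spoken words.
print(strip_tags("so [laugh] I said [cough] no"))

# Laughter-research use: keep only the transcripts that contain [laugh].
transcripts = ["ha [laugh] ha", "nothing special", "[noise] hello"]
print([t for t in transcripts if has_tag(t, "laugh")])
```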

I think (except for the missing cough :smiley:) these would be enough for such researchers.