Domain tagging metadata coming to Common Voice

jesslynnrose · February 6, 2024, 10:26am

Good morning all, excited to be bringing you one more change to how we will soon handle metadata on Common Voice.

We’ve had a lot of requests from dataset users for the ability to classify sentences and their associated voice clips by topic, so we’re introducing an update to the metadata we hold against sentences and their linked clips.

We’ll soon allow uploaded and contributed sentences to be tagged with one domain tag from the list below:

```
General
Agriculture and Food
Automotive and Transport
Finance
Service and Retail
Healthcare
History, Law and Government
Media and Entertainment
Nature and Environment
News and Current Affairs
Technology and Robotics
Language Fundamentals (e.g. Digits, Letters, Money)
```
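For dataset users who want to check tags on their side, here is a minimal sketch of the "one domain tag from the list" rule. The tag names are taken from the list above; the `DOMAINS` set and the `validate_domain` helper are hypothetical names used for illustration, not Common Voice's actual implementation or file format.

```python
# Hypothetical sketch: tag names come from the announced list, but the
# helper below is illustrative and not part of Common Voice itself.
DOMAINS = frozenset({
    "General",
    "Agriculture and Food",
    "Automotive and Transport",
    "Finance",
    "Service and Retail",
    "Healthcare",
    "History, Law and Government",
    "Media and Entertainment",
    "Nature and Environment",
    "News and Current Affairs",
    "Technology and Robotics",
    "Language Fundamentals",
})


def validate_domain(tag: str) -> str:
    """Return the tag with surrounding whitespace removed, or raise if it is not listed."""
    cleaned = tag.strip()
    if cleaned not in DOMAINS:
        raise ValueError(f"unknown domain tag: {tag!r}")
    return cleaned
```

Calling `validate_domain("Healthcare")` returns the tag unchanged, while an unlisted value such as `"Tourism"` raises `ValueError`.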

More information in this handy blog post.

bozden · February 6, 2024, 1:36pm

Thank you for this! We greatly appreciate this feature.

bozden · February 24, 2024, 6:20pm

I’ve been trying to fit what I had in mind into the provided options, and I’m very interested to learn the rationale behind the selection of these domains. I have a feeling that they are tied to current application areas, especially those of big companies like NVIDIA.

I’m now having trouble placing what I had in mind when I suggested this feature two years ago, such as:

Machine command interfaces / intent systems (smart homes, IoT devices etc.)
Subtitle generation / transcription / dictation / … in diverse areas (e.g. sports, fundamental sciences like math/chemistry/physics/material sciences/astronomy/…, many flavors of engineering, etc)

AFAIK, the main idea behind a domain-specific corpus is to introduce jargon: mostly technical terms, mostly Latin-based, with correct spellings. These terms are mostly acquired through higher education, so the sentences should be written and spoken by people from that profession.

With the current list, people who do not know the idea behind it will tend to categorize sentences wrongly, if my understanding above is correct. For me it should work like this:

GENERAL: “If you have a fever, take an aspirin.”
HEALTHCARE: “Acetylsalicylic acid is a nonsteroidal anti-inflammatory drug used to reduce pain, fever, and/or inflammation, and as an antithrombotic.” (taken from Wikipedia)

But people will tend to label the first one as “Healthcare” too…

I became aware of this problem during my last teaching session, when I used the sentence “Do you have a room with a hot tub?”. I was trying to use the word “jakuzi” (jacuzzi) to get a better distinction between the “j” and “ş” sounds in Turkish, where our models struggle. People immediately said that this is “tourism” related and thus should be under “service”. I beg to differ. “I don’t have any cash with me.” is not finance.

Correct domain labeling will also help the language models, especially when setting pruning thresholds. E.g. “antithrombotic” will be pruned out of a general model, but not out of a domain-specific one.
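To make the pruning point concrete, here is a toy sketch in Python. It assumes the simplest form of pruning, dropping words whose count in the corpus is below a threshold. Real language model toolkits usually prune n-grams rather than single words, and the sentences, the `prune_vocab` helper, and the threshold of 2 are all made up for illustration.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpora (illustrative only): the rare term appears once in the
# general corpus but repeatedly in the domain-specific one.
general = [
    "if you have a fever take an aspirin",
    "take an aspirin for the fever",
    "aspirin is an antithrombotic",
]
healthcare = [
    "aspirin is an antithrombotic",
    "an antithrombotic reduces clotting",
    "the patient takes an antithrombotic",
]

general_vocab = prune_vocab(general, min_count=2)
healthcare_vocab = prune_vocab(healthcare, min_count=2)

print("antithrombotic" in general_vocab)     # pruned from the general vocabulary
print("antithrombotic" in healthcare_vocab)  # kept in the domain vocabulary
```

With the same threshold, the term is dropped from the general vocabulary and survives in the healthcare one, which is why mislabeled sentences would blur the counts that the threshold depends on.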

I hope this list can be extended over time, and could even allow free-form / project-based titles, so that people can create their own titles here. I also hope that the project team discusses such important features with the communities before implementing them, like we did in the past.

bozden · May 18, 2024, 3:55am

While working on ideas for sentence domain classification models, I remembered a rather general list on Wikipedia, which reflects knowledge accumulated over the years. Maybe it could be a starting point:

bozden · May 19, 2024, 12:57pm

Another possible categorization is in DMOZ:
