Domain tagging metadata coming to Common Voice

Good morning all, excited to be bringing you one more change to how we will soon handle metadata on Common Voice.

We’ve had a lot of requests from dataset users for the ability to classify sentences and their associated voice clips by topic, so we’re introducing an update to the metadata we hold against sentences and their linked clips.

We’ll soon allow uploaded and contributed sentences to be tagged with one domain tag from the list below:

  • General
  • Agriculture and Food
  • Automotive and Transport
  • Finance
  • Service and Retail
  • Healthcare
  • History, Law and Government
  • Media and Entertainment
  • Nature and Environment
  • News and Current Affairs
  • Technology and Robotics
  • Language Fundamentals (e.g. Digits, Letters, Money)
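
For dataset users who want to check tags on their side, here is a minimal sketch of the "one domain tag per sentence" rule. The tag names come from the list above; the `tag_sentence` helper and the record layout (`sentence`, `domain` keys) are hypothetical and are not the actual Common Voice schema.

```python
# Domain tags as listed in the announcement.
DOMAIN_TAGS = (
    "General",
    "Agriculture and Food",
    "Automotive and Transport",
    "Finance",
    "Service and Retail",
    "Healthcare",
    "History, Law and Government",
    "Media and Entertainment",
    "Nature and Environment",
    "News and Current Affairs",
    "Technology and Robotics",
    "Language Fundamentals",
)


def tag_sentence(sentence, domain):
    """Attach exactly one domain tag to a sentence (hypothetical helper).

    Raises ValueError if the tag is not in the announced list.
    """
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"Unknown domain tag: {domain!r}")
    # Assumed record layout, for illustration only.
    return {"sentence": sentence, "domain": domain}


record = tag_sentence("I don't have any cash with me.", "General")
```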

More information in this handy blog post.


Thank you for this! We greatly appreciate this feature :star_struck:

I’ve been trying to fit what I had in mind into the provided options, and I’m very interested to learn the rationale behind the choice of domains. I have a feeling that they are related to current application areas, especially those of big companies like NVIDIA.

I’m having trouble placing some of the things I had in mind when I suggested this feature two years ago, such as:

  • Machine command interfaces / intent systems (smart homes, IoT devices etc.)
  • Subtitle generation / transcription / dictation / … in diverse areas (e.g. sports, fundamental sciences like math/chemistry/physics/material sciences/astronomy/…, many flavors of engineering, etc.)

AFAIK, the main idea behind a domain-specific corpus is to introduce jargon: mostly technical terms, mostly Latin-based, with correct spellings. These terms mostly come from higher education, so they should be written and spoken by people from that profession.

If my understanding above is correct, people who do not know the idea behind the current list will tend to categorize sentences wrongly. For me it should work like this:

  • GENERAL: “If you have a fever, take an aspirin.”
  • HEALTHCARE: “Acetylsalicylic acid is a nonsteroidal anti-inflammatory drug used to reduce pain, fever, and/or inflammation, and as an antithrombotic.” (taken from Wikipedia)

But people will tend to tag the first one as “Healthcare” too…

I became aware of this problem during my last teaching session, when I used the sentence “Do you have a room with a hot tub?”. I was trying to use the word “jakuzi” (jacuzzi) to get a better distinction between the “j” and “ş” sounds in Turkish, where our models struggle. People immediately said that this is “tourism” related and thus should be under “service”. I beg to differ. “I don’t have any cash with me.” is not finance.

Correct domain labeling will also help language models, especially when setting pruning thresholds. E.g. “antithrombotic” will be pruned out of a general model, but not out of a domain-specific one.
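
To make the pruning point concrete, here is a minimal sketch, assuming a plain count-based cutoff on single words (real language model toolkits prune n-grams with more elaborate criteria). The toy corpora, the `prune_vocab` helper, and the threshold values are all invented for illustration.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only words seen at least `min_count` times (hypothetical helper)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpora, made up for this example.
general = [
    "if you have a fever take an aspirin",
    "take an aspirin if you have a fever",
    "aspirin is an antithrombotic",
]
healthcare = [
    "aspirin is an antithrombotic",
    "an antithrombotic reduces clot formation",
]

# Assumed thresholds: a stricter cutoff for the general model,
# a looser one for the domain-specific model.
general_vocab = prune_vocab(general, min_count=2)
healthcare_vocab = prune_vocab(healthcare, min_count=1)

print("antithrombotic" in general_vocab)     # rare in general text, pruned
print("antithrombotic" in healthcare_vocab)  # kept in the domain model
```

With reliable domain labels, the rare term survives in the model built from the healthcare subset even though it falls below the cutoff in the general one. If a sentence like “If you have a fever, take an aspirin.” is tagged as Healthcare, the domain subset fills up with everyday vocabulary and this separation gets weaker.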

I hope this list can be extended over time, and can even include free-form / project-based titles, so that people can create their own. I also hope that the project team discusses such important features with the communities before implementing them, like we did in the past.

When working on ideas for sentence domain classification models, I remembered a rather general list on Wikipedia, which reflects knowledge accumulated over the years. It might be a starting point for this:

Another possible categorization is in DMOZ: