I’ve been trying to categorize what I had in my mind with the provided options. I’m very interested to learn about the rationale used in selecting the domains provided. I have a feeling that these are somewhat related to the actual / current application areas, especially by big companies like nVidia.
Now I have some problems to actualize what I had in mind when suggesting this feature two years ago, such as:
- Machine command interfaces / intent systems (smart homes, IoT devices etc.)
- Subtitle generation / transcription / dictation / … in diverse areas (e.g. sports, fundamental sciences like math/chemistry/physics/material sciences/astronomy/…, many flavors of engineering, etc)
AFAIK, the main idea behind domain specific corpus is introduction of jargon, mostly technical terms, mostly Latin based - with correct spellings. These terms are mostly based on the higher education, so they should be written / spoken by people from that profession.
With the current list, if people do not know the idea behind it, they will tend to categorize them wrongly - if my above understanding is correct. For me it should work like this:
- GENERAL: “If you have a fever, take an aspirin.”
- HEALTHCARE: “Acetylsalicylic acid, is a nonsteroidal anti-inflammatory drug used to reduce pain, fever, and/or inflammation, and as an antithrombotic.” (taken from Wikipedia)
But, people will tend to set the first one also to “Healthcare”…
I became aware of this problem during my last teaching session, when I used the sentence “Do you have a room with a hottub?” where I was trying to use the word “jakuzi” (jakuzzi) to get better distinction between “j” and “ş” sounds in Turkish, where our models struggle. People directly said that this is “tourism” related thus, should be under “service”. I beg to differ. “I don’t have any cash with me.” is not finance.
Correct domain labeling will also help the language models, especially during setting pruning thresholds. E.g. “antithrombotic” will be pruned out in a general model, but not in a domain specific one.
I hope this list can be extended in time, even can have free-form / project based titles, so that people can create their own titles here. I also hope that the project team discusses such important features with communities before implementing them - like we did in the past.