Hi @kodyglock… Yes, it will work. After that, two native speakers should review and accept it (one can be yourself).
Sentences validated in the Sentence Collector are exported automatically every week, and are normally included in the main Common Voice site (record/listen) every two weeks.
Common Voice works better with conversational language, and many conversations include single-word questions or answers. It would be a shame not to include them.
It would not be wise to dump the whole dictionary, though.
Happy New Year!
Yes, it will work. After that, two native speakers should review and accept it (one can be yourself).
I would strongly advise against having single words. It makes the whole recording process super boring for contributors.
@kodyglock let’s add some fun and interesting sentences, so contributors recording them don’t have to feel like machines.
many conversations include single-word questions or answers. It would be a shame not to include them.
Yes, but for speech recognition these words do not need to appear alone in a recording. It’s much more fun when those words are part of a longer sentence.
We also recently removed single words from Cantonese as these were super boring.
I beg to differ.
I think the main reason people tended to include 5-10 word sentences in the past is that the language model in DeepSpeech (and in Coqui) uses 5-gram models.
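To make the 5-gram point concrete, here is a tiny, self-contained sketch (plain Python counting, not the actual KenLM pipeline DeepSpeech uses). A single-word utterance yields no 5-grams at all, so it contributes nothing to such a language model:

```python
from collections import Counter

def ngrams(tokens, n=5):
    """Return all n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up mini corpus for illustration
corpus = [
    "is the door open now",           # 5 tokens -> one 5-gram
    "please open the front door now", # 6 tokens -> two 5-grams
    "open",                           # single word -> zero 5-grams
]

counts = Counter()
for sentence in corpus:
    counts.update(ngrams(sentence.split(), n=5))

print(len(counts))  # 3 distinct 5-grams; "open" alone added none
```

This is only about the language model side; the single-word recordings still carry acoustic information.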
As I mentioned above, many utterances in our everyday conversations have fewer than 5 words, and many of those are single words. For example, when you are commanding a machine, or when you are asking a specific question and getting an answer…
There is nothing wrong with single words. Yes, you shouldn’t dump the whole vocabulary, but anything conversational will be OK IMHO.
- Coffee?
- Yes!
The paradigm has been shifting toward edge computing, where acoustic models gain importance and simple/short sentences are dominant.
We also recently removed single words from Cantonese as these were super boring.
Nope. I Google-translated them at the time: they were non-conversational, and only a few were single words; a couple of them could have been OK. Also, they were Mandarin…
The OP is asking about a single word, so boredom would not be a problem. If you examine them, many language datasets include single-word sentences.
What I would advise, though:
If you are adding many short sentences, you need to counterbalance them with longer ones: e.g., 1000 short ones plus 1000 longer sentences from a book. Mix them randomly before posting.
This is what I’ve been doing. I analyze every book/subset for this purpose. You can see my analyses for every resource I added here (in Turkish, but I point you to the tables):
Hey Kody,
Welcome to the community, and thanks for your questions. My name is Hillary; I’m the community manager on the Common Voice team at Mozilla.
I agree with everything Micheal has explained; the current guidelines in the community playbook focus on short and simple sentences rather than single-word sentences.
Following Micheal’s points, one way you could approach single-word sentences is this format:
For example, you could change “Erroneous ?” to…
"Erroneous?" she questioned in confusion.
Please let me know if this helps.
Many thanks,
Hillary
Thank you for your explanation. So Common Voice didn’t manage to extend the keywords in the “Single Word Dataset”, right? With IoT applications growing, it would be good to include more keywords suitable for voice commands, in my point of view.
For IoT applications that use single-word commands, you would usually need these commands to be recorded by thousands of different voices.
Using an acoustic model trained on the current dataset can be a solution if these words are part of the text corpus/vocabulary, repeated many times, and recorded many times. You can use a method where these words are specified as “hot-words”. The resulting model will have a higher error rate with respect to the proposed dedicated-corpus method.
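If I recall correctly, DeepSpeech 0.9+ and Coqui STT expose this as an `addHotWord(word, boost)` call on the model. Here is a toy, self-contained illustration of the underlying idea (the candidate transcripts and scores are made up, and a real decoder applies the boost during beam search, not as a rescoring pass afterwards):

```python
# Toy hot-word boosting: each candidate transcript has a decoder score
# (log-scale, higher is better); registered hot-words add a fixed boost
# per occurrence. All numbers below are invented for illustration.
hot_words = {"open": 5.0}  # word -> boost

candidates = {
    "is the door opened": -12.0,  # acoustically slightly preferred
    "is the door open": -12.5,    # contains the hot-word
}

def boosted(transcript, score):
    """Add the hot-word boost for every boosted word in the transcript."""
    return score + sum(hot_words.get(w, 0.0) for w in transcript.split())

best = max(candidates, key=lambda t: boosted(t, candidates[t]))
print(best)  # the hot-word tips the balance toward "is the door open"
```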
Saying “Open!” and “Is the door open now?” will result in different models because of the accent, though.
Common Voice does not have a mechanism for making people record the same sentences many times; on the contrary, it feeds the least-recorded ones to volunteers.
So adding your “command words” to the text corpus will not solve your problem either.
N-gram language models are also of no use for this single-command case, and TinyML-type applications cannot use such LMs due to hardware constraints.
You might want to find 1000+ people to record your limited command set as a separate voice corpus, as is commonly done. If you want more voice-assistant-like features, you would need CV datasets.
On the other hand, I also want Common Voice to become more general-purpose. That would require tagging and domain-specific corpus inclusion, which would mean major changes in the database and software. I recently suggested that.
I would like to hear about other possible methods, though. This is an area I want to work on; I tried it with 30 hours of CV data, with bad results.