I just started reviewing sentences in my mother tongue, Norwegian (Bokmål). There is an abundance of one-word sentences. Should I just approve all of these, or should they be rejected?
Please see the following conversations/views:
Although it is not official CV policy, we think they are perfectly fine, as long as they are conversational and not a dump of a whole dictionary.
Such words are limited in number, say in the thousands, while the corpus will grow to millions of sentences, so in the long run they will disappear among the rest. As I pointed out, it is best to mix them with longer ones.
It seems like someone has added sentences from a public domain source with a script or something. There are also a bunch of sentences that read, e.g., “one seventy-two nine eight four”.
If you want recognition of numbers, these are good to add… They would be boring to record or listen to if repeated, so they must be mixed with other sentences. That is why dumping generated sentences is not encouraged.
E.g. I recently implemented a script to pre-process OCR’d books with number conversions, and it recognized page numbers. I left them in as valid, in addition to dates (years). One in 20-30 will be OK…
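To illustrate the "mix them in" idea, here is a minimal sketch, not the script mentioned above. It assumes the sentences are already in a Python list; the names `mix_sentences`, `is_single_word` and the `every` parameter are made up for this example. It puts one filler sentence (by default, a single word) after every `every - 1` regular sentences, so roughly one in `every` is a filler.

```python
import random


def is_single_word(sentence: str) -> bool:
    """Hypothetical default filter: treat one-word lines as fillers."""
    return len(sentence.split()) <= 1


def mix_sentences(sentences, every=25, is_filler=is_single_word, seed=0):
    """Spread filler sentences so that about one in `every` is a filler.

    Returns (mixed, leftover): the mixed list, plus the fillers that
    did not fit and could be held back for a later batch.
    """
    if every < 2:
        raise ValueError("every must be at least 2")

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    fillers = [s for s in sentences if is_filler(s)]
    regular = [s for s in sentences if not is_filler(s)]
    rng.shuffle(fillers)
    rng.shuffle(regular)

    mixed = []
    filler_iter = iter(fillers)
    for i, sentence in enumerate(regular, start=1):
        mixed.append(sentence)
        # After every (every - 1) regular sentences, add one filler.
        if i % (every - 1) == 0:
            filler = next(filler_iter, None)
            if filler is not None:
                mixed.append(filler)

    leftover = list(filler_iter)
    return mixed, leftover
```

With `every=25` this lands in the middle of the "one in 20-30" range. To mix in number sentences instead of single words, pass a different `is_filler` function.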
The sentences have been dumped from this source: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ .
So I guess it should be quite optimal. It is just sort of weird that most “sentences” (thus far) are not sentences. It is quite exhausting to skip through hundreds of one-word sentences.
I don’t understand the language. The license is OK and it is prepared for ASR, which is also OK.
How many are there? It is not ideal to have too many non-sentence single words or utterances appearing one after another.
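If someone has the sentence file at hand, a rough count is easy to get. The sketch below assumes one sentence per line; the function name `single_word_stats` is made up for this example. It reports the total number of sentences, how many are a single word, and the longest run of single-word sentences in a row.

```python
def single_word_stats(sentences):
    """Return (total, single_word_count, longest_run) for an iterable of lines.

    Blank lines are skipped and do not break a run.
    """
    total = single = run = longest_run = 0
    for line in sentences:
        line = line.strip()
        if not line:
            continue
        total += 1
        if len(line.split()) == 1:
            single += 1
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0  # a multi-word sentence ends the current run
    return total, single, longest_run
```

For a file, something like `single_word_stats(open("sentences.txt", encoding="utf-8"))` would work, where `sentences.txt` is a placeholder name.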
Sorry for the noise. It seems it was only the beginning of the dataset that consisted of one-word sentences and written-out number sentences.
How many people need to approve each sentence before it gets submitted btw?
More noise: how about sentences that don’t really make sense? Like “Don’t forget to eat the dishes.”
Both in Sentence Collector and Common Voice Listen, 2 votes are needed to accept or reject, whichever comes first.
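To make the "whichever comes first" part concrete, here is a small sketch of the rule as described. It is only an illustration, not the actual Sentence Collector or Common Voice code; the function name `review_status` and the `needed` parameter are made up for this example.

```python
def review_status(votes, needed=2):
    """Walk through votes in order (True = approve, False = reject).

    The first side to reach `needed` votes decides the outcome.
    """
    approvals = rejections = 0
    for vote in votes:
        if vote:
            approvals += 1
        else:
            rejections += 1
        if approvals >= needed:
            return "accepted"
        if rejections >= needed:
            return "rejected"
    return "pending"  # neither side has enough votes yet
```

So a sentence with one approval and one rejection stays pending until a third reviewer votes.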
I read some conversations about such nonsensical sentences (at the start of the project), and it was decided not to include them.
Please search the forum to read more.