"Sentences" with only one word

internetman · June 3, 2022, 1:07am

I just started reviewing sentences in my mother tongue, Norwegian (Bokmål). There is an abundance of one-word sentences. Should I just approve all of these, or should they be rejected?

bozden · June 3, 2022, 1:21am

Please see the following conversations/views:

Although it is not official CV policy, we think they are perfectly fine, as long as they are conversational and not a dump of a whole dictionary.

The number of such words is limited, say in the thousands, while the corpus will grow to millions of sentences. In the long run they will be lost in the crowd. As I pointed out, it is best to mix them with longer ones.
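To illustrate what "mixing" could look like, here is a minimal sketch in Python. It is not part of any Common Voice tooling; the function name `spread_short_entries` and the sample sentences are made up for the example, and it assumes a plain whitespace split is good enough to count words.

```python
def spread_short_entries(sentences):
    """Return the sentences reordered so that single-word entries are
    spaced out between longer ones instead of appearing in a row.

    Hypothetical helper for illustration only.
    """
    # Split the input by word count; empty strings are dropped.
    short = [s for s in sentences if len(s.split()) == 1]
    longer = [s for s in sentences if len(s.split()) > 1]
    if not short:
        return longer

    # How many longer sentences to emit before each single-word entry.
    gap = max(1, len(longer) // len(short))

    mixed = []
    remaining_short = iter(short)
    for count, sentence in enumerate(longer, start=1):
        mixed.append(sentence)
        if count % gap == 0:
            entry = next(remaining_short, None)
            if entry is not None:
                mixed.append(entry)

    # If there are more single words than gaps, the rest go at the end.
    mixed.extend(remaining_short)
    return mixed


sample = [
    "Yes",
    "No",
    "Where are you?",
    "I am here now.",
    "It is raining today.",
    "See you tomorrow.",
]
mixed = spread_short_entries(sample)
```

With only a few thousand single words against a much larger pool of full sentences, the gap grows by itself, so a reviewer would rarely see two of them back to back.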

internetman · June 3, 2022, 1:22am

It seems like someone has added sentences from a public domain source with a script or something. There are also a bunch of sentences that read, e.g., “one seventy-two nine eight four”.

bozden · June 3, 2022, 1:27am

If you want recognition of numbers, these are good to add… But they would be boring to record or listen to if repeated, so they must be mixed in with other sentences. That is why dumping generated sentences is discouraged.

E.g. I recently implemented a script to pre-process OCR’d books with number conversions, and it recognized page numbers. I left them in as valid, in addition to dates (years). One in 20-30 will be OK…

internetman · June 3, 2022, 1:33am

The sentences have been dumped from this source: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ .

So I guess it should be quite optimal. It’s just sort of weird that most “sentences” (thus far) are not sentences. It’s quite exhausting to skip through hundreds of one-word sentences.

bozden · June 3, 2022, 1:39am

I don’t understand the language. The license is OK and it is prepared for ASR, which is also OK.

How many are there? It is not ideal to have too many non-sentence single words or utterances appearing one after another.

internetman · June 3, 2022, 1:42am

Sorry for the noise. It seems it was only the beginning of the dataset that consisted of one-word sentences and written-out number sentences.

How many people need to approve each sentence before it gets submitted btw?

internetman · June 3, 2022, 1:46am

More noise: how about sentences that don’t really make sense? Like “Don’t forget to eat the dishes.”

bozden · June 3, 2022, 1:47am

Both in Sentence Collector and Common Voice Listen, 2 votes are needed to accept or reject, whichever comes first.
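As a rough sketch of that rule in Python (this is not the actual Sentence Collector or Common Voice code; the function name `decide` and the threshold parameter are assumptions made for the example):

```python
def decide(votes, threshold=2):
    """Resolve a sentence or clip from votes in the order they were cast.

    `votes` is an iterable of booleans: True = approve, False = reject.
    Whichever side reaches `threshold` first wins; otherwise the item
    stays pending. Hypothetical helper for illustration only.
    """
    approvals = 0
    rejections = 0
    for vote in votes:
        if vote:
            approvals += 1
        else:
            rejections += 1
        # Stop at the first side to reach the threshold.
        if approvals >= threshold:
            return "accepted"
        if rejections >= threshold:
            return "rejected"
    return "pending"


first = decide([True, False, True])
second = decide([False, False, True])
third = decide([True])
```

Because the check runs after every vote, later votes cannot overturn a decision: the third vote in the second example is never counted.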

bozden · June 3, 2022, 1:50am

I read some conversations about such meaningless sentences (from the start of the project), and it was decided not to include them.

bozden · June 3, 2022, 1:51am

E.g. see: Validating meaningless sentences in the Sentence Collector?

Please search the forum to read more.
