Separate sentences into categories

daniel.abzakh · October 26, 2021, 5:29am

Hello all,

Would it be beneficial to split the sentences into the following categories:

Sentences without nouns and without digits (all nouns and noun phrases replaced with pronouns).
Sentences that are only nouns or noun phrases (technically not sentences).
Sentences that are only digits (technically not sentences).

The idea is to keep the dataset as generic as possible, so it could be split into main.tsv, name.tsv and digit.tsv.

An example:

He went to that place, he loved it.
John, Sara Brown, New York.
Twenty two, a hundred and fifty.
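
To make the proposed split concrete, here is a rough sketch of how sentences might be sorted into the three files. It is only an illustration under strong assumptions: `NUMBER_WORDS`, `CONNECTORS`, `categorize` and `split_sentences` are hypothetical names, the number-word list is incomplete, and the "every word is capitalised" check is a crude stand-in for a real part-of-speech tagger.

```python
import re

# Hypothetical, deliberately incomplete list of English number words.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred",
    "thousand", "million",
}
# Words allowed inside a spoken number that are not numbers on their own.
CONNECTORS = {"a", "and"}


def tokenize(sentence):
    """Return alphabetic words and digit runs, dropping punctuation."""
    return re.findall(r"[A-Za-z']+|\d+", sentence)


def categorize(sentence):
    """Pick a target file name for one sentence (heuristic only)."""
    words = tokenize(sentence)
    lowered = [w.lower() for w in words]
    is_number = [w.isdigit() or w in NUMBER_WORDS for w in lowered]
    # Digit-only: at least one number, and nothing but numbers and connectors.
    if any(is_number) and all(
        n or w in CONNECTORS for n, w in zip(is_number, lowered)
    ):
        return "digit.tsv"
    # Assumption: a sentence where every word is capitalised is a list of names.
    if words and all(w[0].isupper() for w in words):
        return "name.tsv"
    return "main.tsv"


def split_sentences(sentences):
    """Group sentences by the file they would be written to."""
    buckets = {"main.tsv": [], "name.tsv": [], "digit.tsv": []}
    for sentence in sentences:
        buckets[categorize(sentence)].append(sentence)
    return buckets
```

The capitalisation check will misfile many inputs (a one-word sentence such as "Hello" would count as a name), so a real version would probably need proper tagging. The sketch also does not perform the noun-to-pronoun replacement described for the first category.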

My question: is this beneficial at all for the DeepSpeech model? Later, would it be able to recognize various numbers and nouns that appear within a sentence? Would it be able to handle out-of-vocabulary nouns and digits?

Example:
John went to that place, he loved New York.

stergro · October 26, 2021, 8:17pm

Hey, I have two thoughts about this:

For digits and the words “yes” and “no” there is already the “Single Word Target Segment”, which you can download further down on the dataset page (https://commonvoice.mozilla.org/en/datasets).
I generally like the idea of categories, but I believe it would make much more sense to separate the sentences into local varieties. For example, a subset each for British English, American English, Nigerian English, and so on, with sentences that use specific local terms that people outside that area don’t know. This would also be very beneficial for languages like Portuguese (with a Portugal subset and a Brazilian subset). It wouldn’t be useful to create entirely new languages in CV for every variety, but a feature like this, where people from different locations can select different local varieties of the sentences, would be great.

This would mean that there is a big general corpus for everyone in a language, plus local subsets that you can activate in your profile if you speak a variety of the language.