Hello all,
Would it be beneficial to split the sentences into the following categories:
- Sentences without nouns and without digits (replacing all nouns and noun phrases with pronouns).
- Sentences that are only nouns or noun phrases (technically not sentences).
- Sentences that are only digits (technically not sentences).
The idea is to keep the dataset as generic as possible, so it could be split into main.tsv, name.tsv and digit.tsv.
An example:
- He went to that place, he loved it.
- John, Sara Brown, New York.
- Twenty two, a hundred and fifty.
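To make the split concrete, here is a rough Python sketch of how the routing could look. It is only an illustration under strong assumptions: the word lists `NUMBER_WORDS` and `KNOWN_NAMES` and the functions `tokenize` and `categorize` are hypothetical names I made up, and a real pipeline would presumably use a POS tagger or NER instead of hand-written sets. It also does not do the noun-to-pronoun replacement, it only decides which file a sentence would go to.

```python
import re

# Hypothetical, tiny word lists for illustration only. A real pipeline
# would rely on a POS tagger / NER rather than hand-written sets.
NUMBER_WORDS = {"a", "and", "hundred", "thousand", "one", "two",
                "three", "twenty", "fifty"}
KNOWN_NAMES = {"john", "sara", "brown", "new", "york"}


def tokenize(sentence):
    """Lowercase the sentence and keep only word-like tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def categorize(sentence):
    """Return the target file name for a sentence.

    digit.tsv: every token is a number word or a digit string
    name.tsv:  every token is a known name
    main.tsv:  everything else (noun replacement is NOT done here)
    """
    tokens = tokenize(sentence)
    if tokens and all(t in NUMBER_WORDS or t.isdigit() for t in tokens):
        return "digit.tsv"
    if tokens and all(t in KNOWN_NAMES for t in tokens):
        return "name.tsv"
    return "main.tsv"


if __name__ == "__main__":
    for s in ["He went to that place, he loved it.",
              "John, Sara Brown, New York.",
              "Twenty two, a hundred and fifty."]:
        print(categorize(s), "<-", s)
```

A mixed sentence that contains both ordinary words and names falls through to main.tsv in this sketch, so it would still need the pronoun replacement step before it fits the first category.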
My question: is this beneficial at all for the DeepSpeech model? Once trained on the split data, would it be able to recognize the various numbers and nouns that appear within a sentence? Would it be able to handle out-of-vocabulary nouns and digits?
Example:
John went to that place, he loved New York.