How are sentences selected for sentence-collections?


I was wondering: How are sentences selected for the sentence-collection?

By this I mean: is it ensured that, for example, a certain number of numeric expressions are included in the collection:

He ate two apples.
There were 23,921 people.
Pi is approximately 3.1415.

and the same for dates, cities and so on.

Or are sentences more or less selected randomly from different corpora?

Would be interesting to know. :slight_smile:


What do you mean by “sentence-collection”? The sentence collector tool?

From the How-To:

  • Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.

Well, I mean the method by which sentences are selected for all the different languages.

The paragraph you referenced is what I meant. I wasn’t sure whether such guidelines exist and what they look like. Now I know :slight_smile:

I was wondering about this because I have the feeling that models like the Transformer are pretty capable of getting things like numbers right, so I asked myself whether it could make sense to introduce some sort of curriculum.

By that I mean that one could deliberately add specific sentences containing numbers - as stated in my question - and hope that the underlying model does the magic of getting them right for us.

Our current sentence collector processes invalidate any sentences with figures in them, to avoid this situation. There is also another script the Deep Speech team uses to double-check that no figures are used when training models.
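For illustration, such a check could be as simple as rejecting any sentence that contains a digit. This is just a hypothetical sketch of the idea, not the actual sentence collector or Deep Speech validation code:

```python
import re

# Hypothetical filter: reject any sentence containing a digit,
# since the spoken form of a number is ambiguous (e.g. "2409").
DIGIT_RE = re.compile(r"\d")

def is_valid_sentence(sentence: str) -> bool:
    """Return True only if the sentence contains no digits."""
    return DIGIT_RE.search(sentence) is None

sentences = [
    "He ate two apples.",
    "There were 23,921 people.",
    "Pi is approximately 3.1415.",
]

# Only the first sentence survives the filter, because the
# other two spell their numbers with digits.
valid = [s for s in sentences if is_valid_sentence(s)]
```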

I see.

If I may ask: What is this decision based on? Is it a precaution founded on empirical studies or literature, or was it made to keep things simple?

I hope you don’t mind me asking. I am currently just thinking about what a good sentence collection could be made up of.

E.g. one could also try to make sure that major cities and all countries are included, etc.

We just follow the advice from #deep-speech team experts and ensure the dataset we produce is useful for STT. You can ask for more details in their category, but basically the reason is the one I linked from the how-to.


Thanks for sharing this insight! I will give it a read :slight_smile: