Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.
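To make that ambiguity concrete, here is a minimal sketch (my own illustration, assuming the third-party num2words package, which the guidelines themselves do not mention). The same digit string verbalizes differently depending on whether it is treated as a cardinal number or a year:

```python
# Illustrative only: two valid verbalizations of the same digits.
# Requires the third-party num2words package (pip install num2words).
from num2words import num2words

# Read as a plain cardinal number:
print(num2words(2409))             # "two thousand, four hundred and nine"

# Read as a year, the way "2409" appears in "the year 2409":
print(num2words(2409, to="year"))  # roughly "twenty-four oh-nine"
```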
Well, something along the lines of how sentences are selected for all the different languages.
The paragraph you referenced is what I meant. I wasn’t sure whether such guidelines exist and what they look like. Now I know.
I was wondering about this because I get the feeling that models like the Transformer are pretty capable of getting things like numbers right, so I asked myself whether it could make sense to introduce some sort of curriculum.
By that I mean that one could deliberately add specific sentences which contain numbers - as stated in my question - and hope that the underlying model does the magic of getting them right for us. A rough sketch of what I have in mind is below.
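If one went that route, the no-digits rule could still be respected by spelling the numbers out at generation time. This is a hypothetical sketch; the templates and the num2words dependency are my own assumptions, not anything the project actually uses:

```python
# Hypothetical sketch: generate number-bearing sentences with the digits
# already spelled out, so they would pass a no-digits rule.
# Requires the third-party num2words package (pip install num2words).
import random

from num2words import num2words

# Placeholder templates, purely for illustration.
TEMPLATES = [
    "The parcel weighed {n} kilograms.",
    "She counted {n} birds on the wire.",
    "The train leaves in {n} minutes.",
]

def make_number_sentence(rng: random.Random) -> str:
    """Build one sentence containing a spelled-out random number."""
    value = rng.randint(1, 9999)
    spelled = num2words(value)  # e.g. "forty-two"
    return rng.choice(TEMPLATES).format(n=spelled)

rng = random.Random(0)
for _ in range(3):
    print(make_number_sentence(rng))
```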
Our current sentence collector process invalidates any sentence with figures in it, to avoid this situation. There is also another script the Deep Speech team uses to double-check that no figures are used when training models.
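Roughly speaking, the check amounts to rejecting anything that contains a decimal digit. A minimal sketch of such a validation pass (not the actual Sentence Collector or Deep Speech code) might look like this:

```python
# Sketch of a no-digits validation pass; not the real implementation.
import re

DIGIT_RE = re.compile(r"\d")

def has_figures(sentence: str) -> bool:
    """Return True if the sentence contains any decimal digit.

    In Python 3, \\d on str patterns also matches non-ASCII
    decimal digits, so this covers Unicode digit characters too.
    """
    return DIGIT_RE.search(sentence) is not None

assert has_figures("Room 2409 is free")
assert not has_figures("Room twenty-four oh-nine is free")
```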
If I may ask: what is this decision based on? Is it a precaution grounded in empirical studies or literature, or was it made to keep things simple?
I hope you don’t mind me asking. I am just thinking at the moment about what a good sentence collection should be made up of.
For example, one could also try to make sure that major cities and all country names are included, and so on.
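A quick coverage check could flag which entries from such a list never occur in the collected sentences. This is a hypothetical sketch; the name list and sentences are placeholders of my own:

```python
# Hypothetical sketch: report which target names are missing from a corpus.
from typing import Iterable

# Placeholder list; a real check would load full city/country lists.
TARGET_NAMES = ["Berlin", "Nairobi", "Iceland", "Portugal"]

def missing_names(sentences: Iterable[str], names: list[str]) -> list[str]:
    """Return the names that appear in none of the sentences."""
    corpus = " ".join(sentences).lower()
    return [name for name in names if name.lower() not in corpus]

sentences = ["She flew from Berlin to Nairobi.", "Portugal won the match."]
print(missing_names(sentences, TARGET_NAMES))  # ['Iceland']
```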
We just follow the advice from the #deep-speech team experts and make sure the dataset we produce is useful for STT. You can ask for more details in their category, but basically the reason is the one I linked from the how-to.