Currently, the corpus is collected, tokenized into sentences, which are then filtered/stripped, randomized, and presented to the user for voice recording (roughly as sketched below).
This is perfect for machines but painful for humans.
Reading these sentences is an effort rather than a pleasure (for the average human), mainly because of the lack of meaning conveyed by the individual sentences: we are not learning anything new while reading them.
It’s also a waste, because many of these sentences actually come from books or articles that readers might genuinely find interesting.
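For clarity, here is a minimal sketch of the current pipeline as I understand it. The tokenizer, filter rules and length thresholds are invented for illustration only, not the actual ones used:

```python
import random
import re

def sentence_pipeline(corpus_text):
    """Illustrative only: split a corpus into sentences, filter them,
    then shuffle them before presenting them to contributors."""
    # Naive sentence tokenization on ., ! and ? (a real pipeline would
    # use a proper tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', corpus_text.strip())

    def is_acceptable(sentence):
        # Hypothetical filter rules: printable ASCII only, no digits,
        # no all-caps abbreviations, and a bounded word count.
        if not sentence.isascii() or not sentence.isprintable():
            return False
        if re.search(r'\d', sentence):
            return False
        if re.search(r'\b[A-Z]{2,}\b', sentence):
            return False
        return 3 <= len(sentence.split()) <= 14

    filtered = [s for s in sentences if is_acceptable(s)]
    random.shuffle(filtered)  # randomization destroys the original order and context
    return filtered
```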
Pros of the current system:
- Filtered sentences are unambiguous from the machine's point of view (ASCII, language, characters, …)
- Randomization ensures vocabulary / lexical-field variety (… in theory)
- The lack of context makes sentences sound odd, which (in theory) forces users to read them at least once before starting the recording, which (again, in theory) ensures a more fluent diction.
Cons of the current system:
- Lack of context, unintelligibility, and a mismatch between sentences and speakers => a painful task. It limits long-term contribution and keeps kids out.
Does it make sense for an STT system (or any other) to be fed a sentence like: “it consume processing time for no use what imply a regular performance drop.”
… with 95% of the readers not even understanding what these words are about when they recorded their clip? I’m confident this negatively affects the probabilities the neural network outputs for words like “consume” or “time” when they later appear in the vast majority of totally different (and more natural) contexts.
- Wrong grammar/syntax caused by the filters (names/abbreviations/numbers filters, …). Broken sentences lead to bad clips and biased data.
- Imbalance of names, numbers, etc., or lack of cultural fit, due to corpus sentences being chosen at random and longer sentences being discarded
(e.g. over-representation of technical or modern terms, or predominance of poorly written language using less elaborate, meaningful or symbolic wording).
- Loss of the “topic” attached to each sentence and to its original text.
(Users can create custom scorers to target different AI fields, but the current sentence randomization loses track of what type of literature a sentence comes from, which makes it harder to launch topic-oriented campaigns; a sketch of how this metadata could be kept follows this list. E.g.:
- Voice assistant: use a “daily modern life” corpus.
- Science-oriented AI: use Wikipedia.
- Language-oriented research: use the classical/ancient literature corpus.
- Education / language-learning application: use a children's book corpus.)
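To make that last point concrete, here is a minimal sketch (all field and function names are hypothetical, not an existing API) of how each sentence could keep a reference to its source and topic, so that a topic-oriented campaign simply filters on that metadata:

```python
from dataclasses import dataclass

@dataclass
class CorpusSentence:
    text: str
    source: str  # e.g. the book or article the sentence was extracted from
    topic: str   # e.g. "daily life", "science", "classics", "kids"

def sentences_for_campaign(sentences, topic):
    """Select only the sentences whose topic matches a campaign's theme."""
    return [s for s in sentences if s.topic == topic]
```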
Suggested solution:
Assume that the effort required by the user to read a real book is lower than the effort to read sequences of prefiltered & randomly chosen sentences.
In that case, the 5% or 10% of unsuitable sentences are dropped after the recording, instead of providing only 100% machine-usable sentences up front and collecting 100% of the corresponding clips.
=> Create a “book read mode” (or sentence “collections”)
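A rough sketch of the idea, assuming clips are keyed by the sentence text and reusing a hypothetical is_acceptable() filter like the one above: the filtering simply moves to after the recording step, so a small fraction of clips is discarded instead of a large fraction of sentences never being offered at all.

```python
def export_dataset(recorded_clips, is_acceptable):
    """recorded_clips: dict mapping each sentence read by the contributor
    to its recorded clip. The whole book is read first; only afterwards are
    the ~5-10% of machine-unsuitable sentences (and their clips) dropped."""
    kept, dropped = {}, {}
    for sentence, clip in recorded_clips.items():
        if is_acceptable(sentence):
            kept[sentence] = clip
        else:
            dropped[sentence] = clip  # discarded, or set aside for future alignment work
    return kept, dropped
```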
Then contributing would actually start to be a fun task.
A 10-year-old kid as well as an elderly person could read stories, novels and literature.
I understand it’s a completely distinct mode of operation: a more human-friendly and pleasant “read a book for the blind”-like mode, which still preserves the sentence as the basic unit of recording.
And if the alignment heuristics improve some years later, a higher proportion of the recordings (e.g. longer sentences) could be reprocessed from the existing full-book recordings.