Book-reading mode (aka "ordered sentences collections")

Currently, the corpus is collected and tokenized into sentences, which are filtered/stripped, randomized and provided to the user for voice collection.
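To make the discussion concrete, here is a rough sketch of that tokenize / filter / shuffle flow. It is not the actual Common Voice or Sentence Collector code: the splitter is naive, and `is_machine_friendly`, `build_prompts` and the 14-word limit are names and values assumed only for illustration.

```python
import random
import re

MAX_WORDS = 14  # assumed upper bound, for illustration only


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_machine_friendly(sentence: str) -> bool:
    """Hypothetical filter: short, no digits, no all-caps abbreviations."""
    words = sentence.split()
    if not 1 <= len(words) <= MAX_WORDS:
        return False
    if re.search(r"\d", sentence):
        return False
    if re.search(r"\b[A-Z]{2,}\b", sentence):
        return False
    return True


def build_prompts(text: str, seed: int = 0) -> list[str]:
    """Tokenize, filter, then shuffle, so the original reading order is lost."""
    prompts = [s for s in split_sentences(text) if is_machine_friendly(s)]
    random.Random(seed).shuffle(prompts)
    return prompts


book = (
    "Call me Ishmael. It was 1851 when the ship left. "
    "The sea was calm! NASA did not exist yet."
)
prompts = build_prompts(book)
print(prompts)
```

In this toy text, the sentence with a year and the one with an abbreviation are dropped, and whatever survives is served in an arbitrary order: the story is gone by the time a reader sees it.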

This is perfect for machines but painful for humans.
Reading these sentences is still an effort rather than a pleasure (for average humans), mainly because the individual sentences convey so little meaning: we are not learning anything new while reading them.

It’s also a waste, because many of these sentences actually come from books or articles that would be interesting to read in full.

Pros of the current system:

  1. Filtered sentences are unequivocal from the machine PoV (ASCII, language, characters, …)
  2. Randomization ensures vocabulary / lexical field variety (… in theory)
  3. The lack of context makes sentences sound weird, which (in theory) forces users to read them at least once before starting the recording, which (in theory) ensures a more fluent diction.

Cons of the current system:

  1. Lack of context, unintelligibility and mismatch between sentences and speakers => Painful task. Limits long-term contribution. Keeps kids out.
    Does it make sense for an STT system (or any other) to be fed a sentence like:

    “it consume processing time for no use what imply a regular performance drop.”

    … with 95% of the readers not even understanding what these words are about when they recorded their clip? I’m confident this negatively affects the probabilities the neural network outputs, at inference time, for words like “consume” or “time” in the vast majority of totally different (and more natural) contexts.

  2. Wrong grammar/syntax caused by filters (names/abbrev/numbers filters…). Wrong sentences cause bad clips and biased data.

  3. Imbalance in names, numbers, … or lack of cultural fit, due to corpus sentences being randomly chosen or longer sentences being discarded.
    (E.g.: over-representation of technical or modern terms, or predominance of poorly written text using less elaborate, meaningful or symbolic language)

  4. Loss of the “topic” attached to a sentence and its original text.
    (Users can create custom scorers to handle different AI fields, but the current sentence randomization loses the “type of literature a sentence comes from”, which makes it more difficult to launch topic-oriented campaigns. E.g.:

    • Vocal assistant: Use a “daily modern life” corpus.
    • Science-oriented AI: Use Wikipedia.
    • Language-oriented research: Use the classical/ancient literature corpus.
    • Education / language learning application: Use a kids’ books corpus.)

Suggested solution:

Assume that the effort required by the user to read a real book is lower than the effort to read sequences of prefiltered & randomly chosen sentences.
In the book case, the 5% or 10% of unsuitable sentences are dropped after the recording, instead of providing 100% machine-usable sentences and keeping 100% of the corresponding clips.

=> Create a “book read mode” (or sentences “collection”)

Then it would actually start to be a fun task.
A 10-year-old kid as well as an elderly person could read stories, novels and literature.

I understand it’s a completely distinct mode of operation: a more human-friendly and pleasant “Read a book for blind people”-like mode which preserves the sentence as the basic unit of recording.

And if the alignment heuristics improve some years later, then a higher proportion of the recordings (e.g. longer sentences) could be processed based on the existing full-book recording.


I’ll just put my two cents on why this may seem like a good idea, but probably isn’t.
First, you talk about getting kids involved. Sorry, that is not possible, mainly for legal reasons. Even with a guardian’s approval, this would be too risky, especially since there are far too few people at Mozilla working on this project to act on any complaints.

Second, Common Voice favors “real world” recordings, with background noise and so on. This implies smaller batches that can be recorded in any downtime, while a “book” mode would probably make sense only in a cleaner, “studio” environment, with more time dedicated per session.

Third, Common Voice aims to provide conversational recordings - recordings of sentences, that can and do occur in everyday speech. Also, the sentences are supposed to be 10-14 words maximum. Books and the style of sentences therein don’t really align with that goal. (You provide a measure of 5% to 10% of unsuitable sentences dropped. I don’t know what books you are reading, but just to align with the required sentence and recording length, you would probably need a whole order of magnitude more sentences dropped.)

The last counter-argument I can think of right now is that ordered sentences create significant bias on all parties involved. There is a reason why the reviewing handbook basically says “first listen, then read to check”. Sentences in an order that makes sense would just make this situation significantly worse.


I understand and share some of these worries, but not others. I’ll try to dig into each of them separately.

legal (US) stuff

Making CommonVoice a more enjoyable thing (for both adults and children) does not make it illegal by itself.
There is even a 4-19 age category in the profile, and kids’ voices make up ~4% in multiple languages ¹ so far. There are many compelling points in terms of education, language learning, … It’s up to Mozilla US to clear up that legal stuff once and for all.

The project does not currently explicitly forbid minors, and if that were to happen, the next step would be a porn-like advisory popup before users record/publish their own voice :slight_smile:

Aim of conversational recordings

Common Voice aims to provide conversational recordings - recordings of sentences, that can and do occur in everyday speech.

I’m sorry, I have to disagree. The FAQ states (and I totally agree with this):

The goal of the Common Voice dataset is to enable anyone in the world to build speech recognition, speaker recognition, or any other type of application that requires voice data. A voice assistant is just one of many types of applications you could use the dataset to build.

The dataset could help to automatically transcribe theater recordings, radio or TV archives, interviews, conferences and basically any human voice content recorded since the invention of the phonograph. It could help with human-machine interfaces (and not only conversations, but also programming… or gaming). Advances in transcription could in turn bring advances in alignment, leading to a broader (aligned) dataset and subsequent improvements in voice synthesis…

By the way, voices are currently collected by having users read written text (with imperfect punctuation and no context), which is already quite distinct from conversational recordings.

required sentence and recording length:

Book/article sentences are longer and more elaborate, and we all agree that having short sentences is (currently) necessary for machine learning.
But:

  1. How many of these unusable sentences exist in a large text? I guessed 5% or 10% and you rightfully disagree. But I can’t argue for or prove any number. Some testing must be done to see how (in)adequate books or articles are.
  2. But it’s perfectly possible to forcibly cut long sentences in order to force shorter clips: think about a cursor highlighting the region to read, which could very well stop at the next comma, colon, … and wait for the user to send or confirm it.
  3. This percentage of usable sentences within a book will only increase over time. For example, advances on DSAlign may very well allow longer sentences in the future. Having the clips already recorded won’t be bad at all. Still, it would be good to know from @Tilman_Kamp whether any insurmountable hard limit is already known.
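As a minimal sketch of the "stop at the next comma, colon, …" idea from point 2: split on punctuation, then greedily regroup the pieces into chunks under a word limit. The function name `split_for_recording` and the 14-word limit are assumptions made for illustration, not an existing Common Voice feature.

```python
import re

MAX_WORDS = 14  # assumed limit, for illustration only


def split_for_recording(sentence: str, max_words: int = MAX_WORDS) -> list[str]:
    """Cut after commas, semicolons or colons, then greedily regroup the
    pieces into chunks of at most max_words words.

    A single piece that is already longer than max_words is kept whole,
    so the caller can still decide to drop it.
    """
    pieces = [p.strip() for p in re.split(r"(?<=[,;:])\s+", sentence) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate.split()) > max_words:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentence = (
    "When the rain finally stopped, the children ran outside, "
    "laughing and shouting, and nobody thought about the storm again "
    "until the evening news."
)
chunks = split_for_recording(sentence)
for chunk in chunks:
    print(len(chunk.split()), chunk)
```

Here a 23-word sentence becomes two chunks of 12 and 11 words, each of which the reader would confirm before the cursor moves on. Joining the chunks with a space gives back the original sentence, so the full text stays available if longer clips become usable later.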

ordered sentences create significant bias on all parties involved

I failed to understand that point. Did you mean a reviewer would tend to read the article/story as well, instead of concentrating on the actual clip quality and validation? If yes, I’m not sure about that, but even if true, book-originating clips could be randomized before being submitted to the review queue.

"real world" recording

About your second worry of “real world” recording (I kept it for last because I think it’s a concerning one): real-world data is desired for the robustness of the training dataset. Robustness is welcome but comes at the price of many more recorded hours.

  • It’s likely that book or article reading would keep users talking longer than a typical sentence-reading session (which is good for the project), but I doubt conditions would be that different for these users.

    • I don’t share the assumption that these clips would be “too perfect”. But I believe they would be more natural, unlike the current context-orphaned sentences.
    • But even if they were (clearer, better pronunciation, less noise, …), it would not harm the training dataset at all.
  • Absence of noise is not a blocker. I remember having read research papers that successfully added artificial background noise to strengthen their final STT. I’m pretty sure it could be done as a post-recording (or pretraining) filter. It’s easier to add than to remove.
    [Next commonvoice campaign: Record only your background noise :wink:]

  • Last but not least, it’d be easy to log whether a sentence comes from single-sentence or book mode, so it’d be possible to balance the training dataset to provide either more “good quality” recordings or more “noisy” recordings, and actually determine what an ideal balance looks like.
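To illustrate the "easier to add than to remove" point, here is a rough sketch of mixing a background-noise recording into a clean clip at a chosen signal-to-noise ratio. It is not taken from any specific paper or from the Common Voice tooling: `mix_at_snr` is a made-up name, the "speech" is just a sine tone standing in for a clip, and real pipelines would work on audio arrays rather than Python lists.

```python
import math
import random


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add noise to speech, scaled so that the speech-to-noise RMS ratio,
    expressed in decibels, equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise: nothing to add
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


SAMPLE_RATE = 16_000
# Stand-in for a clean clip: one second of a 220 Hz tone.
speech = [0.5 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
# Stand-in for recorded background noise: seeded Gaussian noise.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]

noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

The same clean clip can be reused with several noise recordings and several SNR values, which is what makes "too clean" book-mode clips cheap to roughen up after the fact.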

Overall, bringing new audiences able to give hundreds if not thousands more hours of voice is an infinitely superior benefit.

Major blocker

There is a major blocker not mentioned: the cost of adapting the existing platform (UI, database structure, …)

If it’s deemed too hard, I think the concept could be advantageously implemented as a browser extension.
When activated (in reader view), it could start selecting regions of text, sentence after sentence, and provide the usual “record/stop/send” icons. Once a sentence is recorded, it would continue and highlight the next adjacent sentence.
One issue of not using the Common Voice website would be the inability to control the license of the content users read and upload. But this could be mitigated by hardcoding a restricted list of domains known to provide CC-0 content (or by only accepting pages that declare CC-0 in their HTML).
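A minimal sketch of that license check, in Python for readability (a real extension would be written in JavaScript). The domain names are placeholders rather than a vetted list, and `is_recordable` is a hypothetical helper. It assumes pages declare their license with a `rel="license"` link pointing at the Creative Commons CC0 deed, which not every CC0 site does.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist: placeholder domains, not a vetted list.
CC0_DOMAINS = {"cc0-books.example", "public-domain.example"}
# Path shared by the Creative Commons CC0 deed URLs.
CC0_URL_FRAGMENT = "creativecommons.org/publicdomain/zero/"


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self) -> None:
        super().__init__()
        self.license_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "license" in rel and attr.get("href"):
            self.license_urls.append(attr["href"])


def is_recordable(url: str, html: str) -> bool:
    """Accept a page if its domain is allowlisted or it declares CC0."""
    host = (urlparse(url).hostname or "").lower()
    if host in CC0_DOMAINS or any(host.endswith("." + d) for d in CC0_DOMAINS):
        return True
    finder = LicenseLinkFinder()
    finder.feed(html)
    return any(CC0_URL_FRAGMENT in u for u in finder.license_urls)


page = (
    '<html><head><link rel="license" '
    'href="https://creativecommons.org/publicdomain/zero/1.0/"></head></html>'
)
print(is_recordable("https://blog.example/post", page))
```

A self-declared license tag is easy to fake or get wrong, so the hardcoded domain list is probably the safer of the two checks. The HTML check is more of a convenience.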

¹ IMHO a single 4-19 years old category is a mistake. Splitting it into two categories, before/after 14, would be more useful for voice classification, machine-wise.


https://commonvoice.mozilla.org/en/terms, 1st point. Although contribution under supervision is permitted here, I recall it being generally frowned upon in the project chat.

Theater recordings, interviews, conferences and, to some (major) extent, other sources of voice recordings all however contain mainly monologues or dialogues, whose form differs from most of the sentences found in novels or other books. (But your point is accepted here.)

Exact numbers: from my attempts to get some sentences from books into Sentence Collector, the acceptability rate without manual preprocessing could lie around 20%, maybe? It could vary wildly from author to author, but generally you can’t rely on any book published during or after the 1950s due to probable licensing issues. (With pre-1950s books you can sometimes rely on the “70 years since the author’s death” rule; the older the book, the better in this regard.)

“Bias”: having sentences in order draws the attention of all parties involved away from the goal, which is to have exact recordings. That includes the people recording, who may abbreviate something or skip a word here or there, which, in the context of the text, doesn’t change anything, but misses Common Voice’s goal of providing recordings with accurate transcriptions. For example, I would personally definitely not trust any reviews I made on the Sentence Collector page, in the time before sentence shuffling, of a sentence that was part of a longer text. It just didn’t allow as much focus as needed.

The major blocker could, under the right circumstances, not be as major as it may seem. I can imagine this as a GSoC project, provided there is someone capable of mentoring one and Common Voice participates in GSoC. Texts for recording could then be collected manually, as is done now with any other texts.