Are LibriVox contributions really being put into Common Voice?

The number of books narrated on LibriVox far exceeds the hours of Common Voice recordings in many languages.

There’s something wrong.

Hi @maria2 - can you please clarify your question and assumptions here?

  • Are you suggesting that recordings from the LibriVox project are added to the voice recordings for Common Voice? If so, that’s not my understanding, although @jesslynnrose or @Gina_Moape might want to confirm.

  • If you are suggesting that sentences from Project Gutenberg - out of copyright works - are used to populate sentences for Common Voice, then yes, I understand that is the case, because they are out of copyright and therefore public domain (CC0).

  • Is this related to your previous comment about there not being enough sentences to speak for Portuguese?

Hi @maria2

If you are suggesting that recordings from the LibriVox project are added to the Common Voice datasets, and that the hours from the LibriVox project and the Common Voice project do not match, then yes, they would not match: we only have/use recordings from volunteers who used the Common Voice platform to record their voices. We do not use any external recordings from outside CV.


As a side note: because Common Voice is mainly STT-oriented, including such a “feature” would harm the datasets. Here are some issues I can think of off the top of my head:

  • A single book adds a large number of recordings from a single person. That would definitely result in voice bias (and gender, age, and accent biases).
  • Most of these audiobooks are recorded by (semi-)professionals with (semi-)professional equipment in studio-like, quiet environments. That can only produce an unrealistically clean dataset. If trained on a clean dataset (without augmentation), most models will fail in the wild, where environments are noisy.
  • Most books have long descriptive sentences. CV is limited to 10s recordings, so it would be a huge undertaking to divide an audiobook into 10s segments.
  • You cannot just use the data without the consent of the voice artist unless it is already CC0.
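On the segmentation point: once an aligner has produced sentence-level timestamps, cutting an audiobook into clips under a duration limit is mostly bookkeeping. A minimal sketch (hypothetical code, not part of Common Voice; the 10 s limit and the greedy-grouping approach are assumptions from this discussion):

```python
MAX_CLIP_SECONDS = 10.0  # CV's recording limit, per the post above

def group_into_clips(segments, max_len=MAX_CLIP_SECONDS):
    """segments: list of (text, start_sec, end_sec), sorted by start time.

    Greedily merges consecutive sentences into clips no longer than
    max_len. A single sentence longer than max_len still becomes its own
    (oversized) clip and would need further splitting in practice.
    """
    clips = []
    cur_texts, cur_start, cur_end = [], None, None
    for text, start, end in segments:
        if cur_start is not None and (end - cur_start) <= max_len:
            cur_texts.append(text)   # still fits in the current clip
            cur_end = end
        else:
            if cur_texts:            # flush the finished clip
                clips.append((" ".join(cur_texts), cur_start, cur_end))
            cur_texts, cur_start, cur_end = [text], start, end
    if cur_texts:
        clips.append((" ".join(cur_texts), cur_start, cur_end))
    return clips

# Invented sample timestamps, as a forced aligner might emit them:
segments = [
    ("It was the best of times,", 0.0, 3.2),
    ("it was the worst of times,", 3.2, 6.1),
    ("it was the age of wisdom,", 6.1, 9.0),
    ("it was the age of foolishness.", 9.0, 12.4),
]
print(group_into_clips(segments))
```

The first three sentences fit into one clip of 9.0 s; the fourth would push it past 10 s, so it starts a new clip.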

Bias can be filtered out. Clean audio may be a problem today, but what about in the future? It really is impossible to improve this project.

It seems like a very manual and lazy way of doing things. These public-domain works should be repurposed to improve and diversify the dataset.

It’s not related, or only tangentially. This remains a problem because, for some reason, it is practically impossible to collect new sentences by speaking. And since you don’t accept material from other sources, I’m just going to give up on the project.

How would you approach this in practice? I mean, we could build an additional LibriVox dataset that people could use if they want to. It would be a completely different dataset from the Common Voice dataset, but it could be a valuable resource. Maybe LibriVox could even offer such a dataset on their website.

But this would require a tool capable of cutting long audio files and aligning them with the written text. I have thought about this for years, but I don’t see an easy way to do it with good enough quality without a lot of manual work.

If we solve this, a lot of resources like podcasts and radio shows would become available for STT systems. There are tons of public-domain and Creative Commons audio files.

I agree with Bülent that this would not be in the scope of the Common Voice project in its current form. It would be a completely new kind of dataset. Maybe it could be done by the new mysterious company?

EDIT: The main term to google here is “forced alignment”. It is the technology that matches existing texts to long audio files.
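To give a rough idea of what an aligner produces: real forced aligners (aeneas and the Montreal Forced Aligner are two known open-source options) use acoustic models to match each sentence to a time span in the audio. The toy stand-in below is *not* forced alignment — it just splits a known total duration in proportion to sentence length, purely to illustrate the (sentence, start, end) output shape that downstream segmentation tools consume:

```python
def naive_alignment(sentences, total_seconds):
    """Crude stand-in for a forced aligner, for illustration only.

    Apportions total_seconds to each sentence in proportion to its
    character count and returns (sentence, start, end) tuples — the
    same shape a real aligner would emit, with made-up timings.
    """
    total_chars = sum(len(s) for s in sentences)
    result, t = [], 0.0
    for s in sentences:
        dur = total_seconds * len(s) / total_chars
        result.append((s, round(t, 2), round(t + dur, 2)))
        t += dur
    return result

sents = ["Call me Ishmael.", "Some years ago, never mind how long precisely."]
print(naive_alignment(sents, 10.0))
```

A real aligner replaces the proportional guess with acoustically grounded boundaries, but everything downstream (segmenting, filtering by duration) works on the same tuples.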


Exactly @stergro… Alignment is a big problem, and whole teams have been formed to tackle it. There might be some solutions for some widely used / Western languages, but for the 112+ languages in CV?

Also, do those low-resource languages even have audiobooks, or would this end up like other datasets with limited language coverage? Mozilla’s Common Voice project is enabling low-resource languages.


Many people talk quite “mechanically” in this project, so perhaps it would not be so different from the content available on LibriVox… But yeah, with podcasts and radio shows there is a lot of more “natural” content already available, and I feel like sometimes we are doing things as inefficiently and cheaply as possible.

The project has to be efficient if it wants to become something serious and grandiose. Minimum effort for maximum gain. If there is so much content in the public domain, why reinvent the wheel and waste the effort of the few volunteers? I can’t understand. Datasets could be much larger and more diverse if these things were addressed.

If we can solve the alignment problem, we can build another dataset collection from these audiobooks. We could even use 0-30 sec segments for these (30 sec is the current limit on many systems), getting past the CV recording-duration limit. 5-25 sec recordings are best for today’s SotA systems.
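Once segments carry timestamps, filtering to a duration window like this is trivial. A hypothetical sketch (names and sample values invented; the 5-25 s window comes from the post above):

```python
# Keep only aligned segments whose duration falls in a target window.
# MIN_SEC/MAX_SEC reflect the 5-25 s range discussed above; 30 s is a
# common hard limit in current systems.
MIN_SEC, MAX_SEC = 5.0, 25.0

def usable(segment, min_sec=MIN_SEC, max_sec=MAX_SEC):
    """segment is (text, start_sec, end_sec); True if its duration fits."""
    _text, start, end = segment
    return min_sec <= (end - start) <= max_sec

# Invented sample segments:
candidates = [
    ("short interjection", 0.0, 2.0),        # 2.0 s: too short
    ("a full sentence or two", 2.0, 14.5),   # 12.5 s: keep
    ("a long rambling passage", 14.5, 48.0), # 33.5 s: too long, re-split
]
kept = [seg for seg in candidates if usable(seg)]
print([seg[0] for seg in kept])  # ['a full sentence or two']
```

Segments that fall outside the window are not necessarily lost: too-long ones can be re-split at sentence boundaries, too-short ones merged with neighbors, as in the grouping step.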

There is nothing stopping us from joining multiple datasets for training, or fine-tuning a model trained on one with another. It mainly needs experimentation and research.

But this must be a separate dataset. Here in CV, the philosophy of the project is very different: it is community-driven, volunteer-based work, where people donate their voices.

I’m not sure about podcasts and such, though (even with subtitles). If they are machine-transcribed, they will not be OK for any training and must be proofread/corrected by teams.

@maria2, I’m not an expert in forced-alignment-related stuff. Do you know of a working methodology and codebase that works 100% on many languages, without human intervention?