I'm almost giving up on the project. Feedback from a big contributor (10000 sentences sent, 7000 listened)

Dear @mary thank you for your contributions.

Several of these issues are already being worked on; for example, improving the visibility of Sentence Collector (and Sentence Collector in general) is currently on the roadmap.

In terms of repetitive sentences, this is something that can be taken care of in the text corpus: if there are repetitive sentences, they can be removed. It would help if you let us know which language(s) you are contributing to.

In terms of advertising, we rely on the language communities represented on Common Voice to help us with that. With over 100 different languages participating so far, this is far more than a single project can accomplish on its own, and individual communities know their social media needs better.

We hear your frustration, and would be happy to help you come up with constructive ways to participate. Thank you again for your contributions!



Hey @mary, I’m with @daniel.abzakh on this. This is an open source project which belongs to us, the language communities, and the Mozilla Foundation is a facilitator who helps us. Being open source means we should contribute to it in any way possible.

I’m nearing two years of commitment to this project. When I first came here, I also saw many missing parts and pieces, including what you have mentioned, and some more, which were corrected along the way. You can see some of them here. You can also check the issues and feature requests on GitHub for a larger set.

The project currently has only a single engineer for the application and all the backend, and it is not easy to maintain such a big code base. In such scenarios it is not easy to completely change a feature and/or add new ones without breaking things. But I can see from the PRs on GitHub that it is continuously worked on.

I cannot say anything specific on behalf of the project or about your issues (errors, the language you’ve been contributing to [Arabic?], etc.), but I heard about the following:

  • The community manager position has been open for a while now (since July 2022) and they opened a call, which was closed after a time. I’d like to assume we will have someone to guide us for new campaigns :slight_smile:
  • There was a master’s thesis on CV UI/UX two months ago in which many of us participated. It was about Sentence Collector and its integration into the CV frontend. I hope it will be implemented so that SC comes into view.
  • Your problem with recording the same sentences lies in the fact that the text corpus is not big enough. You need to add new sentences/vocabulary as a community. If it is Arabic, you can check your detailed statistics here, or choose your language if it is not.
  • The infinite-loop error you mentioned might be the pop-up which appears while sending, right? It is a server-side error (from the database) which prevents the same person from recording the same sentence. You should cancel it; the code will not allow it. But you shouldn’t be allowed to record those sentences anyway, which is another unknown bug.

As a volunteer who knows some of this stuff, perhaps I can help if you can be more specific.


Maybe I’m not being clear enough. I don’t want to keep saying the same words! I want to speak words I’ve never spoken before in the project. It’s not just a corpus problem.

The words alone are in the public domain. And a corpus isn’t going to solve that, because I’m going to keep repeating words I’ve already spoken thousands of times or skipping until I get to different words.

Another thing that is irrational is that there is a short time limit for speaking the sentences, and sometimes the text is too long! Sometimes it is impossible to say everything in time.

The main problem is not the repeated sentences; the main problem is sending me words that I have already spoken, for me to speak again. Useless. I want to speak only words that I have never spoken before, to enrich the vocabulary of the dataset.

Increasing the text corpus is trivial, the words alone are in the public domain, it’s just literally taken from a dictionary. And even then, it wouldn’t solve the problem.

I would keep getting words that I have already spoken, that I don’t want to speak anymore, which is annoying. I want to receive only new words that I have never spoken, the project will be more useful, efficient, rational like that. Each user should have the option to receive only words that they have never spoken, to be more useful and not to get bored.

What is the point of receiving words that I have already spoken to speak again? Or keep skipping all the clips until I come up with some rare word I’ve never spoken before? God. How does the project go forward like this?

Most of the common expressions and usual words I have already spoken! Expand the corpus with words from open dictionaries and only send new words to users.

This would accelerate the development and seriousness of this project to a great extent. This should be the main focus of development.

You are free to propose new sentences. In terms of pronouncing words that you have already spoken, this actually makes sense: words are pronounced differently in context, and if you are seeing the words multiple times it may be that they appear in multiple contexts. In any case it is difficult to respond, as you have not given any concrete examples, and it is unclear which language you are working with.

A lot of work for one person. We need to augment the corpus with dictionary words, what’s the problem with doing that? There are no problems with copyright!

And context changes the pronunciation of words in expressions very little. The project needs to be improved; these flaws apply to all languages.

I have already sent emails to NVIDIA and Bill Gates’ corporation, among other partners that finance the project. This problem is very serious and it is not being addressed in the right way. Let’s wait. They need to hire more developers, if that’s the case.

It is always good to have a larger vocabulary, with those words in different sentences/contexts and spoken by different people (gender, age, accent, etc.).

But we need to be clear on some basics:

  • Common Voice is not aware of any language specifics, such as the dictionary’s universe, whether the words in a sentence have been spoken previously, statistics on that, etc.
  • What is recorded is defined by the text corpus of that language, which must be public domain/CC0 and which must be provided by the language communities.
  • It is best to have sentences spoken in the daily language, conversational ones. But as you might know, in any language only a small portion of the whole vocabulary is commonly used in everyday conversation, and the vocabulary changes over time. Some words are not used anymore (e.g. only spoken by the elderly) and some new ones appear (e.g. we didn’t use the phrases “face mask” or “blood clotting” in everyday language until we hit a pandemic).
  • It is not a good idea to dump a dictionary into the text corpus; entries should be valid sentences from conversations. There may be single-word sentences like “Tea?” (in place of “do you want some tea”), but it is better to have longer sentences. The acoustic models rely on sounds following one another, and different word combinations are good for them.
  • Different pronunciations matter. It can be accent differences (e.g. “Tea?” and “Tea…” are spoken differently and mean different things, “do you want tea?” for the first one and “I’m drinking tea…” as an answer to a question), or different pronunciations by different people from all over the world.

Now let us do some simple analysis, assuming some values and doing some calculations:

  • If we aim for a voice-AI model for a language, we will need both an acoustic model and a language model. (Note: Common Voice is here for the acoustic model; you can add a specific or broad language model for your application’s purpose, independent of the vocabulary here on CV.)
  • We aim for <10% WER (word error rate).
  • We have a language with a 100k-word dictionary, but only 50k of these words are in the CV text corpus/voice corpus…
  • With sufficient recordings, you might get WER = 50%, and with a specific language model added on top of it, you might get WER = 30%.

To improve these:

  • You would need more recordings (the duration of the training set is the most important factor, along with diversity).
  • You would need a larger text corpus to achieve that (to increase the vocabulary).
  • You need more people with different demographics (gender, age, accent) to speak these sentences.
  • You can fine-tune your language model with more data for the purpose at hand (e.g. producing subtitles for the news is different from transcribing a conference in the area of medicine).

For better domain-specific data, you would need domain-specific text corpora and recordings though, which would require a more-or-less major change in CV workflows.

An acceptable model can be achieved (say) with 100 h of validated, quality (correctness, diversity, vocabulary, etc.) recordings. Assuming an average recording is 3.6 seconds, 1 hour of data is 1000 recordings, so for 100 h you would need 100k recordings. If each sentence is spoken by two different people on average, that would mean 50k sentences in the text corpus.
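The back-of-envelope numbers above can be written out as a tiny script; the 3.6 s average clip, two speakers per sentence, and the 100 h target are the assumptions from this post, not official CV figures:

```python
# Rough sizing of a 100-hour validated dataset,
# using the assumed values from this post.
AVG_CLIP_SEC = 3.6         # assumed average recording length
SPEAKERS_PER_SENTENCE = 2  # assumed recordings per sentence
TARGET_HOURS = 100         # assumed target for an acceptable model

recordings_per_hour = 3600 / AVG_CLIP_SEC                    # 1000 clips
total_recordings = TARGET_HOURS * recordings_per_hour        # 100k clips
sentences_needed = total_recordings / SPEAKERS_PER_SENTENCE  # 50k sentences

print(int(recordings_per_hour), int(total_recordings), int(sentences_needed))
```

Change the assumptions (longer sentences, more speakers per sentence) and the required corpus size shifts accordingly.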

To get better results:
New sentences to SC (with new vocabulary) + new volunteers => new and more recordings

This is how things work.

A lot of work for one person.

Yep. You need to build a community for your language, some of them will be more interested/dedicated like you, so you might form a core group for social media, events, campaigns etc.


The main goal of a language community is the betterment of the language data (text & voice corpora) on CV, so that a better voice-AI comes out.

It is like steering. You turn left; if too much, you turn a bit right; if you’re slow, you hit the gas.

Currently, CV releases are 3 months apart. I think it is an ideal timing for this workflow:

  • A version comes out, you analyze the data (also taking the previous releases into account)
  • You find what you are lacking, or how much you improved wrt previous version.
  • Plan for the next release (campaigns etc)
  • Go to start

For the analysis part, I prepared two webapps. I think these will help with this crucial part of the workflow for all languages.

For example, the results of v12.0 for my language, Turkish, are here.

From the Text Corpus tab I can see the following:

The total token count is ~39k.
The Turkish dictionary has ~90k entries, but many of them are somewhat old (with Farsi or Arabic roots) and are not spoken much now.
On the other hand, Turkish is an agglutinative language (words are extended with suffixes), so the 39k figure does not represent only root words; it also includes plurals, etc.
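For reference, the kind of token counting behind a figure like ~39k can be sketched as follows; the tokenizer here is deliberately naive (lowercase, runs of word characters) and the three sentences are made up:

```python
from collections import Counter
import re

# Toy stand-in for a language's text corpus; a real analysis
# would stream the full validated-sentences file instead.
sentences = [
    "Çay ister misin?",
    "Çay içiyorum.",
    "Yarın görüşürüz.",
]

def tokens(text):
    # Naive tokenizer: lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())

counts = Counter()
for sentence in sentences:
    counts.update(tokens(sentence))

print(len(counts))    # number of distinct tokens (types)
print(counts["çay"])  # frequency of one token
```

For an agglutinative language you would additionally need stemming to separate root words from their inflected forms, which this sketch does not attempt.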

So I need to work on vocabulary and extend the text-corpus.

If we look at the character or word/sentence distribution:

I see that many sentences are short. We added many short conversational sentences; this is why.

So we need to work for longer sentences.

I added all the books from a famous writer, which became public domain, so a fair amount of rather old vocabulary is also included.

So I need to find CC0 sources for new text-corpus, which are longer sentences and include new vocabulary.

Whenever I find a resource, I analyze it against the current corpora (with offline scripts) and calculate the amount of new vocabulary. I also have similarity filters, with which I filter out very similar sentences.

After collecting a couple of these, I merge them and shuffle them before posting them to Sentence Collector for validation, which is done by the community. From each book, I can get some 3-5k sentences and a few hundred new vocabulary words.
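A minimal sketch of this kind of offline filtering; the similarity threshold, the whitespace tokenization, and the `filter_candidates` helper are all illustrative choices of mine, not the actual scripts:

```python
from difflib import SequenceMatcher

def filter_candidates(existing, candidates, threshold=0.85):
    """Drop candidates too similar to known sentences; report new vocabulary."""
    kept = []
    for cand in candidates:
        too_similar = any(
            SequenceMatcher(None, cand.lower(), known.lower()).ratio() >= threshold
            for known in existing + kept
        )
        if not too_similar:
            kept.append(cand)
    # Naive whitespace vocabulary comparison (punctuation left attached).
    known_vocab = {w for s in existing for w in s.lower().split()}
    new_vocab = {w for s in kept for w in s.lower().split()} - known_vocab
    return kept, new_vocab

existing = ["Do you want some tea?"]
candidates = ["Do you want some tea!", "The weather is lovely today."]
kept, new_vocab = filter_candidates(existing, candidates)
print(kept)       # the near-duplicate is filtered out
print(new_vocab)  # vocabulary the kept sentences would add
```

The pairwise comparison is quadratic, so for large corpora you would bucket or hash sentences first, but the idea is the same: only keep candidates that add variety and new vocabulary.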

I hope this example helps with your language.

There are open dictionaries of almost all languages, why not just take from them and make random sentences?

Are there any words that are not in the public domain? Why do they need to be sentences and not words?

The collection of individual words is also key. That should be an option! And no repetitions…

The text-corpus should be done according to the demands of the project, by an artificial intelligence. People keep putting the same phrases as always, I’m going crazy already. And I’m not going to waste time changing this myself, it’s impossible.

As long as the text-corpus is fed only by people, and as long as there is no option to dictate individual words, without repetitions, I see no point in continuing to strive.

We may not see the utility of adding this in the short term, but it will certainly make a lot of difference in the long run. It will result in a great wealth of vocabulary and dialects.

Please take this into consideration.

I will no longer waste time adding different words in the sentence collector and I will no longer be talking and repeating the same sentences and words. I want to receive new sentences, words I’ve never spoken. That makes me feel useful.

There are already some topics where these points are discussed. The first one was asked by me, for example.

As you will see, these are bad practices: dumping the vocabulary, auto-generating sentences by concatenating words, using a bad AI to generate them, etc.

Also, if you use an AI text generator (such as ChatGPT) you must be sure of the copyright of the output; it MUST be CC0/public domain. GPL etc. is a no-go.

The problem is: if you do not add a quality text corpus, the quality of the dataset will drop, and there is usually no going back. Whenever some text is recorded as audio, it sticks in the dataset and can only be removed with post-processing before training.

That makes me feel useful.

It is OK (and advisable) to write your own sentences looking at a dictionary though.

Please search for more, there are many…

The quality of the text-corpus is already very low by human action, believe me. Many grammatical errors and repeated sentences.

Too manual, too tiring, impossible. The text-corpus should be more complete, it should use more isolated and varied words, automatically. Not only by humans.

In addition, there should be the option for us to speak only what we have never spoken before. I repeat that need.

And I want to understand why I can’t validate the audio clips. Just leaving the account. And sometimes it says the clips are over, even though there is a considerable gap between the uploaded and validated clips.

Nothing makes sense. You need to automate these things and make the project smarter. Demand more developers; I have already done my part and contacted the financiers of the project.

We should not struggle for a problem that is clearly from the project code.

I’m not blaming anyone specifically, but Mozilla.

Writing your own sentences is not the only method, I mentioned it because you seem to be picky about the vocabulary. Some bulk methods include:

  • Books that go to the public domain
  • Your own chats
  • Open a chat room and converse with the community, respecting Sentence Collector rules
  • Use of Wikipedia data through cv-sentence-extractor

Could you please share with us which language we are talking about?

PS: I repeat, I’m also a volunteer like you. And it is not related to coding (only). What you are proposing is a remake of the whole system.

Yes, but we are talking about different things. It remains manual, arduous and inefficient work.

I mentioned this in Book-reading mode (aka "ordered sentences collections") and I still consider this the future of any sentence-collection tool:

Collect the sentences from users speaking up what they are currently reading in their browser (with a dynamic highlighting system part of a browser extension)