Common Voice's way of collecting data is outdated; it's time to let users contribute effortlessly

I am writing to propose a novel idea to improve the efficiency of data collection for the Common Voice project. While the current system - which invites users to actively participate in recording and validating text - has been functional, I believe there’s room for enhancement, especially considering today’s technological advancements.

Under the existing model, users are required to set aside dedicated time to contribute to the project. This commitment can be burdensome, and because the model relies entirely on users' initiative, it limits the volume and pace of data collection and, ultimately, the scale of data we can gather.

Here's my proposition: what if we created an online or offline speech recognition / voice input tool, powered by popular speech recognition or generation models? Contributors could use this tool for their regular tasks such as generating subtitles, voice input, or dubbing. Every interaction with the tool would record both the audio file and its corresponding text locally, and if the user agrees, these local records could be uploaded to Common Voice with a single click.
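
To make the opt-in step concrete, here is a minimal sketch of what the upload-on-consent part could look like, assuming the tool keeps a simple local log of clips and transcripts. The endpoint URL, field names, and file layout are purely hypothetical; Common Voice has no such bulk-upload API today.

```python
# Hypothetical opt-in upload step: nothing leaves the machine unless the
# user explicitly agrees. Assumes the voice tool saved clips as WAV files
# plus a tab-separated transcripts.tsv mapping clip IDs to text.
import csv
from pathlib import Path
import requests

CLIP_DIR = Path("clips")
UPLOAD_URL = "https://example.org/hypothetical-bulk-upload"  # placeholder endpoint

def pending_pairs():
    with open(CLIP_DIR / "transcripts.tsv", encoding="utf-8") as f:
        for clip_id, text in csv.reader(f, delimiter="\t"):
            yield CLIP_DIR / f"{clip_id}.wav", text

def contribute():
    pairs = list(pending_pairs())
    answer = input(f"Upload {len(pairs)} locally recorded clips? [y/N] ")
    if answer.lower() != "y":
        return                                   # user declined: keep everything local
    for wav_path, text in pairs:
        with open(wav_path, "rb") as audio:
            requests.post(UPLOAD_URL,
                          files={"audio": audio},
                          data={"sentence": text})

if __name__ == "__main__":
    contribute()
```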

This way, users can simultaneously benefit from the tool provided by Common Voice, and contribute to the project, all without needing to allocate extra time solely for data collection. Users would generate the data organically in their day-to-day activities and have the control to decide whether to contribute it.

As for validating the data, we could incorporate a system similar to CAPTCHA, but using unverified voice data instead of images. Websites could adopt this voice-CAPTCHA for human verification, which would further enhance our data collection.
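
Just to illustrate the mechanics (all names and thresholds here are made up, and nothing like this exists today): a voice-CAPTCHA could pair one clip with a known transcript, used to verify the human, with one unverified clip whose typed answer is stored as a candidate validation.

```python
# Illustrative voice-CAPTCHA check: the answer for a clip with a known
# transcript verifies the human; the answer for an unverified clip is kept
# as a candidate validation. Thresholds and storage are placeholders.
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def matches(answer: str, reference: str, threshold: float = 0.85) -> bool:
    """Fuzzy match so small typos don't fail a genuine human."""
    return SequenceMatcher(None, normalize(answer), normalize(reference)).ratio() >= threshold

def check_captcha(known_answer: str, known_reference: str,
                  unknown_answer: str, unknown_clip_id: str,
                  candidate_store: dict) -> bool:
    if not matches(known_answer, known_reference):
        return False                                   # failed the known clip: reject
    # Human verified; record their transcription of the unverified clip.
    candidate_store.setdefault(unknown_clip_id, []).append(normalize(unknown_answer))
    return True
```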

This approach is not unprecedented; numerous private companies provide commercial voice-to-text services while collecting user data for training. However, they treat this collected voice data as a private asset and hence do not contribute it to the open-source community. The model I propose can make a difference here, allowing Common Voice to collect data efficiently and make it openly available for the broader benefit.

I am looking forward to hearing your thoughts on this proposal.


I'm not good at writing in English, so I used ChatGPT to summarize my idea and generate the content above, and it expressed my idea clearly.

1 Like

Some major problems with this idea (i.e. using ASR to input text & voice corpora) are:

  • ASR systems are not fool-proof: even the best English models have 3-5% WER, which increases considerably with dialects/variants, up to 30%. For non-Western languages, i.e. low-resourced ones, it is much worse. To remedy this, one would have to go over the transcriptions and correct them before posting, which would be more cumbersome.
  • Inference engines for state-of-the-art models are computationally costly and mostly require GPU backends, very high RAM, etc. For a community-driven project with thousands of people recording, that would require an AI supercomputer. According to reports, ChatGPT costs about $700k per day to run.
  • Text corpora should be license-free. With the proposed system you cannot guarantee that the volunteer/user is donating his/her own sentences; he/she could be reading from a licensed book or newspaper. There are also language-specific rules in place in CV.
  • Spontaneous speech mostly does not conform to grammar rules and may result in incomplete sentences, in addition to lack of punctuation etc.
  • It is best to have the same sentence recorded by different people (gender, age, dialect, etc.) to get better models, so the text should go into the CV text corpora anyway.
  • Existing models differ in terms of tokenization and normalization. For example, OpenAI/Whisper models output numbers as digits, but CV expects them spelled out as text, and the same goes for acronyms etc. (see the sketch after this list).
  • Recording durations, and therefore sentence lengths, should be limited. State-of-the-art systems usually limit them to 30 secs, and Common Voice has a 10s limit. Although there are some methods to split recordings into chunks, they are again not fool-proof and might result in a bad dataset.
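
To illustrate the normalization point, here is a minimal sketch of converting digits in an ASR hypothesis into spelled-out words, assuming the third-party num2words package. Real CV sentence rules cover much more than numbers (acronyms, abbreviations, punctuation), so this is illustrative only.

```python
# Rough digits-to-words normalization of an ASR hypothesis, using num2words.
import re
from num2words import num2words

def spell_out_numbers(sentence: str, lang: str = "en") -> str:
    """Replace standalone digit groups with their spelled-out form."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), sentence)

print(spell_out_numbers("The meeting starts at 9 with 42 attendees."))
# -> "The meeting starts at nine with forty-two attendees."
```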

Apart from these points, I have also thought about this kind of idea myself.

Suppose we have a voice-driven and moderated chat room implemented as a plugin in a common platform. People directly converse with their voices and the output is text, which they should correct before posting. These texts will be input into the Text Corpora (this can also include domain-specific topics).

As an example for the first point, I can share my experiments with Whisper.

  • I selected the intersection of the 99 Whisper languages and the 108 CV languages, which resulted in 58 languages.
  • I selected the 100 longest sentences from each language in CV v13.0 and normalized them.
  • I fed them to all multilingual Whisper models and calculated the required measures, along with the Real Time Factor (a rough sketch of this measurement follows the list).
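
For reference, here is a minimal sketch of how such a WER/RTF measurement can be done, assuming the openai-whisper and jiwer packages and a prepared list of (audio_path, reference_text) pairs. The actual experiment covered all multilingual model sizes and per-language normalization, which this leaves out.

```python
# Sketch of measuring WER and Real Time Factor for one Whisper model.
import time
import jiwer
import whisper

def evaluate(pairs, model_name="tiny", language=None):
    model = whisper.load_model(model_name)
    refs, hyps = [], []
    total_audio, total_elapsed = 0.0, 0.0
    for audio_path, reference in pairs:
        audio = whisper.load_audio(audio_path)        # 16 kHz mono float32
        total_audio += len(audio) / 16000.0
        start = time.perf_counter()
        result = model.transcribe(audio, language=language)
        total_elapsed += time.perf_counter() - start
        refs.append(reference)
        hyps.append(result["text"])
    return jiwer.wer(refs, hyps), total_elapsed / total_audio  # (WER, RTF)
```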

Except for the most prominent Western languages, the results are not very good. I'm currently working on fine-tuning the multilingual models for selected languages, which seems to drop WER considerably in some cases (e.g. from 50% to 30% after 24 hours of fine-tuning), but that is still not enough for the application you proposed. I'm focusing on the smaller models (namely tiny), which can also run in browsers and might solve the second point mentioned above (computational cost).

Here are the results for the “tiny” model, sorted by average WER as a reference…

2 Likes

General Idea

I wholeheartedly agree with the comprehensive points raised. As with anything in life, there are primary and secondary contradictions. Even the most complex issues can be distilled down to one dominant contradiction, with the rest being secondary.

In the context of public voice datasets, the primary contradiction is the shortage of available data, with quality being the secondary issue. Of course, we desire audio recordings of the highest quality, scoring 100 out of 100, but when the available audio data is scarce, we may reluctantly accept audio of just a 60% quality rating. However, with sufficient audio data, we can confidently discard medium and low-quality audio, leaving only the highest quality for training. This is particularly crucial for training text-to-speech, where pristine, noise-free data is essential.

The proposed open-source, offline, beneficial voice tool aims to enable more users to participate effortlessly and even joyfully. While only a small portion of users will benefit from the dataset (as only a fraction train models), the majority can benefit from the tool itself. Just imagine: if only 10,000 users were to use this tool, contributing merely six minutes of audio each, we would obtain 1,000 hours of raw audio material (10,000 × 6 minutes = 60,000 minutes = 1,000 hours). Users would contribute anonymously, but they should be asked to provide some demographic information.

Yes, automatic speech recognition (ASR) is not perfect, and user-submitted audio can contain minor errors. However, these can be resolved. The data contributed by users serves as raw audio material, and the locally converted text is just an initial reference; the data should be validated further. Not every audio clip merits detailed validation: before that stage, we should run a quality assessment that eliminates audio with poor recording environments or unclear speech.
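
As a crude sketch of such a quality-assessment stage: estimate a signal-to-noise ratio from frame energies and drop clips that fall below a threshold. The method and threshold here are illustrative assumptions rather than a validated filter; real screening would also look at clipping, duration, and so on.

```python
# Crude SNR-based pre-filter for recorded clips.
import numpy as np
import soundfile as sf

def estimate_snr_db(path: str, frame_ms: int = 30) -> float:
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                 # mix down to mono
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    rms = np.sqrt(np.mean(audio[: n * frame].reshape(n, frame) ** 2, axis=1) + 1e-12)
    noise = np.percentile(rms, 10)                 # quietest frames ~ noise floor
    signal = np.percentile(rms, 90)                # loudest frames ~ speech
    return 20 * np.log10(signal / noise)

def passes_quality_check(path: str, min_snr_db: float = 15.0) -> bool:
    return estimate_snr_db(path) >= min_snr_db
```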

Let's assume that only 30% of the material is good enough for manual annotation and error correction. That still leaves us with 300 hours of quality data, and that is assuming only 10,000 people recorded six minutes each. Large corporations offering voice cloud services have millions of users; they can sift through vast amounts of data to find hundreds of thousands of hours of quality audio, keeping it private instead of making it public.

The previous explanation highlights why, despite numerous challenges, we should use such methods to collect data. The principle of “quantitative change leads to qualitative change” underlies this approach: only by amassing a sufficient amount of data can we have the confidence to adopt more aggressive and bold methods to filter out high-quality datasets.

The manufacturing process of SpaceX’s Raptor engine can illustrate this idea. Elon Musk stated in an interview that they manufacture one Raptor2 engine per day and have already burned through over 50 in tests. If their production efficiency was lower, they would conduct tests more conservatively. However, high production efficiency allows them to carry out thorough testing and rapid iterations.

Available Offline Models

Offering a voice recognition tool that uses an offline inference engine eliminates cloud computing costs; the tool merely needs to be published on GitHub. Currently, the best available open-source speech recognition model is Whisper. However, it demands high computational power and suffers from considerable latency, resulting in a high Real Time Factor (RTF).

However, I can suggest a few alternatives. One is https://github.com/Const-me/Whisper, a version of Whisper that uses Windows DirectX for acceleration. Even notebooks with integrated graphics can benefit from it. On my notebook, which has an R5-4600H processor with integrated graphics (AMD Radeon™ Graphics, 6 cores, 1500 MHz), it runs about 4 times faster than the Python version of Whisper. Using the Large model for English transcription, an RTF of under 1.5 can be achieved. This is only an optional solution, but since Whisper delivers the best results, users may choose this model when necessary.

For voice input, using Whisper might be overkill. Instead, it would be better to choose a non-autoregressive end-to-end speech recognition model. For instance, the Chinese company Alibaba publicly released the Paraformer model (https://github.com/alibaba-damo-academy/FunASR). It was trained on 20,000 hours of industrial audio data (drawn from the years of online speech recognition services Alibaba provided before releasing the model) and performs exceptionally well in both Chinese and English recognition. On a CPU, the Paraformer-large model can achieve an RTF of 0.1 with single-threaded torch fp32, and it can be converted to ONNX format. If accelerated with onnxruntime-gpu, the RTF is astoundingly low.

Currently, there is a project called next-generation Kaldi (https://k2-fsa.github.io/sherpa/onnx/index.html) that supports various offline, real-time speech recognition engines, including Paraformer and other open-source models such as Zipformer. I tested the Paraformer-Large model with sherpa-onnx on my own Windows notebook as a voice input; it used only around 800 MB of memory and almost no CPU, and achieved an RTF of under 0.1.
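
For the curious, here is a rough sketch of decoding a file with sherpa-onnx's offline Paraformer recognizer and measuring the RTF. The parameter and method names follow the sherpa-onnx Python examples as I remember them and may differ between versions, so treat this as an assumption rather than exact API; the model paths are placeholders.

```python
# Sketch: offline Paraformer decoding with sherpa-onnx plus an RTF measurement.
import time
import soundfile as sf
import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    paraformer="paraformer-large/model.onnx",   # hypothetical local paths
    tokens="paraformer-large/tokens.txt",
    num_threads=1,
)

samples, sample_rate = sf.read("test.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)

start = time.perf_counter()
recognizer.decode_stream(stream)
elapsed = time.perf_counter() - start

print("text:", stream.result.text)
print("RTF :", elapsed / (len(samples) / sample_rate))
```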

Thus, I wrote a script tool that uses the Paraformer-Large model through sherpa-onnx as a voice input. Holding down the CapsLock key starts recording and recognition, and releasing it writes the recognized result to the screen. The delay after releasing the key is less than half a second, which is negligible for practical purposes. I use this to write comments in my code (as I'm really too lazy to type), and I believe many programmers would be happy to use such a tool. While using it, I have already accumulated thousands of audio segments with high-quality recognition results. However, despite the scarcity of quality open-source audio datasets on the internet, I found no organization willing to accept contributions of audio data in this form. Hence, I wrote this post.
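
For anyone who wants to build something similar, here is a minimal sketch of the push-to-talk record-and-log part, assuming the third-party sounddevice, soundfile, and keyboard packages. The transcribe() function is a placeholder for whatever offline recognizer you prefer, and this sketch only saves the (audio, text) pairs locally rather than typing the text to the screen.

```python
# Push-to-talk sketch: hold CapsLock to record, release to transcribe and
# log the (audio, text) pair locally for a possible later donation.
import os
import time
import uuid
import numpy as np
import sounddevice as sd
import soundfile as sf
import keyboard  # may require admin/root rights depending on the platform

SAMPLE_RATE = 16000

def transcribe(audio: np.ndarray) -> str:
    """Placeholder: plug in any offline recognizer here (e.g. a Paraformer model)."""
    raise NotImplementedError

def record_while_held(key: str = "caps lock") -> np.ndarray:
    chunks = []
    def callback(indata, frames, time_info, status):
        chunks.append(indata.copy())
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=callback):
        while keyboard.is_pressed(key):
            time.sleep(0.01)
    return np.concatenate(chunks)[:, 0] if chunks else np.zeros(0, dtype="float32")

def main():
    os.makedirs("clips", exist_ok=True)
    while True:
        keyboard.wait("caps lock")               # block until the key goes down
        audio = record_while_held()
        if len(audio) < SAMPLE_RATE // 2:        # ignore very short presses
            continue
        text = transcribe(audio)
        clip_id = uuid.uuid4().hex
        sf.write(f"clips/{clip_id}.wav", audio, SAMPLE_RATE)
        with open("clips/transcripts.tsv", "a", encoding="utf-8") as f:
            f.write(f"{clip_id}\t{text}\n")      # the pair stays local until the user opts in

if __name__ == "__main__":
    main()
```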

About Text Format

Regarding the output format issue, there are many open-source text normalization and inverse normalization tools available, so I believe it isn't a significant concern.

About Audio Length

You mentioned that Common Voice currently accepts audio lengths of up to 10 seconds.

Firstly, as mentioned earlier, with sufficient data there is an abundance of usable material even without segmenting, simply by selecting audio clips that are under 10 seconds. Moreover, if high-quality but slightly longer clips are available, segmenting them might also be a good approach.
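
As an illustration of the segmenting option, here is a sketch that splits a longer clip on silences so the pieces fit under a 10-second limit, using the third-party pydub package (which requires ffmpeg). The silence thresholds are assumptions that would need tuning per recording setup.

```python
# Split a longer recording on silences into chunks no longer than 10 seconds.
from pydub import AudioSegment
from pydub.silence import split_on_silence

MAX_MS = 10_000  # Common Voice's 10-second limit

def split_clip(path: str):
    audio = AudioSegment.from_file(path)
    chunks = split_on_silence(audio,
                              min_silence_len=300,            # ms of silence to split on
                              silence_thresh=audio.dBFS - 16, # relative loudness threshold
                              keep_silence=150)
    return [c for c in chunks if len(c) <= MAX_MS]            # drop pieces still too long

for i, chunk in enumerate(split_clip("long_recording.wav")):
    chunk.export(f"chunk_{i:03d}.wav", format="wav")
```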

Secondly, the issue you raised gave me some inspiration. When using Common Voice, I found that the page design is quite inefficient in terms of contributing to the ‘speaking’ and ‘listening’ sections.

For the ‘speaking’ part, a short sentence with only a few words is displayed on an entire page. After reading this sentence and submitting the recording, we proceed to the next sentence. This is too inefficient. It would be better to display five sentences on one page, with a recording button next to each sentence, much like a walkie-talkie: pressing the button starts recording, and releasing it stops. This would let users complete the five recordings much more quickly.
For the ‘listening’ part, I need to listen to a segment, check it, click a button, and only then is the next segment displayed, and this repeats five times. I believe this procedure could be optimized: by playing all five audio clips in a row, the user can listen to all five segments and then flag any problematic clips via keystrokes or buttons. This would increase the efficiency of the listening validation process.
Transforming the linear, one-by-one process into a batch processing system with five items at once would significantly improve the efficiency of the current contribution mechanism.

1 Like

You gave some nice pointers, I’ll check them out. Thanks…

About the UX: Yeah, I also hate mobile-first designs (the reason behind the one-by-one recording). I use my phone as a phone and I'm a desktop man. But the statistics say otherwise: more than 65% of people use mobile phones as their primary device, and that share keeps increasing. Mobile also gives people the opportunity to record in different environments and on the go, wherever they have time. So I'm fine with it and keep sitting at my end of the spectrum.

As it is, using CV through the browser is fine for my language community. I'd hate to see them trying to install something on their computers; some are 80+ with minimal technical knowledge.

Although I stated some counter-points above, the underlying problem you mention is a real one: datasets really do not grow fast enough.

Thanks so much for getting in touch!

While ASR and auto-transcription aren't currently a good fit for Common Voice, both for technical reasons (they rarely work as well as we would need them to for data capture) and for bias reasons (they don't [yet!] work at all for some of the languages we're collecting voice data for), and while creating a wide-reaching CAPTCHA competitor is out of scope for our current team and contributors, these are such exciting ideas!

I would be really excited to see what you build in the future with ideas like these!

1 Like