General Idea
I wholeheartedly agree with the comprehensive points raised. As with anything in life, there are primary and secondary contradictions. Even the most complex issues can be distilled down to one dominant contradiction, with the rest being secondary.
In the context of public voice datasets, the primary contradiction is the shortage of available data, with quality being the secondary issue. Of course, we desire audio recordings of the highest quality, scoring 100 out of 100, but when the available audio data is scarce, we may reluctantly accept audio of just a 60% quality rating. However, with sufficient audio data, we can confidently discard medium and low-quality audio, leaving only the highest quality for training. This is particularly crucial for training text-to-speech, where pristine, noise-free data is essential.
The proposed open-source, offline, beneficial voice tool aims to enable more users to participate effortlessly and even joyfully. While only a small portion of users will benefit from the dataset (since only a fraction train models), the majority can benefit from the tool itself. Just imagine: if only 1,000 users adopted this tool, each contributing merely six minutes of audio, we would obtain 100 hours of raw audio material. Users would contribute anonymously, but they should be asked to provide some demographic information.
Yes, automatic speech recognition (ASR) is not perfect, and user-submitted audio can contain minor errors. However, these can be resolved. The data contributed by users serves as raw audio material, and the locally converted text is just an initial reference. This data should be further validated. Not all audio clips merit detailed validation. Before detailed validation, we should enter a quality assessment stage, eliminating audio with poor recording environments or unclear speech.
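To give a rough idea of what that quality assessment stage could look like, here is a minimal sketch of a pre-filter that rejects clips that are too quiet, heavily clipped, or too noisy. The thresholds and the `clip.wav` path are placeholder assumptions of mine, not values from any existing tool.

```python
# Rough quality gate for submitted clips (illustrative only; thresholds are guesses).
import numpy as np
import soundfile as sf

def passes_quality_gate(path, min_rms=0.01, max_clip_ratio=0.01, min_snr_db=15.0):
    samples, sr = sf.read(path, dtype="float32")
    if samples.ndim > 1:
        samples = samples.mean(axis=1)              # mix down to mono
    rms = np.sqrt(np.mean(samples ** 2))
    if rms < min_rms:                               # too quiet / empty recording
        return False
    if np.mean(np.abs(samples) > 0.99) > max_clip_ratio:
        return False                                # heavily clipped
    frame = sr // 50                                # 20 ms frames
    n = len(samples) // frame
    if n < 10:
        return False                                # too short to judge
    energies = np.square(samples[: n * frame]).reshape(n, frame).mean(axis=1)
    noise = np.percentile(energies, 10) + 1e-10     # quietest frames ≈ noise floor
    speech = np.percentile(energies, 90) + 1e-10
    return 10 * np.log10(speech / noise) >= min_snr_db

print(passes_quality_gate("clip.wav"))
```

Anything a crude filter like this rejects never reaches human validators, which is the whole point of the stage.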
Let’s assume that only 30% of the material is good enough for manual annotation and error correction. That still leaves us with 30 hours of quality data, assuming that only 1,000 people recorded six minutes each. Large corporations offering cloud voice services have millions of users. They can sift through vast amounts of data to find hundreds of thousands of hours of quality audio, keeping it private instead of making it public.
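In code form, the back-of-envelope estimate is just the following (the 30% yield is, of course, only an assumed figure):

```python
# Back-of-envelope estimate using the assumed figures above.
users = 1_000
minutes_per_user = 6
usable_fraction = 0.30                       # assumed share surviving quality screening

raw_hours = users * minutes_per_user / 60    # 100 hours of raw material
usable_hours = raw_hours * usable_fraction   # 30 hours left for detailed validation
print(raw_hours, usable_hours)               # 100.0 30.0
```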
The previous explanation highlights why, despite numerous challenges, we should use such methods to collect data. The principle of “quantitative change leads to qualitative change” underlies this approach: only by amassing a sufficient amount of data can we have the confidence to adopt more aggressive and bold methods to filter out high-quality datasets.
The manufacturing process of SpaceX’s Raptor engine illustrates this idea. Elon Musk stated in an interview that they manufacture one Raptor 2 engine per day and have already burned through over 50 in tests. If their production efficiency were lower, they would have to test more conservatively. High production efficiency is what allows them to carry out thorough testing and rapid iteration.
Available Offline Models
Offering a voice recognition tool that uses an offline inference engine eliminates cloud computing costs; the tool merely needs to be published on GitHub. Currently, the best available open-source voice recognition model is Whisper. However, it demands high computational power and suffers from considerable latency, i.e., a high Real Time Factor (RTF, the ratio of processing time to audio duration, where lower is better).
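For clarity, RTF can be measured for any engine with a few lines; the `transcribe` callable and `clip.wav` below are placeholders for whatever engine and file you test.

```python
# Measure the Real Time Factor (RTF) of an arbitrary transcription function.
import time
import soundfile as sf

def measure_rtf(transcribe, path):
    samples, sr = sf.read(path)
    audio_seconds = len(samples) / sr
    start = time.perf_counter()
    transcribe(path)                      # any engine: Whisper, Paraformer, ...
    return (time.perf_counter() - start) / audio_seconds  # < 1.0 means faster than real time

# Example (hypothetical): measure_rtf(my_whisper_transcribe, "clip.wav")
```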
However, I can suggest a few alternatives. One is https://github.com/Const-me/Whisper, a Whisper implementation that uses Windows DirectX for acceleration; even notebooks with integrated graphics benefit from it. On my notebook, which has a Ryzen 5 4600H processor with integrated graphics (AMD Radeon™ Graphics, 6 cores, 1500 MHz), it runs about 4 times faster than the Python version of Whisper. Using the Large model for English transcription, it achieves an RTF under 1.5. This is only an optional path, but since Whisper delivers the best results, users may choose this model when necessary.
For voice input, Whisper might be overkill. It would be better to choose a non-autoregressive end-to-end voice recognition model. For instance, the Chinese company Alibaba has open-sourced the Paraformer model (https://github.com/alibaba-damo-academy/FunASR). It was trained on 20,000 hours of industrial audio data, drawn from the years of online voice recognition services Alibaba ran before releasing the model, and it performs exceptionally well in both Chinese and English. On a CPU, the Paraformer-large model achieves an RTF of about 0.1 with single-threaded PyTorch fp32 inference, and it can be converted to ONNX format. Accelerated with onnxruntime-gpu, the RTF is astoundingly low.
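As a generic illustration of the onnxruntime-gpu route (this is not FunASR's actual export or inference code; the model path and input names are assumptions), loading an exported model with GPU acceleration and CPU fallback looks roughly like this:

```python
# Load an exported ONNX model with GPU acceleration, falling back to CPU.
import onnxruntime as ort

session = ort.InferenceSession(
    "paraformer-large.onnx",              # hypothetical path to an exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print([inp.name for inp in session.get_inputs()])  # inspect the model's expected inputs
# outputs = session.run(None, {...})      # feed model-specific features here
```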
Currently, there is a next-generation Kaldi project (https://k2-fsa.github.io/sherpa/onnx/index.html) that supports various offline, real-time voice recognition engines, including Paraformer and other open-source models such as Zipformer. I tested the Paraformer-Large model with sherpa-onnx on my own Windows notebook for voice input: it used only around 800 MB of memory and almost no CPU, and achieved an RTF under 0.1.
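Decoding a file with the Paraformer model through sherpa-onnx looks roughly like the sketch below, based on the project's Python examples; the file paths are placeholders, and exact parameter names may differ between sherpa-onnx versions.

```python
# Offline decoding with sherpa-onnx and a Paraformer model (sketch; paths are placeholders).
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    paraformer="paraformer-large.onnx",   # exported Paraformer model
    tokens="tokens.txt",                  # token list shipped with the model
    num_threads=1,
    decoding_method="greedy_search",
)

samples, sample_rate = sf.read("clip.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```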
Thus, I wrote a script tool that uses the Paraformer-Large model via sherpa-onnx for voice input. Holding down the CapsLock key starts recording and recognition, and releasing it types the recognized result onto the screen. The delay after releasing the key is less than half a second, negligible for practical purposes. I use this method to comment my code (as I'm really too lazy to type), and I believe many programmers would be happy to use such a tool. In the process of using it, I've already accumulated thousands of audio segments with high-quality recognition results. However, despite the scarcity of quality open-source audio datasets on the internet, I found no organization willing to accept contributions of audio data in this form. Hence, I wrote this post.
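For anyone who wants to build something similar, the core of such a push-to-talk tool is quite small. The sketch below is a simplified version of the idea, not my actual script: it assumes the third-party `keyboard` and `sounddevice` packages and a `recognize(samples, sample_rate)` function wrapping whatever offline engine you prefer.

```python
# Push-to-talk dictation sketch: hold CapsLock to record, release to type the result.
import numpy as np
import sounddevice as sd
import keyboard                          # third-party 'keyboard' package

SAMPLE_RATE = 16000
chunks = []

def on_audio(indata, frames, time_info, status):
    chunks.append(indata.copy())         # collect audio while the key is held

def recognize(samples, sample_rate):
    raise NotImplementedError            # plug in sherpa-onnx, Whisper, etc. here

while True:
    keyboard.wait("caps lock")           # block until CapsLock is pressed
    chunks.clear()
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=on_audio):
        while keyboard.is_pressed("caps lock"):
            sd.sleep(20)                 # keep recording while the key stays down
    if not chunks:
        continue
    samples = np.concatenate(chunks)[:, 0]
    keyboard.write(recognize(samples, SAMPLE_RATE))  # type the text at the cursor
```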
About Text Format
Regarding the output format, there are many open-source text normalization and inverse text normalization models available, so I don't believe it is a significant concern.
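To illustrate what inverse text normalization means (a toy example only, nothing like the open-source models mentioned above): it converts spoken-form ASR output into written form.

```python
# Toy inverse text normalization: spoken form -> written form (illustrative only).
REPLACEMENTS = {
    "twenty three": "23",
    " percent": "%",
}

def toy_itn(text):
    for spoken, written in REPLACEMENTS.items():
        text = text.replace(spoken, written)
    return text

print(toy_itn("inflation reached twenty three percent"))  # -> "inflation reached 23%"
```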
About Audio Length
You mentioned that Common Voice currently accepts audio clips of up to 10 seconds.
Firstly, as mentioned earlier, with sufficient data there is an abundance of usable material even without segmenting, simply by selecting clips shorter than 10 seconds. Moreover, if high-quality but slightly longer clips are available, segmenting them might also be a good approach.
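If segmentation is wanted, a simple silence-based splitter is usually enough. A minimal sketch with pydub follows; the silence thresholds are arbitrary choices of mine, not anything specified by Common Voice.

```python
# Split a longer recording at silences and keep only pieces of 10 s or less.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("long_clip.wav")
pieces = split_on_silence(
    audio,
    min_silence_len=300,                  # ms of silence that counts as a break
    silence_thresh=audio.dBFS - 16,       # relative threshold; arbitrary choice
    keep_silence=150,                     # keep a little padding around each piece
)
short_pieces = [p for p in pieces if len(p) <= 10_000]   # pydub lengths are in ms
for i, piece in enumerate(short_pieces):
    piece.export(f"segment_{i:04d}.wav", format="wav")
```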
Secondly, the issue you raised gave me some inspiration. When using Common Voice, I found that the page design is quite inefficient in terms of contributing to the ‘speaking’ and ‘listening’ sections.
For the ‘speaking’ part, a short sentence of only a few words is displayed on an entire page; after reading it and submitting, the next sentence is displayed. This is too inefficient. It would be better to display five sentences on one page, each with a recording button next to it that works like a walkie-talkie: pressing the button starts recording, and releasing it stops. This would let users get through the five recordings much more quickly.
For the ‘listening’ part, I have to listen to a segment, check it, click a button, and only then is the next segment displayed, repeating five times. This procedure could be optimized: play the five audio clips back to back, let the user listen to all of them in one pass, and then flag any problematic clips via keystrokes or buttons. This would increase the efficiency of the listening validation process.
Transforming the linear, one-by-one process into a batch processing system with five items at once would significantly improve the efficiency of the current contribution mechanism.