Option of adding voice with its text should be available

alvynabranches · February 25, 2022, 5:38am

There should be an option for a user to add new audio data together with its text, which would then go through validation in the Listen section.

Michal_Jhon · March 8, 2022, 11:05am

Yes, the feature should be available. Especially when I am driving, I can’t always use my eyes to read messages. So I should be able to use the voice feature while driving. My friend, you should use the voice feature while driving. It’s safe, and you can listen to the updated news.

manalog · October 8, 2022, 1:41am

Actually I think it can be a great idea and should be discussed seriously.

Adding this possibility could help increase the size of the dataset(s) by an order of magnitude. I have found three potential advantages, and there are probably more:

Re-use of recorded clips: For example, clips from instant messaging apps (WhatsApp, Telegram…), though the only limit is imagination. This could increase the variability of the model by adding sentences recorded in everyday contexts and natural conversation, which is one of the tenets of the Common Voice guidelines. Right now sentences are mostly extracted from Wikipedia, with the risk of biasing the model toward an academic/scientific lexicon. That can be good, but it leaves the model far from optimized for other useful scenarios. Many sentences contain complicated terms but, most of all, I always have the feeling that I am talking in the same “descriptive”, “teaching” way, and I hear a similar way of talking from other users. This is not the only way humans talk;
Larger projects: A user could decide to open some CC-0 source and start reading, then annotate and split the clips. This is similar to how LibriVox works, but more targeted toward ASR (so the entire book would not be required). On one hand this could help create more material, because it can sometimes be easier to get 1 hour of speech from one person who is enjoying reading something that makes sense, even lying on a bed with a recorder. On the other hand it could again improve the model by including naturally connected sentences which, at least in my language, sound very different from isolated sentences;
Creativity of the user: Someone could have an idea about how to say something in a very natural way, so this person can write it and then say it (or the opposite). One could have a monologue while driving in the car and then annotate it. This third point could even be implemented in the browser by slightly modifying the web app.

The user, who could for example be a “verified” one to avoid trolling (more than N clips recorded/validated), could just upload a tar.gz file (maxdim=n) to the server. This file could contain a simple list.csv with “id” and “text” fields, plus all the clips already encoded as MP3, 32 kHz, 48 kb/s. This way the load on the server would not be high: ffprobe the files to check that they comply, rename them, change the id, add a client_id and put them in the “clips to be listened” database.
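To make the ffprobe step a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how Common Voice does it: the function names and the 5% bit-rate tolerance are my own assumptions, and the target format (MP3, 32 kHz, 48 kb/s) is simply the one proposed above. The check itself works on ffprobe's JSON report, so it can be tried without any audio files.

```python
import json
import subprocess

# Target format proposed in the post above: MP3, 32 kHz, 48 kb/s.
EXPECTED_CODEC = "mp3"
EXPECTED_SAMPLE_RATE = 32000
EXPECTED_BIT_RATE = 48000
BIT_RATE_TOLERANCE = 0.05  # assumption: accept 5% drift in the reported bit rate


def probe_clip(path):
    """Run ffprobe on one file and return its parsed JSON report."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def _as_int(value):
    """ffprobe reports numbers as strings; return 0 if the field is unusable."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def clip_complies(report):
    """Return True if the ffprobe report describes exactly one compliant audio stream."""
    audio = [s for s in report.get("streams", []) if s.get("codec_type") == "audio"]
    if len(audio) != 1:
        return False
    stream = audio[0]
    if stream.get("codec_name") != EXPECTED_CODEC:
        return False
    if _as_int(stream.get("sample_rate")) != EXPECTED_SAMPLE_RATE:
        return False
    drift = abs(_as_int(stream.get("bit_rate")) - EXPECTED_BIT_RATE)
    return drift <= EXPECTED_BIT_RATE * BIT_RATE_TOLERANCE
```

On the server one would call clip_complies(probe_clip(path)) for each uploaded file and reject the archive (or just the clip) when it returns False.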

These clips could be marked differently so that a bad dataset can be easily removed (e.g. automatically flagged if more than N of its clips are reported invalid).
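The "more than N clips reported invalid" rule could look roughly like the sketch below. Everything here is hypothetical: I am assuming each uploaded archive gets a batch id, that invalid reports arrive as (batch_id, clip_id) pairs, and that N is 3 just to have a number.

```python
from collections import Counter

MAX_INVALID_CLIPS = 3  # hypothetical value for the "N" mentioned above


def batches_to_flag(reports, max_invalid=MAX_INVALID_CLIPS):
    """Return the upload batches with more than max_invalid clips reported invalid.

    reports is an iterable of (batch_id, clip_id) pairs, one per "invalid" report.
    A clip that is reported several times is counted only once.
    """
    unique_reports = set(reports)
    invalid_per_batch = Counter(batch_id for batch_id, _clip_id in unique_reports)
    return sorted(b for b, count in invalid_per_batch.items() if count > max_invalid)
```

Counting each clip once keeps a single heavily reported clip from taking down a whole upload on its own.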
One would just need to choose the clip duration, which could be slightly longer than the one currently in production. I am not expert enough to decide, but in the DeepSpeech Playbook I read this:

Ensure that your voice clips are 10-20 seconds in length. If they are longer or shorter than this, your model will be less accurate.
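As a small illustration, a duration filter based on that quoted recommendation could be as simple as the sketch below. The 10 and 20 second bounds come only from the sentence quoted above, not from any Common Voice rule, and I am assuming the duration of each clip is already known in seconds (for example from ffprobe).

```python
MIN_SECONDS = 10.0  # lower bound from the quoted playbook sentence
MAX_SECONDS = 20.0  # upper bound from the quoted playbook sentence


def split_by_duration(durations, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Split a {clip_id: seconds} mapping into (accepted, rejected) lists of clip ids."""
    accepted, rejected = [], []
    for clip_id, seconds in durations.items():
        if min_seconds <= seconds <= max_seconds:
            accepted.append(clip_id)
        else:
            rejected.append(clip_id)
    return accepted, rejected
```

The bounds are parameters on purpose, since the right values are exactly what would need to be discussed.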

What do you think?

Francis_Tyers · October 10, 2022, 1:30pm

It’s a nice idea. Ideas of this kind have been brought up before and are being considered for the roadmap by the Common Voice team. We can’t give any dates, but know that these issues are on our radar.

Hossep_Dolatian · October 12, 2022, 7:08pm

That idea sounds like a stepping stone toward also allowing future field linguists and sociolinguists to use Common Voice as a platform to collect data. Imagine someone got a grant to do fieldwork on an endangered or under-described language somewhere; the person goes, collects data, records people, but has no obvious place to store the recordings beyond their laptop (and didn’t think ahead about making the data open-access). And because the language is under-described, it will likely take years after the initial fieldworker for someone to start collecting a speech corpus large enough for archiving and lang-tech purposes.

PS: Funny you mention WhatsApp and Telegram. I know a lot of people (including myself) who just ask consultants to quickly say a sentence on those apps and send a voice note as part of fieldwork.

manalog · October 12, 2022, 8:50pm

Yes! It could be a pretty easy way to scale up CV a lot. Of course it should be done with enough care not to pollute the dataset, but I don’t think it’s hard to find the right criteria, and all the clips would still be validated by the community, as happens for the others.
I think it would be important to leave the option of specifying accents and other metadata, in order to have high-quality labels on the data. The client_id could be that of the uploader, but the tsv file could then contain voices from different people (e.g. a researcher’s fieldwork, as you said).
From a developer’s point of view it also seems an easy project to carry out. I am not an expert developer; maybe by looking inside the repository I could figure it out, but it would be better if the community liked the idea, so that the same people who work on the system could add this feature. I think it would be easy for them. Basically it is an upload form for an archive that already contains the clips and a tsv; the important thing is then to pass them through a validator that checks length and format, after which they are ready to join the other clips on the server.
But I don’t know; there is a chance that we are underestimating something important on both the safety and the technical side.
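To give an idea of the first part of such a validator, here is a rough sketch that checks the uploaded archive against its manifest before any audio is inspected. The details are assumptions made for illustration: the manifest is the list.csv with “id” and “text” columns suggested earlier in the thread (a tsv would only need a different delimiter), clips are named <id>.mp3, and all files sit at the top level of the tar.gz. None of this reflects the actual Common Voice server code.

```python
import csv
import io
import tarfile

MANIFEST_NAME = "list.csv"        # manifest name suggested earlier in the thread
REQUIRED_FIELDS = {"id", "text"}  # columns suggested earlier in the thread


def check_archive(fileobj):
    """Return a list of human-readable problems found in an uploaded tar.gz."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        names = {m.name for m in archive.getmembers() if m.isfile()}
        if MANIFEST_NAME not in names:
            return [f"missing {MANIFEST_NAME}"]
        raw = archive.extractfile(MANIFEST_NAME).read().decode("utf-8")

    reader = csv.DictReader(io.StringIO(raw))
    missing = REQUIRED_FIELDS - set(reader.fieldnames or [])
    if missing:
        return [f"manifest lacks columns: {sorted(missing)}"]

    problems = []
    listed = set()
    for row in reader:
        clip_name = f"{row['id']}.mp3"  # assumption: clips are named <id>.mp3
        listed.add(clip_name)
        if not (row["text"] or "").strip():
            problems.append(f"{row['id']}: empty text")
        if clip_name not in names:
            problems.append(f"{row['id']}: clip file not in archive")

    # Files that are in the archive but not described by the manifest.
    for extra in sorted(names - listed - {MANIFEST_NAME}):
        problems.append(f"{extra}: not listed in manifest")
    return problems
```

An empty list means the archive is structurally fine and the clips can go on to the format and length checks; anything else can be shown to the uploader as is.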