Can I download my voice data?

Codigo_Logo_Programacao_e_Inteligencia_Artificial · November 22, 2017, 3:52am

I hope we will soon have a good trained model for DeepSpeech publicly available. Until then, I’d like to know whether it’s possible to download my voice data to train an acoustic model for CMU Sphinx. Thanks, and I hope you start work on other languages.

mhenretty · November 22, 2017, 10:27am

The answer to all 3 of your questions is yes:

Public Voice Models: we are working on releasing some very soon!
Download your voice data: we are working on having that soon!
Multi-language support: Coming in early 2018!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · November 24, 2017, 5:12pm

Thanks for your answer; I’m glad to have your reply.

mhenretty · November 27, 2017, 11:31am

Very soon you will be able to download your voice data, along with everyone else’s. Unfortunately, we won’t be able to know which data is yours (for privacy reasons).

StripedMonkey · November 30, 2017, 1:32pm

It’s a bit unfortunate that we won’t be able to have the voices of individuals, even if we don’t know who they are. The ability to tell what multiple people are saying and to identify individual voices is an interesting use case. I totally get the privacy concerns though.

mhenretty · December 4, 2017, 4:00pm

Check out the Tatoeba dataset from our Download page, it has utterances grouped by speaker.

voice.mozilla.org/data

Franck_Dernoncourt · December 5, 2017, 1:33am

we won’t be able to know which data is yours (for privacy reasons).

What’s the privacy concern?

mhenretty · December 5, 2017, 1:44pm

Just general ones, really. We believe our users don’t want to be identified, so we do our best to protect them.

Franck_Dernoncourt · December 7, 2017, 5:24am

How about indicating the speaker ID for each audio file in the public corpus, and privately indicating to each voice contributor their speaker ID?
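A minimal sketch of how a contributor could use such a speaker ID, assuming the public corpus shipped a tab-separated manifest with hypothetical `speaker_id` and `path` columns. Neither the column names nor the sample rows come from the actual dataset; they are placeholders for illustration only.

```python
import csv
import io


def clips_for_speaker(manifest, speaker_id):
    """Return the audio paths whose manifest row carries the given speaker ID."""
    reader = csv.DictReader(manifest, delimiter="\t")
    return [row["path"] for row in reader if row["speaker_id"] == speaker_id]


# Hypothetical manifest contents; a real corpus would provide this as a file.
sample_manifest = (
    "speaker_id\tpath\tsentence\n"
    "a3f9\tclips/0001.mp3\tHello world\n"
    "77b2\tclips/0002.mp3\tGood morning\n"
    "a3f9\tclips/0003.mp3\tSee you soon\n"
)

# "a3f9" stands in for the ID that would be shown privately to the contributor.
my_clips = clips_for_speaker(io.StringIO(sample_manifest), "a3f9")
print(my_clips)
```

With an opaque ID like this, the public corpus would still not reveal who a speaker is, while each contributor could pick out their own recordings.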

mhenretty · December 7, 2017, 2:26pm

@Franck_Dernoncourt what are your needs for that? I.e., tell me about your research, and why the Tatoeba dataset I mentioned above doesn’t solve that for you.

Franck_Dernoncourt · December 7, 2017, 11:38pm

I simply would like to be able to download the audio files I contributed to, in order to:

Train an ASR customized to my voice (I use speech recognition daily for my work)
Train a text-to-speech engine customized to my voice
Analyze which words I mispronounce (and more generally analyze how I speak)
Benchmark off-the-shelf ASR engine performance with my voice
etc.

As a principle, I think it’s good practice to allow users to download the data they generated (e.g., https://en.wikipedia.org/wiki/Google_Data_Liberation_Front).

Th3DEAD · December 28, 2020, 5:50am

Absolutely on point. Would love to be able to retrieve a dataset of my own voice to create my own tts from it.
Unfortunately, it’s hard for one person to build their own database from scratch…
