Sharing Common Voice Through Peer-to-Peer

Hi everyone! I created a torrent for all the datasets currently available on Common Voice. You can download it through this link: magnet:?xt=urn:btih:6318a9e4735b4cdc6c88ccbd9f16e9c1c016ed88&dn=Common+Voice+V2+March+2019.rar

This is something I was asking about the other day, but your magnet link doesn’t seem to be very available. @nukeador Could we create a torrent and spread it on some hosts to help disseminate?

Yep, I’ll do that. I was hoping to get some people seeding from this community.

If you can share it with me directly, I can start seeding.

How do I do that? Like I’m already sharing through p2p.

Direct link to some hosting? Then I can download it completely and quickly and start seeding. Because right now it’s very slow :confused:

Well, I don’t have it. I downloaded it from Common Voice and organized each language in its own folder. I would host it on Mega, but I don’t think you can transfer 35 GB on it. I hope my torrent gets some seeders. We just need a few seeders to start; I mean, the whole point of p2p is how scalable it is.

Hi everyone,

I would like to explain why the Common Voice team prefers that the dataset be available only from the official site, without new unofficial download locations being created.

The main reason is that we want to make sure we have a way to contact everyone who has downloaded the dataset in case someone asks us to remove their voice from the dataset. This is important and we want to respect people’s choices.

If you feel the current download process is not working for you, let’s talk about that and find solutions, but I would like to request we don’t create other places for this download so we also avoid people getting confused about the official one and potentially downloading an outdated or manipulated dataset.

Thanks!

@nukeador Got it, I will stop my torrent. But I have a different opinion: what’s the point of the dataset being CC-0 if you can’t share it? With regard to people wanting their voices removed, how do you know which clips belong to whom?

I always prefer to download through p2p. I think Common Voice should also have this option, as many open source programs and Linux distributions do.

Having CC-0 is especially important so that many different commercial and non-commercial entities can use the dataset. In terms of sharing, we prefer to always point to the official site for the reasons I listed; we think people will understand.

The site knows which speaker ID belongs to which user, but this information is not exposed in the dataset for privacy reasons.

What’s the problem you are currently experiencing when downloading the dataset? Is it speed? Other? We can look into it.

Thanks for clarifying. One thing that was odd is the tar files: you have to extract twice. Why not use .zip or .gzip?

Do you mean the tar.gz file? You should be able to extract it in one step with 7-Zip, the command line, or any other archiver that supports this format.
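For anyone who would rather script this, here is a minimal Python sketch of single-step extraction using the standard library `tarfile` module. The function name `extract_tar_gz` and the example paths are assumptions for illustration only; they are not part of the Common Voice download.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Decompress and unpack a .tar.gz archive in a single pass.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" handles the gzip layer and the tar layer together,
    # so there is no intermediate .tar file to extract a second time.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    return dest
```

Usage would look something like `extract_tar_gz("en.tar.gz", "common_voice/en")`, where both paths are placeholders. Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.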

@nukeador I think all the datasets should be put into one file, so that when I extract it I get a good workspace like this:

And maybe split them into subfolders. Having half a million files in a single folder makes it harder to play some of the files.
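Until something like that exists in the official download, a rough local workaround is to bucket the files yourself after extracting. The sketch below is only an illustration: the function name `split_into_subfolders`, the `part_0000` naming scheme, and the default of 10,000 files per folder are all assumptions, not anything the dataset defines.

```python
import shutil
from pathlib import Path


def split_into_subfolders(source_dir, files_per_folder=10000):
    """Move the files in source_dir into numbered subfolders.

    Hypothetical helper: folder names and batch size are illustrative.
    Returns the number of files moved.
    """
    source = Path(source_dir)
    # Snapshot the file list first so the new subfolders are not revisited.
    files = sorted(p for p in source.iterdir() if p.is_file())
    for index, path in enumerate(files):
        bucket = source / f"part_{index // files_per_folder:04d}"
        bucket.mkdir(exist_ok=True)
        shutil.move(str(path), str(bucket / path.name))
    return len(files)
```

Keep in mind that if a metadata file refers to the clips by path, those references would need to be updated after the move.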

Understood. Pinging @gregor and @kdavis here for their feedback on how to improve this.

I think having both options available, all in one zip and each language in a separate zip, seems reasonable. Some people want to work with all languages, and some with only a single language.
