Some data sets are too large

Is it possible to divide the data sets into smaller data sets and download them separately?


Currently the only process we have is the full dataset publication, but we are working on improving it this year so that we can offer more frequent or continuous access to the updated dataset.

Hi, what’s your main issue with the file size?

Are you having problems downloading the data sets via your ISP, or is it a storage issue? Is it that the file won’t resume, so you end up downloading over and over again?

Would a torrent option help you?


Thanks for your reply. A torrent option is a great idea; it would make downloads more stable. The data sets are already classified by language and downloaded through different links, so could a data set for one language also be divided into smaller data sets and downloaded through separate links? Both download speed and storage can be limitations.

If you use a download manager you can restart the download whenever it fails without losing data. You can also pause and resume the download this way.

Using a download manager is a little complicated in this case, though. You have to enter your e-mail address to start the download, then copy the URL of the file into the download manager and stop the download in the browser.
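The resume trick above relies on HTTP byte-range requests. Here is a minimal sketch of the header a download manager sends to pick up where it left off; the function name and the demo file are illustrative, and it assumes the server supports range requests (`Accept-Ranges: bytes`), which the Common Voice servers may or may not:

```python
import os
import tempfile

def resume_headers(partial_path: str) -> dict:
    """Build the HTTP Range header a download manager would send to
    resume a partial download (assumes the server supports byte ranges)."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # Ask the server for everything from the first missing byte onward.
    return {"Range": f"bytes={done}-"} if done else {}

# Demo: simulate a download that was interrupted after 10 bytes.
partial = os.path.join(tempfile.mkdtemp(), "cv-corpus.tar.gz")
with open(partial, "wb") as f:
    f.write(b"0123456789")

headers = resume_headers(partial)          # resume from byte 10
fresh = resume_headers(partial + ".none")  # no partial file: start from scratch
```

Passing `headers` to any HTTP client makes the server send only the missing tail of the file instead of the whole archive.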

Problem: Some data sets are too big.

The solution: a browsable directory structure accessible via the web.

One way to implement it is by building a trie data structure as a tree of nested directories and then sharing it via IPFS.

And that is exactly what I have done.

A sentence can be found by following the path that spells it out, one character per directory level. The data related to a sentence is stored in a directory named ‘€’ inside the last directory of the path.

Here is a link to the first version of such a directory structure.
https://ipfs.io/ipfs/QmcA1QMu7UXcBJTAPE1b44fhYv9CWtr75KAhoachq4ajhC
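The trie layout described above can be sketched in a few lines. This is only an illustration, not the script used to build the linked archive: the function name `add_sentence`, the file name `data.txt`, and the sample sentence are all made up here; the one fixed convention taken from the post is the ‘€’ leaf directory.

```python
import os
import tempfile

def add_sentence(root: str, sentence: str, data: str) -> str:
    """Store `data` under root/<c1>/<c2>/.../<cn>/€/ so that walking
    the directory tree spells out the sentence character by character."""
    leaf = os.path.join(root, *sentence, "€")  # one directory per character
    os.makedirs(leaf, exist_ok=True)
    with open(os.path.join(leaf, "data.txt"), "w", encoding="utf-8") as f:
        f.write(data)
    return leaf

root = tempfile.mkdtemp()
leaf = add_sentence(root, "hej", "clip metadata would go here")
relative = os.path.relpath(leaf, root)  # spells h/e/j/€
```

Because sentences sharing a prefix share directories, a browser (or an IPFS gateway) can expose the whole corpus as a navigable tree, and anyone can pin just one subtree.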

Using IPFS makes it possible for participants to back up and use the whole database, or only part of it.

Participants can also help guarantee that the database doesn’t disappear from the internet.
(I have myself participated in three projects that later removed their websites.)

This is the first version, so we can still brainstorm about the next one.
But I hope I have given you a little glimpse of what you can do with IPFS.


IPFS looks like an interesting technology.

Again: why is this a problem for you exactly? For which use case do you need a small dataset? Most people here consider the dataset too small, the goal is 10 000 hours and no language has reached this goal yet.

Use case: I want to learn a language, and I want to hear what it sounds like.

Yes, I know it was made as a database for machine learning, but I think it could be used for many other things. Similarly, Wikipedia was made as an encyclopedia, but people have used it for machine learning projects, testing text compression, etc.


Thanks for your interest in making our dataset more accessible.

But I would like to explain why the Common Voice team prefers the dataset to be available only from the official site, without creating new unofficial places.

The main reason is that we want to make sure we have a way to contact everyone who has downloaded the dataset in case someone requests that we remove their voice from it. This is important, and we want to respect people’s choices.

I totally understand that the current download process is not working for you, and I want to note that our roadmap for this year includes enabling more granular access.

I would like to request that we don’t create other places for this download, so we also avoid people getting confused about which one is official and potentially downloading an outdated or manipulated dataset.

Thanks!

Thank you for replying. Please allow me to ask a few questions to understand better.

How many people, to date, have asked to get their voices removed?

I understand this problem; many other projects have a public key that they use to sign the downloadable data to achieve this goal.

IPFS has a name system called IPNS that does exactly that. It gives you a link to an archive, file, or directory that is cryptographically signed, so you can be sure it is official and not manipulated.

IPNS links can also be updated, so that if the data is updated the IPNS link will point to the new data. When some voices are deleted, you just update the IPNS link, and the old voices will die off as they are garbage-collected on participants’ systems.
I find this a little more practical, since it is automatic, than having to send out emails asking people to delete data manually.
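The update-the-pointer idea can be mimicked with a toy model: immutable content-addressed blocks on one side, a mutable name on the other. This is only a sketch of the semantics; real IPFS derives CIDs differently and signs IPNS records with the publisher's key, and the class and names below are invented for illustration:

```python
import hashlib

class ToyIPFS:
    """Toy model of IPFS content addressing plus an IPNS-style
    mutable pointer; mimics update semantics only, not the real protocol."""

    def __init__(self):
        self.blocks = {}  # immutable side: content hash -> content
        self.names = {}   # mutable side: IPNS-like name -> latest hash

    def add(self, content: bytes) -> str:
        cid = hashlib.sha256(content).hexdigest()  # address = hash of content
        self.blocks[cid] = content
        return cid

    def publish(self, name: str, cid: str) -> None:
        # Real IPNS signs this record; here we just repoint the name.
        self.names[name] = cid

    def resolve(self, name: str) -> bytes:
        return self.blocks[self.names[name]]

node = ToyIPFS()
node.publish("dataset", node.add(b"release with all voices"))
# A removal request comes in: publish a new release under the same name.
node.publish("dataset", node.add(b"release without removed voices"))
latest = node.resolve("dataset")
```

Everyone following the name automatically sees the new release; the old blocks become unreferenced and are eventually garbage-collected by peers.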

Thanks for the insights, this is good information as we evaluate how to improve the process. Right now we don’t have the bandwidth to officially host the dataset on IPFS.

Thanks for your understanding.

Total profile deletion requests: 110
Out of those, people who want clips deleted: 66


Actually, one of the good things about IPFS is that you don’t need much bandwidth, since everyone who pins the data set on an IPFS client basically donates bandwidth for distributing it. It is a peer-to-peer system that works similarly to BitTorrent.

Sorry, by “bandwidth” I mean staff time to officially analyze, plan, and implement this solution at this point.

Having said that, we will take this into consideration once we evaluate improvements on our dataset publication. Thanks for the information.