Download datasets in Ubuntu VPS through terminal

I want to download the datasets for training an AI model, but I can’t do that on my Ubuntu VPS. Any help would be much appreciated.

Best.

AFAIK, Common Voice does not want you to download the datasets directly through a link in a script or otherwise programmatically. You have to give your email and tick the "You agree to not attempt to determine the identity of speakers in the Common Voice dataset" checkbox, which makes it legally binding. At that point a unique security token is generated; it stays valid for a while and is used by the Google Cloud backend. The download might sometimes disconnect, in which case you have to restart it, or better, use a downloader that can resume from where it was disconnected (like FTP).

Is it not possible for you to download the dataset locally and upload it manually to the VPS?

Thank you for your response. I tried downloading locally and uploading to the VPS, but the connection kept dropping. I will try using a downloader.
So is downloading datasets directly via a link really prohibited by Mozilla?

I’m not sure how it works on the new Google Cloud backend. In the past, some people were able to download into Google Drive using Colab.

I can think of a couple of reasons for this:

  • The legally binding agreement mentioned above
  • Collecting download statistics (for internal purposes and reporting)
  • Preventing abuse (e.g. multiple parallel downloads that hog bandwidth, mirroring of datasets, etc.)

I use Free Download Manager on my local machine and it works; it usually saturates my bandwidth. One problem: most people have asymmetric connections, so uploading is much slower, which is a pain especially for large datasets (mine is 100/20 Mbps).
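To put rough numbers on the asymmetric-connection problem, here is a back-of-the-envelope estimate. It assumes a hypothetical 50 GB dataset (not a figure from this thread) and a 100/20 Mbps line running at full speed the whole time, which real transfers rarely manage.

```python
def transfer_hours(size_gb: float, speed_mbps: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and reconnects."""
    size_megabits = size_gb * 1000 * 8  # decimal gigabytes -> megabits
    return size_megabits / speed_mbps / 3600


DATASET_GB = 50  # hypothetical size, for illustration only

print(f"download at 100 Mbps: {transfer_hours(DATASET_GB, 100):.1f} h")
print(f"upload at 20 Mbps:    {transfer_hours(DATASET_GB, 20):.1f} h")
```

With these assumed numbers the upload leg takes five times as long as the download, so the local-download-then-upload route is dominated by the upload.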

From the link I can see that it is valid for 12 hours. But for medium and large datasets, I have been getting many reconnects ever since the move to Google Cloud. Before, when it was hosted on Amazon, the validity period was much shorter, but the connection was more stable.
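For the reconnect problem, a resuming downloader can be sketched roughly as below, using only the Python standard library. This is a minimal sketch under several assumptions, none of which I have verified for the current backend: that you have already accepted the terms in the browser and copied the generated link, that the link is still within its validity window, and that the server honours HTTP `Range` requests. The function name `download_with_resume`, its parameters, and the URL in the usage comment are made up for illustration. It also does not verify the final size or a checksum, so compare the result against the size shown on the download page.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def download_with_resume(url, dest, max_retries=20, chunk_size=1 << 20,
                         wait_seconds=5, opener=urllib.request.urlopen):
    """Download url to dest, continuing from the bytes already on disk after a drop."""
    for attempt in range(1, max_retries + 1):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url)
        if offset:
            # Ask only for the part we do not have yet.
            request.add_header("Range", f"bytes={offset}-")
        try:
            with opener(request) as response:
                status = getattr(response, "status", 200)
                # 206 means the range was honoured; 200 means the full file is
                # being sent again, so start over instead of appending.
                mode = "ab" if (offset and status == 206) else "wb"
                with open(dest, mode) as out:
                    for chunk in iter(lambda: response.read(chunk_size), b""):
                        out.write(chunk)
            return os.path.getsize(dest)
        except urllib.error.HTTPError as err:
            if err.code == 416:
                # Requested range starts at or past the end: assume file is complete.
                return offset
            raise  # e.g. 403 once the link has expired; retrying will not help
        except (urllib.error.URLError, http.client.HTTPException,
                ConnectionError, TimeoutError) as err:
            print(f"attempt {attempt} dropped ({err}); resuming from byte "
                  f"{os.path.getsize(dest) if os.path.exists(dest) else 0}")
            time.sleep(wait_seconds)
    raise RuntimeError(f"gave up after {max_retries} attempts")


# Usage (placeholder URL, paste the link you got from the browser instead):
# download_with_resume("https://example.invalid/dataset.tar.gz", "dataset.tar.gz")
```

Run on the VPS itself, this avoids the slow upload leg entirely. The `opener` parameter exists only so the retry logic can be exercised without a network connection.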