Failed to download commonvoice dataset on Linux

Dear all,

I am having trouble downloading datasets using wget on Linux. The command and error message are as follows:

I will paste my command and error message in the comment section, because the website will not let me post them here: they contain too many links.

It looks like something is wrong with the AWS signature. Does anyone have an idea how to fix this?

I also tried downloading the datasets with a download tool on Windows, but the download link seems to expire if the download is paused for a while. Last night I paused the download when I left the office, and the next day it failed to resume when I came back to work.

The datasets are very large, so I would really like to download them directly in the Linux environment on the server. I would also like to know how to avoid such failures in the future. Thank you very much.

My command.

(base) kai@sgpu:/data/german$ wget -c https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-de.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQ3GQRTO3BGARVCXX%2F20220601%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220601T015102Z&X-Amz-Expires=43200&X-Amz-Security-Token=FwoGZXIvYXdzEFMaDIJKmlQonf2p8sIGsSKSBGE1oclpWMiOoOsa1hWF0iHBYsSj0TSmYQ38jnCuxVQM8xjBptAeqkWXYj5tcbh7FnXrMSpK99xY3DsVlZF4z%2Fclw975AqVo4eKDSs6yQjLj83lwqJcADlHPnnWNnH3MEuw5mBGEnKoDXq%2Bh6fPc2qI3mhIqkHsC6cc7nw7AhrTkl8N5%2FwFgg7yo%2FnDGcMvkYYJ5ZhpiyWEpxTPzH%2BVFQOgD03%2FEwpDAWKRKCtukSxbHewRNEyZov65CKYgqAIlSY73THuZU2%2FkFGlloqn1So0PT8QJ%2Bw2koEVHJVp95CM63VNj3xKI7BMgndCelDi8ab2uCe%2FsZpLPjAiEtMkGbnWI8UQqmtE1iyAr8f%2FzBacIdeklqygWwJMGNhuqslyezP2m%2BhoYSwZASiHXuLOEq5QZlpspTSq593VtmvMR8GIq2BFkPSEGXrhkrxHQG5VXtTc77g9nBZtSli0loRX6S5qnnQ1btCYI%2BoNJnI9MYApTJOgzC9FYxLMa97V%2BOiXwaF3z56h3r%2FlcGLd6RIuyOcTBf0qk7%2BIkbDT7QSqlUGwk34WgkGO6KLMdH%2BrWFRMSYdKOmpGTV2KBlusW1gsXuRouzMaui10lMDoWbi%2B7K%2B2%2FHKedLQS8ynBK7CrEe1stmjEKiCWMMC%2F48l6krSOa7xEdONgub%2B2U%2BZhn1ayTrqkB%2BwuZYNXeMKPGPj6zKZwB8Xh8nKLH92pQGMirTXh2Hx9Iwxq3x0kjRh7kIsXmKWwWAMthuxIcaiH%2FPixKJQ4meJ98AEW4%3D&X-Amz-Signature=c34c71472c16865d3ce7bdd3244cf20b375f1f5434bf350d3d71afb7ed4ffadb&X-Amz-SignedHeaders=host

I started the download after typing my email address on the website, cancelled it, and then pasted the download link into the terminal. It should be a valid link, because it connected and downloaded properly when I pasted it into other download tools on Windows. I don’t understand why it failed with wget on Linux.

The error message.

`[1] 108755
[2] 108756
[3] 108757
[4] 108758
[5] 108759
[6] 108761
--2022-06-01 10:10:22-- https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-de.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256
Resolving mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)… X-Amz-Expires=43200: command not found
X-Amz-SignedHeaders=host: command not found
X-Amz-Date=20220601T015102Z: command not found
[3] Exit 127 X-Amz-Date=20220601T015102Z
[4] Exit 127 X-Amz-Expires=43200

X-Amz-Credential=ASIAQ3GQRTO3BGARVCXX%2F20220601%2Fus-west-2%2Fs3%2Faws4_request: command not found
X-Amz-Signature=c34c71472c16865d3ce7bdd3244cf20b375f1f5434bf350d3d71afb7ed4ffadb: command not found
X-Amz-Security-Token=FwoGZXIvYXdzEFMaDIJKmlQonf2p8sIGsSKSBGE1oclpWMiOoOsa1hWF0iHBYsSj0TSmYQ38jnCuxVQM8xjBptAeqkWXYj5tcbh7FnXrMSpK99xY3DsVlZF4z%2Fclw975AqVo4eKDSs6yQjLj83lwqJcADlHPnnWNnH3MEuw5mBGEnKoDXq%2Bh6fPc2qI3mhIqkHsC6cc7nw7AhrTkl8N5%2FwFgg7yo%2FnDGcMvkYYJ5ZhpiyWEpxTPzH%2BVFQOgD03%2FEwpDAWKRKCtukSxbHewRNEyZov65CKYgqAIlSY73THuZU2%2FkFGlloqn1So0PT8QJ%2Bw2koEVHJVp95CM63VNj3xKI7BMgndCelDi8ab2uCe%2FsZpLPjAiEtMkGbnWI8UQqmtE1iyAr8f%2FzBacIdeklqygWwJMGNhuqslyezP2m%2BhoYSwZASiHXuLOEq5QZlpspTSq593VtmvMR8GIq2BFkPSEGXrhkrxHQG5VXtTc77g9nBZtSli0loRX6S5qnnQ1btCYI%2BoNJnI9MYApTJOgzC9FYxLMa97V%2BOiXwaF3z56h3r%2FlcGLd6RIuyOcTBf0qk7%2BIkbDT7QSqlUGwk34WgkGO6KLMdH%2BrWFRMSYdKOmpGTV2KBlusW1gsXuRouzMaui10lMDoWbi%2B7K%2B2%2FHKedLQS8ynBK7CrEe1stmjEKiCWMMC%2F48l6krSOa7xEdONgub%2B2U%2BZhn1ayTrqkB%2BwuZYNXeMKPGPj6zKZwB8Xh8nKLH92pQGMirTXh2Hx9Iwxq3x0kjRh7kIsXmKWwWAMthuxIcaiH%2FPixKJQ4meJ98AEW4%3D: command not found
52.218.181.17, 2600:1fa0:4080:9380:34da:f231::
Connecting to mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)|52.218.181.17|:443… connected.
HTTP request sent, awaiting response… 400 Bad Request
2022-06-01 10:10:28 ERROR 400: Bad Request.`

Oh, I have solved the problem. I just needed to keep the copied link up to ‘tar.gz’ and drop the rest of it. The following command works.

$ wget -c https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-de.tar.gz
--2022-06-01 15:02:20-- https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-de.tar.gz

Also, I don’t recommend directly downloading Common Voice datasets that are very large (20+ GB), because the download might stop partway through and be impossible to resume once the link has expired.
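
For reference, the expiry can be estimated from the link itself: in a signed AWS link, X-Amz-Date is the signing time (UTC) and X-Amz-Expires is the validity in seconds. A rough sketch, assuming GNU date and sed as found on most Linux systems, and using the two values from the link in my command above:

```shell
# Values copied from the query string of the link posted above.
amz_date='20220601T015102Z'   # X-Amz-Date: when the link was signed (UTC)
amz_expires=43200             # X-Amz-Expires: validity period in seconds

# Rewrite the compact timestamp into a form GNU date can parse.
iso=$(printf '%s' "$amz_date" |
  sed -E 's/^(....)(..)(..)T(..)(..)(..)Z$/\1-\2-\3 \4:\5:\6 UTC/')

# Convert to epoch seconds and add the validity period.
signed_epoch=$(date -u -d "$iso" +%s)
expiry_epoch=$((signed_epoch + amz_expires))

echo "valid for $((amz_expires / 3600)) hours"
# prints: valid for 12 hours
date -u -d "@$expiry_epoch" '+expires at %Y-%m-%d %H:%M:%S UTC'
# prints: expires at 2022-06-01 13:51:02 UTC
```

So a link like this one is only good for 12 hours after it is generated, which would explain why a download paused overnight could not be resumed.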

Hi @Kai1 and welcome to the community :wave:

I’m glad you could solve the problem.

I think the source of the problem was that your shell wasn’t happy about the & inside the URL. An unquoted & tells the shell to run the preceding command in the background, so it treated the first & as the end of the wget command and executed each of the following parts as a separate shell command. That is where the job numbers like [1] 108755 and errors like X-Amz-SignedHeaders=host: command not found came from.
Putting double or single quotes around the URL should prevent that if you ever get into a similar situation. :slightly_smiling_face:
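
A minimal sketch of the quoting fix, runnable offline. The URL is a shortened placeholder rather than a real download link, and count_args is a hypothetical helper that only reports how many arguments the shell passed along:

```shell
# Single quotes keep the shell from interpreting the & characters.
# Placeholder URL for illustration only; substitute the link you copied.
url='https://example.com/cv-corpus-de.tar.gz?X-Amz-Expires=43200&X-Amz-SignedHeaders=host'

# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

# Quoted, the whole URL (query string included) is one argument.
count_args "$url"
# prints: 1

# The actual download would then look like this (not run here):
#   wget -c "$url"
```

Pasting the link directly between single quotes on the wget command line works the same way.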

Have a nice day,
Michael

Downloading the dataset inside a training script, for example, is not good practice: it wastes time and resources. A possible exception is download automation (e.g. I use it in Colab to save to my Google Drive).

For local downloads, it is best to use a download manager that can resume if the connection drops.
