Partial datasets for limited space, noisy dataset creation

Hey, I’m planning an experiment with noise-injection training at three different noise magnitudes for English. Unfortunately, my disk space is limited, so I’m looking for a dataset to use with DeepSpeech that’s between 5 and 15 GB.

Also, is it possible to use only part of the English Common Voice dataset?
If so, how can I do that?
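One option (a sketch, not an official DeepSpeech tool): after the importer converts Common Voice to CSV, the training script just reads whatever CSV you point it at, so you can write a smaller CSV that references only a subset of the clips and delete the unreferenced audio. This assumes the DeepSpeech-style column layout (`wav_filename`, `wav_filesize`, `transcript`); check the header of your generated CSV before relying on it.

```python
import csv
import random

def subset_csv(src, dst, n, seed=42):
    """Write a random n-row subset of a DeepSpeech-style CSV
    (one header row, then one clip per row) to dst."""
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.Random(seed).shuffle(rows)  # fixed seed -> reproducible subset
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows[:n])

# Demo with a tiny fake CSV; in practice src would be the importer's
# train.csv and n however many clips fit your disk budget.
with open("train_full.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["wav_filename", "wav_filesize", "transcript"])
    for i in range(100):
        w.writerow([f"clip{i}.wav", 16000, f"sample transcript {i}"])

subset_csv("train_full.csv", "train_small.csv", 10)
```

You could also subsample the released `.tsv` files before importing, which saves conversion time, but subsetting the final CSV is simpler to verify.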

For inference testing, do I need to find specific noisy test sets, or can I create them from the Common Voice dataset by adding noise to the clips myself?
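Creating them yourself is doable: a common approach is to mix a noise recording into each clip at a chosen signal-to-noise ratio, so your "three magnitudes" become three SNR levels. Below is a minimal sketch of SNR-based mixing with NumPy; the synthetic signals stand in for real audio you'd load from WAV files (e.g. with the `soundfile` package), and the SNR values are just illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the result has the requested
    SNR in dB. Both inputs are 1-D float arrays; the noise is tiled or
    truncated to match the speech length."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Demo with synthetic audio: a 1-second 440 Hz tone plus white noise.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
noisy_versions = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```

Note that if you only train and test on noise you synthesized from the same recipe, results may be optimistic; mixing in a separate real-world noise corpus for the test set gives a fairer picture. Recent DeepSpeech releases also ship built-in audio augmentation options for training, so it's worth checking the training documentation before rolling your own pipeline there.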

Any advice on this would be appreciated. Thanks in advance!