Accessing audio files by accent only (or other features)?

Bhavishya_K_H · June 4, 2021, 12:44pm

Is there any way to download only the data that meets certain conditions? For example, if I only needed the Indian-accented English corpus, could I pass that as an option to load_dataset()? Some of the datasets are large, so there has to be a way to pick out a subset.

ftyers · June 4, 2021, 6:57pm

You can just create your own subcorpus using the TSV. There is a column for accent.

Bhavishya_K_H · June 5, 2021, 12:47am

Can you provide me with a link/example on how to do that if possible? Thanks!

ftyers · June 5, 2021, 2:06pm

$ grep -P '(\tindian\t|^client_id)' train.tsv > train-indian.tsv

That should do it.
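
If you would rather do the same filtering from Python, here is a minimal sketch using only the standard library. It assumes the TSV has a header row and that the accent column is literally named "accent"; check the header of your own file, since I have not verified the name for every release. The function name filter_tsv_by_column is made up for this example. Unlike the grep pattern, it compares the value in one named column only, so it also works when that column is the last one on the line (where there is no trailing tab to match).

```python
def filter_tsv_by_column(src_path, dst_path, column, value):
    """Copy the header plus every row whose `column` equals `value`.

    Hypothetical helper for illustration; returns the number of rows kept.
    """
    kept = 0
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        columns = header.rstrip("\r\n").split("\t")
        # Raises ValueError if the column name is not in the header.
        idx = columns.index(column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            # Skip short or malformed rows instead of failing on them.
            if len(fields) > idx and fields[idx] == value:
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_tsv_by_column("train.tsv", "train-indian.tsv", "accent", "indian") should give you roughly the same subcorpus as the grep command above, and you can swap in another column name and value to select on other features.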