Accessing audio files by accent only (or other features)?

Is there a way to download only the data that meets certain conditions? For example, if I only needed the Indian-accented English corpus, could I pass that as an option to load_dataset()? Some of the datasets are large, so there must be a way to pick out a subset.

You can just create your own subcorpus using the TSV. There is a column for accent.

Can you provide me with a link/example on how to do that if possible? Thanks!

$ grep -P '(\tindian\t|^client_id)' train.tsv > train-indian.tsv

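That should do it. The pattern keeps the header line (the one starting with client_id) plus every row that has indian as a complete tab-delimited field.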
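If you would rather stay in Python, here is a rough sketch of the same filter that looks the column up by name instead of matching anywhere in the line. It assumes the TSV has a header row with a column called accent and that the value you want is spelled indian; the function name filter_tsv_by_column is made up for this example, so adjust it to your files.

```python
def filter_tsv_by_column(src_path, dst_path, column="accent", value="indian"):
    """Copy the header plus every row whose `column` equals `value`.

    Returns the number of data rows written. The default column name and
    value are assumptions about the TSV layout; change them as needed.
    """
    kept = 0
    # newline="" keeps the original line endings untouched on read and write
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        # Find the column position from the header (raises ValueError if absent)
        col = header.rstrip("\r\n").split("\t").index(column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            # Compare only the chosen column, so matches in other columns are ignored
            if len(fields) > col and fields[col] == value:
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_tsv_by_column("train.tsv", "train-indian.tsv") writes the subset and returns how many rows matched. Because it compares a single named column, it also works when that column is the last one on the line, and it will not pick up rows where indian happens to be the entire value of some other field.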