Is there a way to download only the data that meets certain conditions? For example, if I only need the Indian-accented English corpus, could I pass that as an option to load_dataset()? Some of the datasets are large, so surely there is a way to pick out a sample?
You can just create your own subcorpus using the TSV. There is a column for accent.
Can you provide me with a link/example on how to do that if possible? Thanks!
$ grep -P '(\tindian\t|^client_id)' train.tsv > train-indian.tsv
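That should do it. The ^client_id part keeps the header row, and \tindian\t keeps every row that has a tab-delimited field equal to indian.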
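If you would rather do the same thing from Python, here is a minimal sketch of the idea. It assumes the TSV has a header row and that the accent column is literally named "accent" (check your own file; the function name and defaults here are made up for illustration). Unlike the grep pattern, it only compares the one named column, so a sentence field that happens to equal "indian" would not match.

```python
def filter_tsv_by_column(src_path, dst_path, column="accent", value="indian"):
    """Copy the header plus every row whose `column` equals `value`.

    Returns the number of data rows written. The column name "accent"
    is an assumption; pass the real name from your TSV header.
    """
    kept = 0
    # newline="" leaves the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        header = src.readline()
        # Raises ValueError if the header has no such column.
        col = header.rstrip("\r\n").split("\t").index(column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            # Skip short or malformed rows instead of failing on them.
            if len(fields) > col and fields[col] == value:
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_tsv_by_column("train.tsv", "train-indian.tsv") should then give you roughly the same file as the grep one-liner, and the return value tells you how many clips matched.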