En 22G dataset, problems about 'path' in .tsv files

I have downloaded the 22G English dataset, but when I check the .tsv files I can’t find the files named in the ‘path’ column anywhere in the ‘clips’ folder. Can someone tell me how the ‘path’ values in the .tsv files relate to the mp3 files in the ‘clips’ folder?

Each ‘path’ in the .tsv files should map to a file named clips/<path>.mp3. What do you see in the clips directory when you untar the download?
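
If it helps, the mapping can be checked with a short Python sketch. This is only an illustration, not an official tool: the function name `missing_clips` is made up here, and it assumes the .tsv has a header row with a ‘path’ column whose values lack the ‘.mp3’ extension (it leaves the value alone if the extension is already there).

```python
import csv
from pathlib import Path


def missing_clips(tsv_file, clips_dir):
    """Return the 'path' values in tsv_file with no matching file in clips_dir."""
    clips_dir = Path(clips_dir)
    missing = []
    with open(tsv_file, newline="", encoding="utf-8") as handle:
        # Tab-separated with a header row; QUOTE_NONE keeps stray quote
        # characters in other columns from confusing the parser.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            name = row["path"]
            # Assumption: 'path' has no extension, so append '.mp3' if absent.
            if not name.endswith(".mp3"):
                name += ".mp3"
            if not (clips_dir / name).is_file():
                missing.append(row["path"])
    return missing
```

An empty list from `missing_clips("<some file>.tsv", "clips")` would mean every row resolved to a file; a list containing nearly every row suggests the extraction itself went wrong rather than the .tsv.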

The files in the clips folder have long string names, but without the ‘.mp3’ file extension.

I ran ‘find -name’ on the command line in the clips folder, using a path name taken from the .tsv files, but it returned nothing.

Just a couple quick questions:

  1. What OS are you on?
  2. What command and options did you use to untar the ‘en.tar.gz’ file that you downloaded?
  3. Can you post one of the filenames you are seeing in the ‘clips’ folder? (Any of them will do.)

OS: Ubuntu 16.04 LTS
Command: unar en.tar
Sample from ‘clips’:
0a0a3dc82cf1fc6d575f382c2902b9d0c98e51bfe6e27d12c6831127f9e8cfb6884601493a50dfa9422e436d72aa7baac7d

I just reproduced your issue by running unar. I don’t regularly use that, but it may have some issue with the file name lengths. I know tar works:
tar -xvf <download file>
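
To tell a bad download apart from an extraction problem, you could also list the names stored in the archive without unpacking it. Below is a rough sketch using Python's standard `tarfile` module; the helper name `mp3_member_names` is just for illustration, and it assumes the archive is a plain or gzip-compressed tar.

```python
import tarfile


def mp3_member_names(archive):
    """Return the names of regular files in the archive that end in '.mp3'."""
    # Mode "r:*" lets tarfile detect the compression (gzip, none, ...) itself.
    with tarfile.open(archive, "r:*") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".mp3")
        ]
```

If the names come back as clips/<long string>.mp3, the archive itself carries the extension and the problem lies with the tool used to unpack it.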

I used ‘tar -xvf’ but it seems that something went wrong.
I got output like this:
clips/497878c1e38f17c2959173982c51b35834ea1222f9685a730d5d85ed7a8aa5e1d8677a8dcf2f9c092635d6a303084d23b34b023acae0e841ba38f843b75c5ecd.mp3
tar: Ignoring unknown extended header keyword ‘SCHILY.dev’
tar: Ignoring unknown extended header keyword ‘SCHILY.ino’
tar: Ignoring unknown extended header keyword ‘SCHILY.nlink’

I think there’s something wrong with my downloaded archive.

The tar was likely created on a Mac, so you can ignore those ‘unknown extended header keyword’ warnings. You should now see that the ‘clips’ directory has all the correct files. Have fun with the dataset!

I re-downloaded the file and got the correct files with the ‘tar’ command. Thanks for your advice!