Discussion: We need a better solution for the clips directory

On any file system, having too many small files in one directory is a bad idea. Some file systems handle it better than others, but in the end performance drops - at the operating-system level.

Many systems that store large amounts of small data (caching is one example) use methods like:

  1. Keep the data in a database
  2. Keep it in a large file and implement an indexed access
  3. Use a tree of directories (e.g. 16-256 entries per level)

When working with the datasets, we need to un-tar them to access individual clips, and when the dataset is large, so is the clips directory. If you watch the un-tar process of a larger dataset, you will see it get slower and slower, because the OS/file-system pair cannot cope with it, especially if they keep journals. For those large datasets, RAM disks are also not an option: you would need workstation-grade hardware with lots of RAM (and RAM is already a scarce resource when training).

I’m not sure how you cope with this issue, but I tried several operating systems with different file systems, and each hit a bottleneck at some point. Of course HDD < SSD < NVMe SSD etc., but you would need large, fast and costly drives (e.g. 4 TB NVMe on PCIe 4/5).

And it will get worse with increasing dataset sizes…
And it comes on top of the performance cost of accessing many small files instead of a few large ones (files are usually between 3 KB and 90 KB for now, but will reach about 130 KB with the new 15-second recording limit).

The easiest solution I can think of is #3 above, i.e. creating a directory structure based on the number in the clip name, e.g. for common_voice_ab_19904194.mp3:

./clips
  /19
    /90
      /41
         common_voice_ab_19904194.mp3

That would keep each directory under 100 items and the path calculations would be very easy…
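
For illustration, here is a minimal Python sketch of that path calculation (the function name and the id-extraction regex are my own shorthand, not anything from the CV codebase):

    from pathlib import Path
    import re

    def clip_path(clips_root: str, filename: str) -> Path:
        # Hypothetical helper: take the numeric clip id and use its
        # first three 2-digit groups as directory levels.
        match = re.search(r"_(\d+)\.mp3$", filename)
        if not match:
            raise ValueError(f"unexpected clip name: {filename}")
        digits = match.group(1)
        levels = [digits[i:i + 2] for i in range(0, 6, 2)]  # '19', '90', '41'
        return Path(clips_root, *levels, filename)

    print(clip_path("./clips", "common_voice_ab_19904194.mp3"))
    # -> clips/19/90/41/common_voice_ab_19904194.mp3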

What is your experience? What do you think?

In the last two months, I implemented and tested many ideas, and found that even the multi-level directory structure is not a real solution, although it eases the problem. The core issue remains the sheer number of small files and the OS overhead of opening/closing them.

The solution is Apache Parquet format.

I’m changing my whole workflow to this:

  • Traverse the downloaded .tar.gz files without expanding them to disk, collecting the members in memory (Python).
  • Write metadata files to the filesystem (for now)
  • Collect clips in batches, transcode them (16 kHz mono mp3 - this makes the data smaller and saves time in later training steps), and save them as compressed Parquet files in a hierarchical/partitioned (hive) layout, so that each file is 1 GB at most (a rough sketch of these steps follows the layout below):
datasets
  clips.parquet
    lc={xx}
      ver={yy}
        batch_{zz}_part_{i}.parquet
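
As a rough sketch of the first and third steps (assuming pyarrow; the function and column names are illustrative, and the transcoding is left out for brevity):

    import tarfile

    import pyarrow as pa
    import pyarrow.parquet as pq

    def tar_to_parquet(tar_path, out_dir, batch_size=10_000):
        # Stream clips out of the .tar.gz without extracting to disk,
        # flushing them to compressed Parquet files in batches.
        names, blobs, batch_no = [], [], 0
        with tarfile.open(tar_path, "r:gz") as tar:
            for member in tar:
                if not member.isfile() or not member.name.endswith(".mp3"):
                    continue
                fobj = tar.extractfile(member)
                if fobj is None:
                    continue
                names.append(member.name)
                blobs.append(fobj.read())  # clip bytes stay in memory
                if len(names) >= batch_size:
                    flush(names, blobs, out_dir, batch_no)
                    names, blobs, batch_no = [], [], batch_no + 1
        if names:
            flush(names, blobs, out_dir, batch_no)

    def flush(names, blobs, out_dir, batch_no):
        table = pa.table({
            "path": names,
            "audio": pa.array(blobs, type=pa.binary()),
        })
        pq.write_table(table, f"{out_dir}/batch_{batch_no:04d}.parquet",
                       compression="zstd")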

This way, I can keep all data in one place and access it rapidly with SQL-like queries. It takes some time to “import” the data into this structure, but later access is only limited by disk bandwidth.
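
For example, with DuckDB (just one option; the glob and the filter value below are placeholders matching the layout above):

    import duckdb

    # The hive-partitioned paths turn lc/ver into virtual columns, so the
    # query only touches the files of the requested language/version.
    con = duckdb.connect()
    rows = con.execute("""
        SELECT *
        FROM read_parquet('datasets/clips.parquet/**/*.parquet',
                          hive_partitioning = true)
        WHERE lc = 'ab'
        LIMIT 10
    """).fetchall()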

Many dataset providers also use Apache Parquet, as the format is columnar and works well with cloud storage such as S3. Maybe CV can also consider this for future releases.

I hope this helps some…