On virtually any file system, having too many small files in one directory is a bad idea. Some file systems cope better than others, but in the end performance drops at the operating system level.
Many systems that store large amounts of small data (caching is one example) use methods like:
- Keep the data in a database
- Keep it in a large file and implement an indexed access
- Use a tree of directory structures (e.g. 16-256 per level)
- …
When working with datasets, we need to un-tar them to access individual clips, and when the dataset is large, so is the clips directory. If you watch the un-tar process of a larger dataset, you will see it getting slower and slower because the OS/file-system pair cannot cope with it, especially if they keep journals. For those large datasets it is also not possible to use RAM disks; you would need workstation-grade hardware to get that much RAM (and RAM is already a scarce resource when training).
I’m not sure how you cope with this issue, but I tried several operating systems with different file systems, and each hits a bottleneck at some point. Of course HDD < SSD < NVMe SSD etc., but you would need large, speedy and costly drives (e.g. 4 TB NVMe on PCIe 4/5).
And it will get worse with increasing dataset sizes…
And it comes on top of the performance cost of accessing many small files instead of a few large ones (files are usually between 3 KB and 90 KB for now, but will reach about 130 KB with the new 15 sec recording limit).
The easiest solution I can think of is #3 above, i.e. creating a directory structure based on the number in the clip name, e.g. for common_voice_ab_19904194.mp3:
./clips
    /19
        /90
            /41
                common_voice_ab_19904194.mp3
That would keep each directory at around 100 entries at most, and the path calculation would be very easy…
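For illustration, here is a minimal Python sketch of what such a path calculation could look like. The `sharded_path` helper, the regex, and the 3-level / 2-digit split are just my assumptions for this example, not anything from the dataset tooling:

```python
import re
from pathlib import Path

# Assumed pattern: the numeric id sits right before the .mp3 extension.
CLIP_ID_RE = re.compile(r"_(\d+)\.mp3$")

def sharded_path(clips_root: str, filename: str,
                 levels: int = 3, width: int = 2) -> Path:
    """Return e.g. clips/19/90/41/common_voice_ab_19904194.mp3."""
    match = CLIP_ID_RE.search(filename)
    if not match:
        raise ValueError(f"no numeric id found in {filename!r}")
    # Pad short ids so directory names always have the same width,
    # then split the leading digits into fixed-size chunks.
    digits = match.group(1).zfill(levels * width)
    parts = [digits[i * width:(i + 1) * width] for i in range(levels)]
    return Path(clips_root).joinpath(*parts, filename)

if __name__ == "__main__":
    p = sharded_path("./clips", "common_voice_ab_19904194.mp3")
    print(p)  # clips/19/90/41/common_voice_ab_19904194.mp3
    p.parent.mkdir(parents=True, exist_ok=True)  # create shard dirs as needed
```

The same chunking would be applied both when extracting the archive and when reading clips back, so the mapping from clip name to path stays deterministic.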
What is your experience? What do you think?