Longer-lasting caches with generic-worker

At the moment, generic-worker supports caching files and directories, optionally populated from Taskcluster artefacts.

However, for our client builds we have large amounts of cache (the Unity project’s Library directory, for example, which can run to tens of gigabytes). I don’t think it makes much sense for us to upload that as an artefact: compression, uploading and downloading would eat into the benefits quite a bit.

On cloud providers, though, we can generally attach multiple disks to an instance and change those attachments whilst the machine is running. This is the case for at least GCE.

This means we could potentially implement generic-worker cache directories as GCE persistent disks (for example) rather than local directories. That would increase the chance of cache hits, especially when the build machines themselves are short-lived.
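
For GCE specifically, attaching an existing persistent disk to a running instance is a single API call. A minimal Go sketch (project, zone, instance and disk names are placeholders, and credentials come from the instance’s service account):

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Uses Application Default Credentials; on a GCE instance this is the
	// attached service account.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	const (
		project  = "my-project"          // placeholder
		zone     = "europe-west1-b"      // placeholder
		instance = "worker-abc123"       // placeholder
		disk     = "cache-unity-library" // placeholder: a pre-created persistent disk
	)

	// Attach the existing persistent disk to the running instance.
	op, err := svc.Instances.AttachDisk(project, zone, instance, &compute.AttachedDisk{
		Source:     fmt.Sprintf("projects/%s/zones/%s/disks/%s", project, zone, disk),
		DeviceName: "cache", // appears as /dev/disk/by-id/google-cache inside the instance
	}).Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("attach operation started: %s", op.Name)
}
```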

Roughly, I guess this could be implemented like so:

  • Use labels in the cloud provider to mark & locate cache disk volumes.
  • Add APIs to worker-manager for attaching cache volumes to workers. (This avoids workers needing credentials for the cloud provider.)
  • Add support in generic-worker to use this API (and then potentially prepare/mount the given volume).
  • Use bind mounts for cache directories (or junctions on Windows) to minimise copying of the cache directory (see the sketch after this list).
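
To make the first and last bullets a little more concrete, here is a rough worker-side sketch. The label keys and the worker-manager `attachCacheVolume` API are hypothetical (the attach step only appears as a comment); the disk lookup and bind mount use real GCE and Linux APIs:

```go
package main

import (
	"context"
	"log"

	"golang.org/x/sys/unix"
	compute "google.golang.org/api/compute/v1"
)

// findCacheDisk looks up a persistent disk labelled for a given cache name,
// using cloud-provider labels to mark & locate cache volumes (first bullet).
// The label keys are our own convention, not anything Taskcluster defines.
func findCacheDisk(ctx context.Context, project, zone, cacheName string) (*compute.Disk, error) {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return nil, err
	}
	list, err := svc.Disks.List(project, zone).
		Filter("labels.tc-cache-name=" + cacheName + " AND labels.tc-cache-state=free").
		Context(ctx).Do()
	if err != nil {
		return nil, err
	}
	if len(list.Items) == 0 {
		return nil, nil // no existing cache volume; the worker would create one
	}
	return list.Items[0], nil
}

// bindCache exposes an already-mounted cache volume at the task's cache
// directory via a bind mount (last bullet), avoiding a copy. Linux-only;
// on Windows this would be a directory junction instead.
func bindCache(volumeMountpoint, taskCacheDir string) error {
	return unix.Mount(volumeMountpoint, taskCacheDir, "", unix.MS_BIND, "")
}

func main() {
	ctx := context.Background()
	disk, err := findCacheDisk(ctx, "my-project", "europe-west1-b", "unity-library")
	if err != nil {
		log.Fatal(err)
	}
	if disk == nil {
		log.Print("no free cache volume; would create a fresh one")
		return
	}
	log.Printf("found cache volume %s", disk.Name)
	// In the proposed design the worker would now ask worker-manager, via a
	// new (hypothetical) attachCacheVolume API, to attach this disk, then
	// mount its filesystem and bind it into the task directory, e.g.:
	//   bindCache("/mnt/caches/unity-library", "/home/task/project/Library")
}
```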

It could have some interesting caveats: for example, copying or moving files out of cache directories would be a fair bit slower, since the data would be crossing a filesystem boundary.

This mostly came out of a discussion I had with @pmoore, but I’d be very interested to see if anyone else has thoughts. (Some of it also follows on from other bits and pieces of conversations I’ve had about longer-lived caches.)

If anyone has any other thoughts, I’d love to hear them!

It does sound like larger, long-lived caches would make sense here.

My main concern is around permissioning: as long as we require certain scopes to mount a cache, we should be good. I’m not sure whether we want separate read-only and read-write scopes, or just the latter. By requiring different sets of scopes per cache, we can avoid accidental or intentional poisoning of caches by other projects’ tasks.

I do think we can avoid worrying too much about scopes, though.

generic-worker already requires scopes for mounting directory caches. Cloud-disk-backed caches could just use that same API, with the disk effectively being an implementation detail.

IIUC they provide the same semantics. The only questionable part is the potential expectation that the directory cache lives on the same physical device as it usually does now (and that can already be overridden).

You’d probably also have to add a maxSize to directory caches to allow them to be backed by disks, since you have to specify a size when creating a disk.
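
To illustrate, the existing writable directory cache mount in the task payload could stay as it is, with the disk backing hidden behind it. The struct below is only a sketch of that shape, not generic-worker’s real payload types, and the maxSize field is the hypothetical addition discussed above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WritableDirectoryCache mirrors the shape of generic-worker's existing
// writable directory cache mount; MaxSizeGB is a hypothetical addition to
// let the cache be backed by a fixed-size cloud disk.
type WritableDirectoryCache struct {
	CacheName string `json:"cacheName"`
	Directory string `json:"directory"`
	MaxSizeGB int    `json:"maxSize,omitempty"` // hypothetical: disks are sized at creation
}

func main() {
	// A task using this cache would still need the usual cache scope for
	// its cacheName (e.g. generic-worker:cache:unity-library-myproject).
	mount := WritableDirectoryCache{
		CacheName: "unity-library-myproject", // illustrative cache name
		Directory: "project/Library",
		MaxSizeGB: 50,
	}
	b, _ := json.MarshalIndent([]WritableDirectoryCache{mount}, "", "  ")
	fmt.Println(string(b)) // roughly what the "mounts" section of the payload would contain
}
```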

I appreciate the desire for caches that last longer than a single VM.

However, the idea of dynamically attaching volumes to instances seems quite difficult to get right. Just starting and monitoring VMs is shockingly difficult, and that’s the #1 way the cloud vendors make money! We’ve done best by building functionality that is not too closely tied to any particular cloud technology.

I see that all three major clouds support network filesystems (both NFS and SMB). Would that be an option here? I assume those filesystems perform a similar level of caching to EBS/Cloud Disks (that is, caching blocks of data near the instance on first read). That would nicely support shared access to data, either read-only or, in cases where multiple tasks can safely read and write the data at the same time, read-write. And, with a bit of attention to locking in the worker implementation, it could support exclusive caches as well, with the worker using that locking to “claim” a directory for its cache.
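
A rough sketch of that “claim a directory” idea on a shared mount follows. It relies on exclusive file creation being atomic on the network filesystem (true on NFSv4, but worth verifying for the specific setup), and all paths and names are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// claimCacheDir tries to take exclusive ownership of one of the cache
// directories for a given cache name on a shared (e.g. NFS) mount, by
// atomically creating a lock file inside it. It returns the claimed
// directory and a release function, or an error if all candidates are taken.
func claimCacheDir(sharedRoot, cacheName, workerID string) (string, func(), error) {
	candidates, err := filepath.Glob(filepath.Join(sharedRoot, cacheName+"-*"))
	if err != nil {
		return "", nil, err
	}
	for _, dir := range candidates {
		lockPath := filepath.Join(dir, ".claimed")
		// O_CREATE|O_EXCL fails if the lock file already exists, so only
		// one worker can claim a given directory at a time.
		f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
		if err != nil {
			continue // already claimed by another worker
		}
		fmt.Fprintln(f, workerID) // record who holds the claim, for debugging
		f.Close()
		release := func() { os.Remove(lockPath) }
		return dir, release, nil
	}
	return "", nil, fmt.Errorf("no free cache directory for %q under %s", cacheName, sharedRoot)
}

func main() {
	dir, release, err := claimCacheDir("/mnt/shared-caches", "unity-library", "worker-abc123")
	if err != nil {
		log.Fatal(err)
	}
	defer release()
	log.Printf("claimed %s for this worker's cache", dir)
}
```

A real implementation would also need to deal with stale claims left behind by workers that died without releasing (say, by writing a heartbeat timestamp alongside the lock).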

That suggestion limits the complexity of the implementation to the worker (or even just to the launch spec in the workerPool definitions, with no change to the worker at all), rather than involving worker-manager.

Access control might be a bit more crude – likely any instance that can access a shared volume would have to be trusted to handle that volume properly – but maybe that’s ok?