Tools to prune cache #3

Open
hadley opened this issue Sep 15, 2018 · 9 comments
Labels
feature a feature request or enhancement

Comments

@hadley
Member

hadley commented Sep 15, 2018

I don't think we need it in this version, but I think the next version should have some way to automatically prune the cache to keep it below a user-specified threshold (maybe controlled via an environment variable, and set to, say, 5 GB by default?).
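
A minimal sketch of how such a threshold could be resolved, written in Python purely for illustration. The variable name `CACHE_MAX_SIZE_GB` and the function name are hypothetical and not existing settings of any package; the 5 GB fallback is the default floated above.

```python
import math
import os

DEFAULT_MAX_GB = 5.0  # default suggested in the comment above


def cache_size_limit_bytes(env=None, var="CACHE_MAX_SIZE_GB"):
    """Return the cache size limit in bytes.

    `var` is a hypothetical environment variable name, used only for
    illustration. Missing, non-numeric, non-finite, or non-positive
    values fall back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get(var, "")
    try:
        gb = float(raw)
    except ValueError:
        gb = DEFAULT_MAX_GB
    if not math.isfinite(gb) or gb <= 0:
        gb = DEFAULT_MAX_GB
    return int(gb * 1024**3)
```

Falling back to the default on bad input, rather than raising, keeps a misconfigured environment from breaking installs, which seems reasonable for a cache.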

@gaborcsardi
Member

Good point. Agreed. I did not include it, because it requires some thinking....

@hadley
Member Author

hadley commented Sep 16, 2018

@wch just thought this through for shiny caching, so is likely to have good ideas.

@gaborcsardi
Member

One difficulty is that we would probably need to store the last-access timestamp, since we definitely don't want to remove packages that were used recently. For this we would need to lock the cache with an exclusive lock, which is not ideal.

@hadley
Member Author

hadley commented Sep 16, 2018

Do you think it's too unreliable to use the file system last access time?

@gaborcsardi
Member

gaborcsardi commented Sep 16, 2018

Yeah, we can try that as well, but it is still not always reliable. In particular, AFAICT it is usually not available in Docker containers. I think it is also not always enabled on Windows.

But we can figure something out, probably, e.g. have a separate lock for the access times.

@gaborcsardi
Member

Although maybe you don't want to prune in Docker containers anyway. But Windows is still an issue, and in general it is just too platform-dependent to rely on.

@wch
Member

wch commented Sep 17, 2018

The atime attribute can't be relied on in general. In Linux, it's not unusual to mount a filesystem with noatime. On some filesystems, the time resolution is poor (for FAT, the time resolution is one day, and for HFS+, I think it's one second). On NTFS in Windows, atime has a 100 ns resolution, but it is updated only once per hour.

I ran a bunch of tests on mtime, ctime, and atime here: https://gist.github.com/wch/9bc615c70219c7ac15f7b339ddd7a30d

The solution I ended up using was to use mtime, which seems to work reliably across platforms, and call Sys.setFileTime() each time I accessed the file: https://github.com/rstudio/shiny/blob/8c9ce19/R/cache-disk.R#L290

Note that Sys.setFileTime() apparently updates mtime, ctime, and atime on some platforms, and on others (Windows-NTFS in my testing) updates only mtime.
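
To illustrate the "touch mtime on every access" idea, here is a minimal sketch in Python, where `os.utime()` plays the role of R's `Sys.setFileTime()`. It is not the shiny code linked above; the function name `read_and_touch` is made up for this example.

```python
import os
import time


def read_and_touch(path):
    """Read a cached file and record the access by bumping its mtime.

    Hypothetical helper: mtime is used as the "last used" marker
    because atime cannot be relied on across platforms.
    """
    with open(path, "rb") as f:
        data = f.read()
    now = time.time()
    try:
        # os.utime takes (atime, mtime); both are set to the current time.
        os.utime(path, (now, now))
    except OSError:
        # The file may have been pruned by another process, or the
        # directory may be read-only. The data already read is still valid.
        pass
    return data
```

A failed timestamp update is ignored on purpose: the worst case is that the entry looks older than it is and gets pruned early, which is recoverable for a cache.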

The disk caching and pruning code in the link above is designed to work when multiple processes are using the same directory to store objects, so no locking is required (there are some potential races, but all are recoverable, since it's just a cache). All the relevant state for the objects (name, time, size, and the content) is stored on the filesystem, so you can stop an R process that uses the directory for a cache, then start another one and point it to the directory, and it will continue to work fine.
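
A rough sketch of lock-free pruning along these lines, in Python for illustration only. It assumes a flat cache directory with one file per object and mtime as the last-use marker; `prune_cache` and its parameters are hypothetical names, and this is not the implementation in the linked file.

```python
import os


def prune_cache(cache_dir, max_bytes):
    """Delete least-recently-used files until the total size is <= max_bytes.

    All state (name, mtime, size) is read from the filesystem, so no lock
    is taken. Files that vanish mid-scan (removed by another process) are
    skipped. Returns the list of paths this call removed.
    """
    entries = []
    with os.scandir(cache_dir) as it:
        for entry in it:
            try:
                if entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    entries.append((st.st_mtime, st.st_size, entry.path))
            except OSError:
                continue  # entry disappeared or is unreadable; ignore it

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):  # oldest mtime first
        if total <= max_bytes:
            break
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass  # another process already removed it; space is freed anyway
        except OSError:
            continue  # e.g. file in use; leave it and try the next one
        total -= size
    return removed
```

Two processes pruning at once may both try to delete the same file or may briefly overshoot the limit in either direction; as noted above, those races are tolerable because anything removed can be fetched again.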

@gaborcsardi
Member

@wch Thanks!

gaborcsardi added the "feature" label (a feature request or enhancement) on Oct 8, 2020
@gaborcsardi
Member

Note: I am going to postpone this until we have a database backend, to avoid having to rewrite it then.

3 participants