
Objective JH-7: Improve processes around resource cleanups #22

Closed · 3 tasks · batpad opened this issue Apr 16, 2024 · 13 comments
Labels: PI 24.3 Q2, 2024
@batpad (Collaborator) commented Apr 16, 2024

See also:

Motivation

  • We offer persistent storage to users on the hub
  • We don’t have an effective way to regularly clean up unused home directories
  • Users sometimes use HOME directories for temporary data storage, which can consume a lot of space without being useful to keep around

Owner(s)

@batpad @sunu @yuvipanda (support from @wildintellect + @j08lue )

Success criteria

  • We have a well-defined process for admins to go through user HOME directories and make decisions on what needs to be cleaned up
  • We have a well-defined cadence for cleaning up HOME directories, along with published guidelines that set user expectations around data persistence
  • If it seems feasible and desirable, we have a plan to completely automate clean-ups in the future.
@batpad batpad added the PI 24.3 Q2, 2024 label Apr 16, 2024
@batpad batpad changed the title Objective 7: Improve processes around resource cleanups Objective JH-7: Improve processes around resource cleanups Apr 18, 2024
@yuvipanda (Collaborator)

Also /cc @jbusecke who has been trying to get this done for a while as well.

@jbusecke

Thanks for looping me in here. This is indeed quite desperately needed for our growing community! Happy to help with testing.

@yuvipanda (Collaborator)

@jbusecke what do you think of the proposal in #15 (comment)

@jbusecke

Make a Grafana dashboard that lists users' home directories, the last time they were modified, how big they are, etc., as a table. This data is already collected by https://github.com/yuvipanda/prometheus-dirsize-exporter. I'll provide the JSON for the Grafana dashboard below. It's a 'dirty' export from one particular Grafana and would need to be made into a PR to https://github.com/jupyterhub/grafana-dashboards. That would allow it to show up in the appropriate Grafana for each of the hubs (VEDA, GHG, etc.).

Enable the allusers directory. This puts an allusers directory in the home directory of admin users. Perhaps it can be put somewhere other than $HOME as well, so people don't accidentally delete everyone's home directories? Regardless, this would give admins access to go clean up people's home directories. I think this is a good time for you to try to make a PR to the infrastructure repository following that link, @batpad - and we can work through any issues there.

This still leaves the hub admin with several steps, this one being fairly risky (as you mention). Ideally I would like to just go to the hub admin panel, delete a user (or even better, do that via an API call), and have all of the offboarding steps happen automatically. A sketch of the API half follows.
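For reference, the user-deletion half already exists in JupyterHub's REST API; a minimal sketch (the hub URL is a placeholder, and note that deleting a JupyterHub user does not touch their home directory):

```python
import os

import requests

HUB_API = "https://hub.example.org/hub/api"  # placeholder hub URL
token = os.environ["JUPYTERHUB_API_TOKEN"]   # admin-scoped API token

def delete_hub_user(username: str) -> None:
    """Remove a user from JupyterHub (does NOT delete their home directory)."""
    r = requests.delete(
        f"{HUB_API}/users/{username}",
        headers={"Authorization": f"token {token}"},
    )
    r.raise_for_status()

delete_hub_user("some-departed-user")
```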

Most importantly, now there needs to be an actual policy here. And this needs to be communicated to the users. What actually happens? Do their files get deleted? Are they notified? If so, how? This is actually the hard part.

Agreed. Here is what I would like to happen in order of increasing awesomeness (but probably also increasing effort for you):

  • Like ☝️ delete the data of a given user, while also removing them from the hub.
  • Move (not delete) the user's data into a 'trashcan' setup (user storage with e.g. 30-day retention). This would at the very least 'notify' the user by them not being able to access their home dir, but leave us with a (manual) option to retrieve the data for a while.
  • Move data to a trashcan setup, and automatically notify the user with a way to request restoration of their data to the main hub.
  • Do all of the above automatically when a user is removed from all relevant GitHub teams. At LEAP we have fully automated user sign-up by ingesting a spreadsheet with usernames and dates, and then signing them up via a GitHub API call. If we could do the same with offboarding, that would be amazing (and would actually enable us to pop off an email to that member, since we have GitHub usernames and emails colocated in that CSV). A rough sketch of this follows below.
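A rough sketch of that last, fully automated step, assuming a CSV with github_username and end_date columns (the file name, org, and team slug here are made up; the team-membership endpoint is the standard GitHub REST API one):

```python
import csv
import os
from datetime import date

import requests

ORG, TEAM_SLUG = "my-org", "hub-users"  # hypothetical org / team
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

with open("members.csv") as f:          # hypothetical sign-up spreadsheet
    for row in csv.DictReader(f):
        if date.fromisoformat(row["end_date"]) < date.today():
            # Removing team membership revokes hub access for that user
            r = requests.delete(
                f"https://api.github.com/orgs/{ORG}/teams/{TEAM_SLUG}"
                f"/memberships/{row['github_username']}",
                headers=headers,
            )
            r.raise_for_status()
            # ...then trigger home-dir archival and a notification email here
```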

Sorry this might be a bit rambly...long day.

@batpad (Collaborator, Author) commented Apr 19, 2024

@jbusecke this is all super great, thank you!

This really helps inform what our medium-term plan around this should be. The "Admin looks at a Grafana dashboard and manually cleans up" approach was definitely a stopgap idea, and what you have articulated here seems like a really good, concrete plan for how we would want to manage this.

I'll discuss the implementation of these with @yuvipanda and @sunu, see what we can reasonably sketch out for both the shorter and medium term, amend / add tickets accordingly, and keep you looped in.

Thanks again for the inputs here!

@batpad (Collaborator, Author) commented May 30, 2024

The OpenScapes folks have some really good documentation around this and also some scripts to perform archiving / resource cleanups: https://github.com/NASA-Openscapes/2i2cAccessPolicies?tab=readme-ov-file#data-storage-in-the-nasa-openscapes-hub - Thanks to @ateucher !

The overall policies and retention periods seem sensible to me and the scripts to automate cleanups look really good to me.

@sunu - it would be great to take a look at this with you next week and see what we need to get set up to start performing these tasks on the hubs we manage.

For a medium to long-term solution, it would be great to have a common solution that can be easily applied to "all" hubs. I'm having a hard time thinking of a "good" solution that does not involve building some UI to let hub admins manage these tasks, as @jbusecke describes above.

So far, I see a few workflows around home directory cleanups:

  • Time-based: e.g. automatically archive home directories after 6 months of non-usage. This is very well described in the OpenScapes document (a detection sketch follows this list).
  • Event-based: as described by @jbusecke above - some users were added for a workshop / event, and as part of offboarding the users, their HOME directories should be automatically archived or deleted.
  • Alert-based: We may want to set some thresholds and inform an admin user for cases of unusually high HOME directory usage based on policies set by the hub admins
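For the time-based workflow, the detection half is straightforward to sketch; a minimal version assuming home directories live under a shared mount like /home (the path and the 6-month cutoff are illustrative):

```python
import time
from pathlib import Path

HOME_ROOT = Path("/home")             # illustrative shared home-directory mount
CUTOFF_SECONDS = 6 * 30 * 24 * 3600   # ~6 months

def stale_home_dirs(root=HOME_ROOT, cutoff=CUTOFF_SECONDS):
    """Yield (home_dir, days_idle) for directories untouched past the cutoff."""
    now = time.time()
    for home in root.iterdir():
        if not home.is_dir():
            continue
        latest = home.stat().st_mtime
        for path in home.rglob("*"):
            try:
                latest = max(latest, path.stat().st_mtime)
            except OSError:  # broken symlinks, races with deletions
                continue
        if now - latest > cutoff:
            yield home, int((now - latest) / 86400)

for home, days in stale_home_dirs():
    print(f"{home} has been idle for {days} days")
```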

Trying to think of what an "MVP" here could look like, that would already be broadly useful:

  • As part of hub configuration, we can set time-based rules for automated home directory clean-ups. The archive-home-dirs.py script developed by @ateucher and @yuvipanda looks really great. I think the idea to archive to S3 as a first step is great, so the action is reversible (see the sketch after this list). Let's do this manually for now to get a feel for things, but it would be great to think about eventually automating it via hub configuration.
  • It does seem like there's a pretty strong use-case for "make it easy for a hub admin to pass in a username / list of usernames and perform clean-ups on those user home directories". Admins can probably choose whether to archive and delete, or just delete, etc. @yuvipanda does this seem feasible as something to think about building into the server admin UI somewhere?
  • "Alerting rules and notification": It seems like it might be nice to be able to configure some usage thresholds to get alerted on, but maybe that's a pipe dream? Perhaps we can start with figuring out and better documenting the process for looking at Grafana dashboards and identifying things that may need manual intervention to clean up?

The archive-home-dirs script seems like a great start - we'll work through this process manually at first to get a better feel for doing cleanups, but I do think it would be great to work on some of the automation and admin UI bits soon.

I think it's likely good to let this marinate a bit and see if there are other things being worked on or other thoughts before diving into things here.

@ateucher commented May 30, 2024

Thanks for this @batpad! I can take very little credit for the archive-home-dirs.py script, that was 99% @yuvipanda :)

As for the process, a few things jump out to me (mostly in agreement with you):

  • The allusers directory always present in my home directory does make me a bit nervous; it would be nice to be able to log in as an admin or as a regular user depending on what I want to do in there.
  • It's nice that the 6-month auto-removal of archives in S3 is handled by a delete-after-expiry lifecycle rule. The S3 bucket we archive to has another lifecycle rule that transitions objects to Glacier Instant Retrieval after 3 days, so that storage is really cheap (see the first sketch after this list). Also thanks to @yuvipanda, I believe!
  • As mentioned, we don't currently have a way to alert users (other than manual email) about the archival of their dir, nor a way for them to request restoration. It could be done in a GitHub issue, but that seems maybe a bit too public.
  • The manual process of identifying home dirs to archive, and then doing it with the script is still a bit undefined. We decided on the 6 month policy but haven't laid out exactly what the process is.
  • We also need to align the home dir cleanup policies with policies for removing hub access (by removing from appropriate GitHub teams) - this will look different for different categories of users (single workshop, learning cohorts, long-term access).
  • I am working on a monthly usage report that pulls together usage data from Grafana (Prometheus) and cost data from AWS so we can monitor that. I'll update here when it's further along.
  • One quirky thing is that home directory names go through some character escaping, which appears to replace any - with -2d, so linking home dir names to GitHub usernames takes some machinations (see the second sketch after this list).
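Two of the points above are easy to make concrete. The lifecycle setup is plain S3 configuration; here is a sketch with boto3 (bucket name, prefix, and rule ID are placeholders; the 3-day and 6-month figures are the ones mentioned above):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="hub-home-archives",         # placeholder archive bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-retention",  # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "archives/"},
            # Cheap storage after 3 days, automatic deletion after ~6 months
            "Transitions": [{"Days": 3, "StorageClass": "GLACIER_IR"}],
            "Expiration": {"Days": 180},
        }]
    },
)
```

And the -2d quirk looks like hex escaping with - as the escape character (an assumption based on the single example above); if so, reversing it is a short regex substitution:

```python
import re

def unescape_dirname(name: str) -> str:
    """Reverse hex escaping, e.g. 'my-2duser' -> 'my-user' (assumed scheme)."""
    return re.sub(r"-([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), name)

print(unescape_dirname("my-2duser"))  # -> my-user
```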

Also FYI, the access policy doc will be moved to the NASA-Openscapes cookbook soon. I will be sure to leave a link in the README to the new home.

@ateucher

This seems relevant here too: 2i2c-org/infrastructure#4159

@ateucher

One other thought is that at the same time we start to enforce home directory usage policies, we need to give users the tools and skills to use alternatives - i.e., get comfortable using S3 for storage.

@batpad (Collaborator, Author) commented May 31, 2024

One other thought is that at the same time we start to enforce home directory usage policies, we need to give users the tools and skills to use alternatives - i.e., get comfortable using S3 for storage.

I do think the biggest benefit we can get is from trainings and documentation. I imagine there are a few genuine use-cases for the persistent home directory storage, but in a LOT of cases, either:

  • The user was doing some temporary file processing and just did not clean up. Temp file processing should happen in /tmp/ or similar - this will also be faster for the user, and it is automatically cleaned up when the container is shut down (see the sketch after this list).
  • The user wants to archive some outputs, etc. - for which, as you say, S3 should be used.
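As a concrete pattern for the first case, a minimal sketch using Python's tempfile module, which keeps intermediates out of $HOME entirely (and /tmp is node-local, so it is typically faster than networked home storage):

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory under /tmp: it is removed when the block
# exits, and /tmp itself disappears when the container shuts down anyway.
with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "intermediate.csv"
    scratch.write_text("a,b\n1,2\n")
    # ... heavy processing here; copy only final outputs to S3 ...
```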

It would be great to have these written out with practical examples as something that can be shared with users as part of trainings / hub onboarding.

This seems relevant here too: 2i2c-org/infrastructure#4159

This ticket is great! Thank you @ateucher @yuvipanda :-) - that perfectly covers the "Alert-based" use-case I mention above. Happy to discuss and see if there's something we can do to help moving that forward.

@ateucher

💯 @batpad.

We have a basic tutorial on storing data in the $SCRATCH_BUCKET and $PERSISTENT_BUCKET here: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage.html. We presented this in one of our cohort calls, but haven't seen much evidence of the buckets being used yet. It's still missing guidance on using /tmp/, though.
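For the S3 side, the environment variables from that tutorial keep the basic workflow short; a minimal sketch with s3fs, assuming $SCRATCH_BUCKET holds an s3:// URL and the hub's credentials grant access:

```python
import os

import s3fs

fs = s3fs.S3FileSystem()
scratch = os.environ["SCRATCH_BUCKET"]         # e.g. s3://bucket/<username>
fs.put("results.nc", f"{scratch}/results.nc")  # upload a local file
print(fs.ls(scratch))                          # confirm it landed
```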

@wildintellect

Indirectly, some use cases are outlined in the MAAP docs: https://docs.maap-project.org/en/latest/getting_started/getting_started.html#MAAP-Storage-Options
Specifically:

  • Code should be in version control.
  • Large data you need to keep should go in buckets (whose bucket is another issue). MAAP provides buckets; I think the only reason people use them is that they are auto-mounted to appear as folders.
  • Anything else left in home should be expected to vanish at any time.

I've never seen /tmp described in any of the cloud-based notebooks (/tmp and /scratch are pretty traditional on HPC), but they really should be. /scratch aligns most closely with a SCRATCH bucket, but I wouldn't use buckets for that; performance would be terrible.

@yuvipanda (Collaborator)

Just a note that we're going to be upstreaming the home-directories report Grafana dashboard. It works fine, and I'll finish up a bit of docs and get it up later this week: https://github.com/jupyterhub/grafana-dashboards/compare/main...yuvipanda:grafana-dashboards:homedirs?expand=1

@batpad batpad closed this as completed Oct 4, 2024