
Objective JH-7: Improve processes around resource cleanups #22

Closed · 3 tasks · batpad opened this issue Apr 16, 2024 · 13 comments
Labels: PI 24.3 Q2, 2024
@batpad (Collaborator) commented Apr 16, 2024

See also:

Motivation

  • We offer persistent storage to users on the hub
  • We don’t have an effective way to regularly clean up unused home directories
  • Users sometimes use HOME directories for temporary data storage, which can consume a lot of space without being useful to keep around

Owner(s)

@batpad @sunu @yuvipanda (support from @wildintellect + @j08lue )

Success criteria

  • We have a well-defined process for admins to go through user HOME directories and make decisions on what needs to be cleaned up
  • We have a well-defined cadence for cleaning up HOME directories, along with published guidelines that set user expectations around data persistence
  • If it seems feasible and desirable, we have a plan to completely automate clean-ups in the future.
@batpad batpad added the PI 24.3 Q2, 2024 label Apr 16, 2024
@batpad batpad changed the title Objective 7: Improve processes around resource cleanups Objective JH-7: Improve processes around resource cleanups Apr 18, 2024
@yuvipanda (Collaborator)

Also /cc @jbusecke who has been trying to get this done for a while as well.

@jbusecke

Thanks for looping me in here. This is indeed quite desperately needed for our growing community! Happy to help with testing.

@yuvipanda (Collaborator)

@jbusecke what do you think of the proposal in #15 (comment)

@jbusecke

Make a Grafana dashboard that lists users' home directories, the last time they were modified, how big they are, etc., as a table. This data is already collected by https://github.com/yuvipanda/prometheus-dirsize-exporter. I'll provide the JSON for the Grafana dashboard below. It's a 'dirty' export from one particular Grafana and would need to be made into a PR to https://github.com/jupyterhub/grafana-dashboards. That would allow it to show up in the appropriate Grafana for each of the hubs (VEDA, GHG, etc.).

Enable the allusers directory. This puts an allusers directory in the home directory of admin users. Perhaps it can be put somewhere other than $HOME as well, so people don't accidentally delete everyone's home directories? Regardless, this would give admins access to go clean up people's home directories. I think this is a good time for you to try to make a PR to the infrastructure repository following that link, @batpad - and we can work through any issues there.

This still leaves the hub admin with several steps, this one being fairly risky (as you mention). Ideally I would like to just go to the hub admin panel, delete a user (or even better, do that via an API call), and have all of the offboarding steps happen automatically. A sketch of the API half follows.
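For reference, the user-deletion half already exists in JupyterHub's REST API; a minimal sketch (the hub URL is a placeholder, and note that deleting a JupyterHub user does not touch their home directory):

```python
import os

import requests

HUB_API = "https://hub.example.org/hub/api"  # placeholder hub URL
token = os.environ["JUPYTERHUB_API_TOKEN"]   # admin-scoped API token

def delete_hub_user(username: str) -> None:
    """Remove a user from JupyterHub (does NOT delete their home directory)."""
    r = requests.delete(
        f"{HUB_API}/users/{username}",
        headers={"Authorization": f"token {token}"},
    )
    r.raise_for_status()

delete_hub_user("some-departed-user")
```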

Most importantly, now there needs to be an actual policy here. And this needs to be communicated to the users. What actually happens? Do their files get deleted? Are they notified? If so, how? This is actually the hard part.

Agreed. Here is what I would like to happen in order of increasing awesomeness (but probably also increasing effort for you):

  • Like ☝️ delete the data of a given user, while also removing them from the hub.
  • Move (not delete) the user's data into a 'trashcan' setup (user storage with e.g. 30-day retention). This would at the very least 'notify' the user by them not being able to access their home dir, but leave us with a (manual) option to retrieve the data for a while.
  • Move data to a trashcan setup, and automatically notify the user with a way to request restoration of their data to the main hub.
  • Do all of the above automatically when a user is removed from all relevant GitHub teams. At LEAP we have fully automated user sign-up by ingesting a spreadsheet with usernames and dates, and then signing them up via a GitHub API call. If we could do the same with offboarding, that would be amazing (and would actually enable us to pop off an email to that member, since we have GitHub usernames and emails colocated in that CSV). A rough sketch of this follows below.
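A rough sketch of that last, fully automated step, assuming a CSV with github_username and end_date columns (the file name, org, and team slug here are made up; the team-membership endpoint is the standard GitHub REST API one):

```python
import csv
import os
from datetime import date

import requests

ORG, TEAM_SLUG = "my-org", "hub-users"  # hypothetical org / team
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

with open("members.csv") as f:          # hypothetical sign-up spreadsheet
    for row in csv.DictReader(f):
        if date.fromisoformat(row["end_date"]) < date.today():
            # Removing team membership revokes hub access for that user
            r = requests.delete(
                f"https://api.github.com/orgs/{ORG}/teams/{TEAM_SLUG}"
                f"/memberships/{row['github_username']}",
                headers=headers,
            )
            r.raise_for_status()
            # ...then trigger home-dir archival and a notification email here
```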

Sorry this might be a bit rambly...long day.

@batpad (Collaborator, Author) commented Apr 19, 2024

@jbusecke this is all super great, thank you!

This really helps inform what our medium-term plan around this should be. The "Admin looks at a Grafana dashboard and manually cleans up" approach was definitely a stopgap idea, and what you have articulated here seems like a really good, concrete plan for how we would want to manage this.

I'll discuss the implementation of these with @yuvipanda and @sunu, see what we can reasonably sketch out for both the shorter and medium term, amend / add tickets accordingly, and keep you looped in.

Thanks again for the inputs here!

@batpad (Collaborator, Author) commented May 30, 2024

The OpenScapes folks have some really good documentation around this and also some scripts to perform archiving / resource cleanups: https://github.com/NASA-Openscapes/2i2cAccessPolicies?tab=readme-ov-file#data-storage-in-the-nasa-openscapes-hub - Thanks to @ateucher !

The overall policies and retention periods seem sensible to me and the scripts to automate cleanups look really good to me.

@sunu - it would be great to take a look at this with you next week and see what we need to get set up to start performing these tasks on the hubs we manage.

For a medium to long-term solution, it would be great to have a common solution that can be easily applied to "all" hubs. I'm having a hard time thinking of a "good" solution that does not involve building some UI to let hub admins manage these tasks, as @jbusecke describes above.

So far, I see a few workflows around home directory cleanups:

  • Time-based: e.g. automatically archive home directories after 6 months of non-usage. This is very well described in the OpenScapes document (a detection sketch follows this list).
  • Event-based: as described by @jbusecke above - some users were added for a workshop / event, and as part of offboarding the users, their HOME directories should be automatically archived or deleted.
  • Alert-based: We may want to set some thresholds and inform an admin user for cases of unusually high HOME directory usage based on policies set by the hub admins
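For the time-based workflow, the detection half is straightforward to sketch; a minimal version assuming home directories live under a shared mount like /home (the path and the 6-month cutoff are illustrative):

```python
import time
from pathlib import Path

HOME_ROOT = Path("/home")             # illustrative shared home-directory mount
CUTOFF_SECONDS = 6 * 30 * 24 * 3600   # ~6 months

def stale_home_dirs(root=HOME_ROOT, cutoff=CUTOFF_SECONDS):
    """Yield (home_dir, days_idle) for directories untouched past the cutoff."""
    now = time.time()
    for home in root.iterdir():
        if not home.is_dir():
            continue
        latest = home.stat().st_mtime
        for path in home.rglob("*"):
            try:
                latest = max(latest, path.stat().st_mtime)
            except OSError:  # broken symlinks, races with deletions
                continue
        if now - latest > cutoff:
            yield home, int((now - latest) / 86400)

for home, days in stale_home_dirs():
    print(f"{home} has been idle for {days} days")
```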

Trying to think of what an "MVP" here could look like, that would already be broadly useful:

  • As part of hub configuration, we can set time-based rules for automated home directory clean-ups. The archive-home-dirs.py script developed by @ateucher and @yuvipanda looks really great. I think the idea to archive to S3 as a first step is great, so the action is reversible (see the sketch after this list). Let's do this manually for now to get a feel for things, but it would be great to think about eventually automating it via hub configuration.
  • It does seem like there's a pretty strong use-case for "make it easy for a hub admin to pass in a username / list of usernames and perform clean-ups on those user home directories". Admins can probably choose whether to archive and delete, or just delete, etc. @yuvipanda does this seem feasible as something to think about building into the server admin UI somewhere?
  • "Alerting rules and notification": It seems like it might be nice to be able to configure some usage thresholds to get alerted on, but maybe that's a pipe dream? Perhaps we can start with figuring out and better documenting the process for looking at Grafana dashboards and identifying things that may need manual intervention to clean up?

The archive-home-dirs script seems like a great start - we'll work through this process manually at first to get a better feel for doing cleanups, but I do think it would be great to work on some of the automation and admin UI bits soon.

I think it's likely good to let this marinate a bit and see if there are other things being worked on or other thoughts before diving into things here.

@ateucher commented May 30, 2024

Thanks for this @batpad! I can take very little credit for the archive-home-dirs.py script, that was 99% @yuvipanda :)

As for the process, a few things jump out to me (mostly in agreement with you):

  • The allusers directory always present in my home directory does make me a bit nervous; it would be nice to be able to log in as an admin or as a regular user depending on what I want to do in there.
  • It's nice that the 6-month auto-removal of archives in S3 is handled by a delete-after-expiry lifecycle rule. The S3 bucket we archive to has another lifecycle rule that transitions objects to Glacier Instant Retrieval after 3 days, so that storage is really cheap (see the first sketch after this list). Also thanks to @yuvipanda, I believe!
  • As mentioned, we don't currently have a way to alert users (other than manual email) about the archival of their dir, nor a way for them to request restoration. It could be done in a GitHub issue, but that seems maybe a bit too public.
  • The manual process of identifying home dirs to archive, and then doing it with the script is still a bit undefined. We decided on the 6 month policy but haven't laid out exactly what the process is.
  • We also need to align the home dir cleanup policies with policies for removing hub access (by removing from appropriate GitHub teams) - this will look different for different categories of users (single workshop, learning cohorts, long-term access).
  • I am working on a monthly usage report that pulls together usage data from Grafana (Prometheus) and cost data from AWS so we can monitor that. I'll update here when it's further along.
  • One quirky thing is that home directory names go through some character escaping, which appears to replace any - with -2d, so linking home dir names to GitHub usernames takes some machinations (see the second sketch after this list).
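Two of the points above are easy to make concrete. The lifecycle setup is plain S3 configuration; here is a sketch with boto3 (bucket name, prefix, and rule ID are placeholders; the 3-day and 6-month figures are the ones mentioned above):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="hub-home-archives",         # placeholder archive bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-retention",  # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "archives/"},
            # Cheap storage after 3 days, automatic deletion after ~6 months
            "Transitions": [{"Days": 3, "StorageClass": "GLACIER_IR"}],
            "Expiration": {"Days": 180},
        }]
    },
)
```

And the -2d quirk looks like hex escaping with - as the escape character (an assumption based on the single example above); if so, reversing it is a short regex substitution:

```python
import re

def unescape_dirname(name: str) -> str:
    """Reverse hex escaping, e.g. 'my-2duser' -> 'my-user' (assumed scheme)."""
    return re.sub(r"-([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), name)

print(unescape_dirname("my-2duser"))  # -> my-user
```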

Also FYI, the access policy doc will be moved to the NASA-Openscapes cookbook soon. I will be sure to leave a link in the README to the new home.

@ateucher

This seems relevant here too: 2i2c-org/infrastructure#4159

@ateucher

One other thought is that at the same time we start to enforce home directory usage policies, we need to give users the tools and skills to use alternatives - i.e., get comfortable using S3 for storage.

@batpad (Collaborator, Author) commented May 31, 2024

One other thought is that at the same time we start to enforce home directory usage policies, we need to give users the tools and skills to use alternatives - i.e., get comfortable using S3 for storage.

I do think the biggest benefit we can get is from trainings and documentation. I imagine there are a few genuine use-cases for the persistent home directory storage, but in a LOT of cases, either:

  • The user was doing some temporary file processing and just did not clean up. Temp file processing should happen in /tmp/ or similar - this will also be faster for the user, and it is automatically cleaned up when the container is shut down (see the sketch after this list).
  • The user wants to archive some outputs, etc. - for which, as you say, S3 should be used.
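As a concrete pattern for the first case, a minimal sketch using Python's tempfile module, which keeps intermediates out of $HOME entirely (and /tmp is node-local, so it is typically faster than networked home storage):

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory under /tmp: it is removed when the block
# exits, and /tmp itself disappears when the container shuts down anyway.
with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "intermediate.csv"
    scratch.write_text("a,b\n1,2\n")
    # ... heavy processing here; copy only final outputs to S3 ...
```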

It would be great to have these written out with practical examples as something that can be shared with users as part of trainings / hub onboarding.

This seems relevant here too: 2i2c-org/infrastructure#4159

This ticket is great! Thank you @ateucher @yuvipanda :-) - that perfectly covers the "Alert-based" use-case I mention above. Happy to discuss and see if there's something we can do to help moving that forward.

@ateucher

💯 @batpad.

We have a basic tutorial on storing data in the $SCRATCH_BUCKET and $PERSISTENT_BUCKET here: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage.html. We presented this in one of our cohort calls, but haven't seen much evidence of the buckets being used yet. It's still missing guidance on using /tmp/, though.
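For the S3 side, the environment variables from that tutorial keep the basic workflow short; a minimal sketch with s3fs, assuming $SCRATCH_BUCKET holds an s3:// URL and the hub's credentials grant access:

```python
import os

import s3fs

fs = s3fs.S3FileSystem()
scratch = os.environ["SCRATCH_BUCKET"]         # e.g. s3://bucket/<username>
fs.put("results.nc", f"{scratch}/results.nc")  # upload a local file
print(fs.ls(scratch))                          # confirm it landed
```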

@wildintellect

Indirectly, some use cases are outlined in the MAAP docs: https://docs.maap-project.org/en/latest/getting_started/getting_started.html#MAAP-Storage-Options
Specifically:

  • Code should be in version control.
  • Large data you need to keep should go in buckets (whose bucket is another issue). MAAP provides buckets; I think the only reason people use them is that they are auto-mounted to appear as folders.
  • Anything else left in home should be expected to vanish at any time.

I've never seen /tmp described in any of the cloud-based notebooks (/tmp and /scratch are pretty traditional on HPC), but they really should be. /scratch aligns most closely with a SCRATCH bucket, but I wouldn't use buckets for that; performance would be terrible.

@yuvipanda (Collaborator)

Just a note that we're going to be upstreaming the home-directories report Grafana dashboard. It works fine, and I'll finish up a bit of docs and get it up later this week: https://github.com/jupyterhub/grafana-dashboards/compare/main...yuvipanda:grafana-dashboards:homedirs?expand=1

@batpad batpad closed this as completed Oct 4, 2024