-
Notifications
You must be signed in to change notification settings - Fork 142
Investigation: Add support for instance profiles #322
Investigation: Add support for instance profiles #322
Conversation
I've started reviewing this, but it'll take more time than I have today. Will hopefully be able to get through it this week |
Hey @stevearc I appreciate you taking the time. Digging deep into how botocore handles rotating credentials for the instance profile credentials provider and how boto3 creates and uses sessions, everything should work? My only remaining dark corner is specifically around the collections for the service resource, e.g. Bucket.all().filter, etc. When I get some time I plan to create a couple of minimal test to try to isolate the specific failure. |
Currently failing test is related to changes for the health, and are in line with the change made. This was kind of something I was thinking through and proposing, but happy to revert that piece if you disagree about the points made. Failing test:
|
Hm, actually after doing the math just going to revert that part... My thought was why pay for something if we'd never expect it to change, but there are |
ok, so latest test failures are due to how the test mocks out the bucket instance. Unsure of immediate solution, but will think on it a bit |
Please hold on this PR; I am doing some additional investigation to identify root cause of issue. I've been investigating more by looking through botocore and boto3 source code and want to make sure that the problem is resolved and that the fix specifically addresses it. TY for your time so far |
Trying to not write a miniature novel of findings and progress on the investigation because honestly there is just a ton there. However, key takeaways:
Basically, all signs are pointing to "this should not be possible", so need to dig in a bit more. As an aside, since upgrading from 1.3.7 to 1.3.11 we're seeing much lower memory usage, which is awesome :) |
Main threads of investigation (just sharing in case someone is interested):
|
When configured to run with multiple processes, Pyramid will fork after setting up some resources. For most python objects that will transparently work, but it has caused some problems in the past with SQL session handles. Not sure if anything like that is happening here, but worth noting if you're going to investigate threading.
This to me is the most obvious place to start. If it doesn't reproduce in a simpler environment when you would expect it to, then there's something about the problem that we don't understand. |
Wanted to just drop an update: I got lucky this time and found: boto/botocore#1617 which seems very relevant and related. I'm going to leave open for now while I continue to investigate |
Hey @stevearc! So after talking more with our infrastructure team, we're starting to really suspect the issue is actually with kiam. I'm going to close this PR for now, but will keep my fork up with the changes in case we end up finding that there's more than one thing happening. |
dependent on / includes changes from #321
Related to issue #319
This is an initial pass at the implementation, and is functionally tested (by me) and is running in a deployed environment.
I've been trying to track down exactly how/where/when/why but here's what I believe I've observed:
Caveat: while I have a number of years of experience as a developer/programmer/engineer, i've traditionally stayed away from specifically web programming, so might be missing some of the web server specific details.
The current implementation of pypicloud (the one on master) appears to creates the storage backend once, when the cache is created ref
In the case of the S3 backend this in turn creates a reference to the bucket resource (boto3.resource("s3").Bucket).
It appears that this bucket is kept alive for an indeterminate amount of time. To be honest, it seems like it's kept alive for the length of the server from what i'm observing, but the "request" argument to the constructor is really making me doubt myself here.
I do know that without these patches (or equivalent) we reproducibly observe 500 - internal server error messages (which when you look at the logs on the pod are specifically expired credentials) about 15 minutes after server startup, which is the same length of time that the instance profile credentials are valid for.
Generally, Boto3 / botocore will handle rotating the instance profile credentials automatically in almost all cases, transparently to the user. It appears however that the resource objects do not automatically refresh their credentials when they are long lived. This observation is what led me to suspect that the resource is kept alive.
The implementation here makes the
self.bucket
member a property which will create the bucket if it doesn't exist the first time it's retrieved and will return a new boto3 Bucket resource to that same bucket on subsequent calls.This should hopefully minimize any disruption to anyone else.
I ended up removing the
get_bucket
method since it was only used in one location, but am happy to add it back; it ended up being only the two lines that are now the in bucket getter, so it seemed redundant to keep it as a separate function.I also added a few (What I believed to be minor) updates, and left comments explaining my thinking where I thought it'd be useful.
Welcome all feedback and discussion here; this is now the last remaining patch I manually maintain for our deployments so i'd be very happy to see it merged upstream.
TYVM!
-Tyler