MPP-3505: Rewrite Account Bearer Token Authentication #5272
base: main
Conversation
Do we know how many
I didn't look through the tests yet, but I left some first comments on the first draft of the functional code.
    def get_cache_key(token):
        return hash(token)
TIL: "Python's built-in hash() function can produce different results between different Python processes"
and
The cache for the Accounts introspection response uses hash(token) as a cache key. This give a consistent number for a given Python instance, but a different number on a different Python instance, like a different pod or possibly a different gunicorn thread on the same pod. This means these responses were almost never cached.
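A stable alternative is to derive the key from a cryptographic digest of the token. This is a minimal sketch (the `introspect:` prefix and the function shape are illustrative, not necessarily what the PR uses): Python salts `hash()` of strings per process via `PYTHONHASHSEED`, while a `hashlib` digest is process-independent.

```python
import hashlib

def get_cache_key(token: str) -> str:
    """Derive a cache key that is stable across Python processes.

    hash(token) is salted per-process, so two pods computing hash()
    of the same token almost always get different numbers, and the
    cached introspection response is never found. A digest of the
    token gives every process the same key.
    """
    return "introspect:" + hashlib.sha256(token.encode()).hexdigest()

# The same token always maps to the same key, in any process:
assert get_cache_key("some-bearer-token") == get_cache_key("some-bearer-token")
```

Running with `PYTHONHASHSEED=0` is another way to make `hash()` deterministic, but a digest avoids relying on interpreter configuration.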
    # If the response is an error, raise an exception
    if isinstance(fxa_resp, IntrospectionError):
        if not fxa_resp.from_cache:
            fxa_resp.save_to_cache(cache, token, default_cache_timeout)
question (non-blocking): If I understand correctly, this means we will cache introspection errors for just 60s? I like that. When I first saw the cache + error code, I thought it might cache something like a timeout error for a long time. This looks like it will prevent a thundering herd AND recover from temporary errors in a reasonable amount of time.
Yes, I believe the previous code also cached errors for 60 seconds. Except, because of `hash()`, it is possible that cached errors were never loaded again.
    fxa_resp: IntrospectionResponse | IntrospectionError | None = None
    if use_cache:
        fxa_resp = load_introspection_result_from_cache(cache, token)
question (non-blocking): What happens if the cached response is an expired token? It looks like we save the response with a cache expiration time to match the token expiration time, but does the `cached = cache.get(cache_key)` call return `None` in this scenario? I didn't see a test for that.
There are a few systems involved, and different answers for each: the cache, the FxA endpoint, and Relay.
The Redis backend checks the expire time on a GET, and if the key is expired, it deletes it and returns None. The tests exercise cache misses and hits, trying both. I'm not sure what the Django in-memory cache does, which is used in tests, but it probably doesn't matter.
If you attempt to introspect an expired token, I expect FxA to return an error. I'd need to research what that error is.
If a user presents an expired token that we've cached, we treat it as authorized. I don't think we check the expiration of cached tokens. There's a chance this has never happened in production, since using `hash()` meant every call was a cache miss. But this new code would be a cache hit, so we may want to check for expiration or near-future expiration.
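One possible shape for that check, sketched in pure Python. The `is_usable` name and the 60-second skew margin are illustrative, and the sketch assumes the cached introspection payload carries an `exp` claim in milliseconds (an assumption to verify against FxA's response; RFC 7662 defines `exp` in seconds).

```python
import time

def is_usable(cached: dict, *, skew_seconds: int = 60) -> bool:
    """Reject cached introspection results whose token is expired
    or about to expire (skew margin covers clock drift)."""
    exp_ms = cached.get("exp")
    if exp_ms is None:
        return False  # no expiration claim: re-introspect to be safe
    return (exp_ms / 1000.0) > (time.time() + skew_seconds)

now_ms = int(time.time() * 1000)
assert is_usable({"exp": now_ms + 3_600_000}) is True   # valid for an hour
assert is_usable({"exp": now_ms - 1_000}) is False      # already expired
```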
api/authentication.py (outdated):
    self.use_cache = (method == "POST" and path == "/api/v1/relayaddresses/") or (
        method not in ("POST", "DELETE", "PUT")
    )
suggestion (non-blocking): This logic seems harder to follow than nested `if` statements. Can we revert that? (I also don't remember why we force `use_cache=True` when `POST`ing to `/api/v1/relayaddresses`, but I'm sure there's a reason.)
Sure, I can revert to the nested `if` statements. I'll see if I can determine from the development PR why we use the cache for creating a new mask. Maybe the Firefox code assumes that if the token was good to get the list of masks, it will still be good when creating the mask, and has less error handling.
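For comparison, one possible nested-`if` restatement of the boolean expression, equivalent for all inputs (`should_use_cache` is an illustrative standalone function, not the PR's method):

```python
def should_use_cache(method: str, path: str) -> bool:
    if method in ("POST", "DELETE", "PUT"):
        # Writes skip the cache, except creating a new mask, which
        # the Firefox integration calls right after listing masks.
        if method == "POST" and path == "/api/v1/relayaddresses/":
            return True
        return False
    return True

assert should_use_cache("GET", "/api/v1/profiles/") is True
assert should_use_cache("POST", "/api/v1/relayaddresses/") is True
assert should_use_cache("DELETE", "/api/v1/relayaddresses/1/") is False
```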
    if response:
        return response

    # Since this takes time, see if another request created the SocialAccount
question (non-blocking): do you suspect this might be the cause of the `SocialAccount.DoesNotExist` errors? That separate requests are creating/retrieving the `SocialAccount` object?
That is my best guess. But, it could be something weirder!
Thanks for the review @groovecoder! This one is still a work in progress. My notes from last year:
- Add the token to `IntrospectionResult`/`IntrospectionError`. Either omit it from the cache or require it to be the same.
- Remove token processing from `terms_accepted_user`
- Try permissions classes instead of permission checks in app
- Try always returning anon user, use permission check to reject
- Add timing for introspection API, profile API, and log it and stat it
- Are we done?
- Go back to main and implement as a parallel auth system, throw away this branch
So I do not think this will be the final PR. I'll answer some questions if I can before that final PR.
Additional research and tasks from this review:
- How does the existing code handle expired tokens (if it does)?
- What does FxA's introspect do for an expired token?
- Test and handle cached but expired tokens
- Why does `POST /api/v1/relayaddresses` use the cache, while `POST` to other endpoints skips it?
Reproduce the IntegrityError by creating a matching user and SocialAccount after checking for a matching user by email. This may not be the exact mechanism in production, but it does produce the same traceback.
Lots of changes that could have been in multiple commits.

In authentication tests:

* Split AuthenticationMiscellaneous TestCase into IntrospectTokenTests and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify setup.
* Convert _setup_fxa_response to setup_fxa_introspect. It now constructs the payload as well as mocking the response, and returns the mocked response and expected cached data.
* Use self.assertRaisesMessage for consistent exception checking.
* Assert on the mocked response call_count, not the URL-matched count.

In terms_accepted_user tests:

* Add _mock_fxa_profile_response
* Use new setup_fxa_introspect
* Use _setup_client everywhere
* Add mocked response checks
* Add cache value checks, rename incorrect test titles
Create the SocialAccount in a new function outside of a try block, to avoid nested exceptions.
Because the Mozilla Accounts profile fetch takes a while, this is the likely time for a parallel SocialAccount to be created. Check again before proceeding to create a new one.
The Django logout() function, since at least 1.10, also checks if the user was logged in or not, so our check is redundant (and has missing branch coverage).
If a colliding SocialAccount is created by a separate request, catch the IntegrityError and return 500.
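The pattern these commits describe, re-checking after slow work and then catching the constraint violation, can be sketched in pure Python. A dict plus lock stands in for the database's unique constraint, and `KeyError` stands in for Django's `IntegrityError`; all names here are illustrative.

```python
import threading

class UniqueStore:
    """Stand-in for a table with a unique constraint on uid."""
    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def create(self, uid, value):
        with self._lock:
            if uid in self._rows:
                raise KeyError(uid)  # stands in for IntegrityError
            self._rows[uid] = value
            return value

    def get(self, uid):
        return self._rows.get(uid)

def get_or_create(store, uid, value):
    # Re-check first: a parallel request may have created the row
    # while we were doing slow work (the profile fetch).
    existing = store.get(uid)
    if existing is not None:
        return existing, False
    try:
        return store.create(uid, value), True
    except KeyError:
        # Lost the race between the check and the create.
        return store.get(uid), False

store = UniqueStore()
a, created_a = get_or_create(store, "uid-1", "account-A")
b, created_b = get_or_create(store, "uid-1", "account-B")
assert created_a and not created_b
assert a == b == "account-A"
```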
Reimplement FxaTokenAuthentication on TokenAuthentication, to get the DRF-provided parsing of token authentication headers. This changes the status code for a header of 'Authorization: Bearer ' (token value omitted) from a 400 (Bad Request) to a 401 (Unauthorized).
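A pure-Python sketch of why the empty-token case moves from 400 to 401: DRF's `TokenAuthentication` splits the `Authorization` header itself and treats a scheme keyword with no credentials as a failed authentication (`AuthenticationFailed`, which maps to 401; modeled here with `PermissionError`). `parse_bearer` is illustrative, not DRF's actual code.

```python
def parse_bearer(header: str):
    """Parse an Authorization header the way DRF's TokenAuthentication
    does, with keyword = "Bearer"."""
    parts = header.split()
    if not parts or parts[0].lower() != "bearer":
        return None  # not our scheme; other authenticators may run
    if len(parts) != 2:
        # DRF raises AuthenticationFailed here -> 401, not 400
        raise PermissionError("Invalid token header")
    return parts[1]

assert parse_bearer("Bearer abc123") == "abc123"
assert parse_bearer("Basic dXNlcg==") is None
```

"Authorization: Bearer " with no value splits to a single word, hitting the `len(parts) != 2` branch and failing authentication rather than being treated as a malformed request.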
The result of hash(str) changes between Python instances, so the previous version would lead to many cache misses.
Return the introspection results instead of the expected cache contents. This may make the cache value change more obvious.
Instead of caching once with a 60-second TTL and again with an expiration-based TTL, cache once with the proper TTL.
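A minimal sketch of picking that single TTL, assuming the introspection payload has an `active` flag and an `exp` claim in milliseconds (both assumptions; the function name and 60-second default are illustrative, mirroring the error timeout described above):

```python
import time

DEFAULT_CACHE_TIMEOUT = 60  # short TTL so errors recover quickly

def cache_timeout(result: dict, now: float) -> int:
    """Return one TTL: until token expiry for valid results,
    a short default for errors or missing expiration claims."""
    exp_ms = result.get("exp")
    if result.get("active") and exp_ms:
        # Valid token: expire the cache entry when the token expires.
        return max(int(exp_ms / 1000 - now), 0)
    return DEFAULT_CACHE_TIMEOUT

now = time.time()
assert cache_timeout({"active": True, "exp": int((now + 300) * 1000)}, now) in (299, 300)
assert cache_timeout({"active": False}, now) == 60
```

A single `cache.set(key, value, timeout)` with this value replaces the double write.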
Move the token validation logic to `introspect_token`. This now returns either an IntrospectionResponse, for a valid token for an active user, or an IntrospectionError, if something went wrong. Both have a `save_to_cache` method to save them to a cache, and a new function `load_introspection_result_from_cache` can load them. `get_fxa_uid_from_oauth_token` changes to `introspect_token_or_raise`. It loads results from the cache, caches results, and raises exceptions. A large change is that it caches results even when `use_cache` is False. This means more introspect results are cached. FxaTokenAuthentication.authenticate now returns an IntrospectionResponse as the second parameter. The terms_accepted_user endpoint's error returns have changed. The new status codes:

* 503 - The introspection API is (temporarily) unavailable
* 401 - The Bearer token is invalid, or the user is inactive
Views requiring a user can use `IsAuthenticated`.
Setting FXA_TOKEN_AUTH_VERSION to 2024 (the default) will use the existing authentication for FxA bearer tokens. Setting it to 2025 will use the new authentication method. When the new authentication is proven, the 2024 version can be deleted.
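The switch can be sketched as a settings-driven class lookup (class names here are placeholders, not the PR's actual names; only the `FXA_TOKEN_AUTH_VERSION` setting comes from the description):

```python
class FxaTokenAuthentication2024:
    """Placeholder for the existing, proven implementation."""

class FxaTokenAuthentication2025:
    """Placeholder for the rewritten implementation."""

def auth_class_for(version: str):
    # "2024" (the default) keeps the proven path; "2025" opts in to
    # the rewrite. Once 2025 is proven, the 2024 branch can be deleted.
    if version == "2025":
        return FxaTokenAuthentication2025
    return FxaTokenAuthentication2024

assert auth_class_for("2024") is FxaTokenAuthentication2024
assert auth_class_for("2025") is FxaTokenAuthentication2025
```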
@groovecoder this is ready for review. I was able to implement the flag without starting over from main. This is the "big switch" version of this change. There are other ways to make these changes. For example, I could extract smaller bits, like checking for a social account again after fetching a profile. Let me know if you'd like a different strategy.
We can keep it like this. Though, I'm in no hurry to throw any big Fx integration switches while we're working to expand the Fx integration 10-100x, unless we think we need something like the caching to handle the 10-100x load? But then, it seems like if we fix the caching, that's the change that could have the most unknown effect on the Fx integration flow?
I agree, I'd want to keep the proven code running during the launch of the integration. The proven code is responsible for 50% of errors, but it is likely these do not impact signup flow, since they haven't shown up in QA testing or customer service reports. Ideally, I'd be able to run in prod before the ship date, to collect some data in that configuration. The next best is to run in stage, assuming there is a way to test the Firefox integration in stage. That should allow collecting data without affecting Relay users. Do you know if there is a way to test the Firefox integration in stage?
Yes there are two
This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable `FXA_TOKEN_AUTH_VERSION=2025` picks the new implementation.

While working on MPP-3505 (an `IntegrityError` while setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response uses `hash(token)` as a cache key. This gives a consistent number for a given Python instance, but a different number on a different Python instance, like a different pod or possibly a different gunicorn thread on the same pod. This means these responses were almost never cached. I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.

The new implementation uses a cache key that will be consistent across pods. It also has some potential improvements:

* `POST` requests call the API and do not cache the results.
* `/api/v1/terms-accepted-user` more consistently returns a `401` or `403` response. The Firefox integration is looking for a 401 response or a 403 response. The existing implementation returns `404` in some instances.
* `503 Service Unavailable` when the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles this poorly.
* `FxaTokenAuthentication` is based on Django REST Framework's TokenAuthentication, uses permissions classes to add permission checks beyond the token check, and returns token details in `request.auth`. This makes the authentication look more like other DRF authentication, and allows better code sharing between `/api/v1/terms-accepted-user` and other endpoints like `/api/v1/relayaddresses` used by the Firefox integration.