
MPP-3505: Rewrite Account Bearer Token Authentication #5272

Open · wants to merge 68 commits into main from wip-terms-accepted-user-mpp-3505

Conversation

@jwhitlock (Member) commented on Dec 20, 2024:

This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable FXA_TOKEN_AUTH_VERSION=2025 picks the new implementation.

While working on MPP-3505 (an IntegrityError while setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number within a given Python process, but a different number in a different process, such as a different pod or a different gunicorn worker on the same pod. This means these responses were almost never cached.

I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.
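For reference, a process-stable key can be built with a real hash function instead of the built-in hash(). A minimal sketch, assuming the token is a str; the helper name and key prefix are illustrative, not this PR's actual code:

```python
import hashlib

def get_cache_key(token: str) -> str:
    # sha256 is deterministic across processes, unlike the built-in hash(),
    # which CPython randomizes per process for str inputs.
    return "introspect:" + hashlib.sha256(token.encode()).hexdigest()
```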

The new implementation uses a cache key that is consistent across pods. It also has some potential improvements:

  • Accounts introspection API responses and errors are cached. In the existing implementation, 'write' requests like a POST call the API and do not cache the results.
  • /api/v1/terms-accepted-user more consistently returns a 401 or 403 response. The Firefox integration is looking for a 401 or a 403 response, but the existing implementation returns 404 in some instances.
  • The new code returns 503 Service Unavailable when the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles it poorly.
  • The new code tracks the time it takes to call the Accounts introspection API and profile API. These timers will let us determine when Accounts is having an issue that affects Relay.
  • The new FxaTokenAuthentication is based on Django REST Framework's TokenAuthentication, uses permission classes to add checks beyond the token check, and returns token details in request.auth. This makes the authentication look more like other DRF authentication, and allows better code sharing between /api/v1/terms-accepted-user and other endpoints the Firefox integration uses, like /api/v1/relayaddresses (see the sketch after this list).
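A hedged sketch of the shape that last bullet describes, using DRF's standard extension points. FxaTokenAuthentication and introspect_token_or_raise are names from this PR; lookup_user and HasValidFxaToken are hypothetical, and the bodies are illustrative rather than the PR's actual code:

```python
from rest_framework.authentication import TokenAuthentication
from rest_framework.permissions import BasePermission

class FxaTokenAuthentication(TokenAuthentication):
    # DRF's TokenAuthentication parses "Authorization: <keyword> <token>"
    # headers; overriding the keyword accepts "Authorization: Bearer ...".
    keyword = "Bearer"

    def authenticate_credentials(self, key):
        # introspect_token_or_raise (named later in this PR) returns an
        # IntrospectionResponse or raises an authentication error.
        fxa_resp = introspect_token_or_raise(key)
        user = lookup_user(fxa_resp)  # hypothetical helper
        # The second element becomes request.auth, so views and permission
        # classes can inspect the token details.
        return (user, fxa_resp)

class HasValidFxaToken(BasePermission):  # hypothetical permission class
    def has_permission(self, request, view):
        # Permission checks beyond the token check live here, not in the
        # authentication class.
        return request.auth is not None
```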

@jwhitlock jwhitlock marked this pull request as draft December 20, 2024 18:17
@groovecoder (Member) commented:
/api/v1/terms-accepted-user returns other error codes for some situations, like a 404 if the introspection result does not have a sub field. However, the Firefox integration is looking for a 401 or a 403 response. We should use a 5xx response, like 503 Service Unavailable, for these unexpected responses, to signal to the client to retry later.

Do we know how many 5xx responses we would return with this change? Those turn into ugly error messages to Firefox clients and end-users. I wonder if we should spend the time to prevent the errors completely rather than surface more up to end-users?

@groovecoder (Member) left a review:

I didn't look through the tests yet, but I left some first comments on the first draft of the functional code.

def get_cache_key(token):
    return hash(token)
@groovecoder (Member) commented:
TIL: "Python's built-in hash() function can produce different results between different Python processes"

and

The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number within a given Python process, but a different number in a different process, such as a different pod or a different gunicorn worker on the same pod. This means these responses were almost never cached.
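A quick way to see the randomization (CPython randomizes str hashing per process unless PYTHONHASHSEED is pinned); this demo spawns two child interpreters so no specific values need to be assumed:

```python
import subprocess
import sys

# Each child interpreter computes a different hash for the same string,
# so hash(token) cannot work as a cross-process cache key.
for _ in range(2):
    print(subprocess.check_output(
        [sys.executable, "-c", 'print(hash("some-token"))']
    ).decode().strip())
```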

# If the response is an error, raise an exception
if isinstance(fxa_resp, IntrospectionError):
    if not fxa_resp.from_cache:
        fxa_resp.save_to_cache(cache, token, default_cache_timeout)
@groovecoder (Member) commented:
question (non-blocking): If I understand correctly, this means we will cache introspection errors for just 60s? I like that. When I first saw the cache + error code, I thought it might cache something like a timeout error for a long time. This looks like it will prevent a thundering herd AND recover from temporary errors in a reasonable amount of time.

@jwhitlock (Member Author) replied:
Yes, I believe the previous code also cached errors for 60 seconds. Except, because of hash(), it is possible that cached errors were never loaded again.
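A minimal sketch of the behavior being discussed, using Django's low-level cache API; the key and value shapes are illustrative, and the 60-second figure comes from this thread:

```python
from django.core.cache import cache

# Cache an introspection error with a short TTL. For ~60 seconds, repeated
# requests with the same token get the cached error instead of hammering the
# Accounts API (no thundering herd). After the TTL, cache.get() returns None
# and the next request re-introspects, recovering from transient errors.
cache.set("introspect:error:<cache-key>", {"status_code": 503}, timeout=60)
assert cache.get("introspect:error:<cache-key>") == {"status_code": 503}
```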


fxa_resp: IntrospectionResponse | IntrospectionError | None = None
if use_cache:
    fxa_resp = load_introspection_result_from_cache(cache, token)
@groovecoder (Member) commented:
question (non-blocking): What happens if the cached response is an expired token? It looks like we save the response with a cache expiration time to match the token expiration time, but does the cached = cache.get(cache_key) call return None in this scenario? I didn't see a test for that.

@jwhitlock (Member Author) replied:
There are a few systems involved, with a different answer for each: the cache, the FxA endpoint, and Relay.

The Redis backend checks the expire time on a GET, and if the key is expired, it deletes it and returns None. The tests exercise both cache misses and hits. I'm not sure what the Django in-memory cache, which is used in tests, does, but it probably doesn't matter.

If you attempt to introspect an expired token, I expect FxA to return an error. I'd need to research what that error is.

If a user presents an expired token that we've cached, we treat it as authorized. I don't think we check the expiration of cached tokens. There's a chance this has never happened in production, since using hash() meant every call was a cache miss. But this new code would be a cache hit, so we may want to check for expiration or near-future expiration.
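One way to add the near-future expiration check mentioned here, assuming the cached introspection payload carries an `exp` claim in seconds since the epoch (per RFC 7662; FxA's actual field name and units should be verified):

```python
import time

GRACE_SECONDS = 5  # illustrative safety margin for near-future expiration

def is_usable(cached_introspection: dict) -> bool:
    # Treat a cached result as a miss if the token is expired or about to
    # expire, forcing a fresh introspection call instead.
    exp = cached_introspection.get("exp")
    return exp is None or exp > time.time() + GRACE_SECONDS
```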

Comment on lines 374 to 376
self.use_cache = (method == "POST" and path == "/api/v1/relayaddresses/") or (
    method not in ("POST", "DELETE", "PUT")
)
@groovecoder (Member) commented:
suggestion (non-blocking): This logic seems harder to follow than nested if statements. Can we revert that? (I also don't remember why we force use_cache=True when POSTing to /api/v1/relayaddresses but I'm sure there's a reason.)

@jwhitlock (Member Author) replied:
Sure, I can revert to the nested if statements. I'll see if I can determine from the development PR why we use the cache for creating a new mask. Maybe the Firefox code assumes that if the token was good enough to get the list of masks, it will still be good when creating one, and has less error handling.
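For reference, a nested-if form equivalent to the boolean expression quoted above, in the same method context; this is a sketch of the suggested revert, not the final code:

```python
if method in ("POST", "DELETE", "PUT"):
    # Write requests skip the cache, except for mask creation.
    self.use_cache = method == "POST" and path == "/api/v1/relayaddresses/"
else:
    # Read requests always use the cache.
    self.use_cache = True
```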

if response:
    return response

# Since this takes time, see if another request created the SocialAccount
@groovecoder (Member) commented:
question (non-blocking): do you suspect this might be the cause of the SocialAccount.DoesNotExist errors? That separate requests are creating/retrieving the SocialAccount object?

@jwhitlock (Member Author) replied:
That is my best guess. But, it could be something weirder!
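A hedged sketch of the race and the re-check that later commits in this PR describe. django-allauth's SocialAccount model and fields are real; the helper itself is illustrative:

```python
from allauth.socialaccount.models import SocialAccount
from django.db import IntegrityError

def get_or_create_fxa_account(user, fxa_uid, extra_data):
    try:
        # The profile fetch is slow, so a parallel request may have created
        # the SocialAccount while this request was waiting on it.
        return SocialAccount.objects.get(provider="fxa", uid=fxa_uid)
    except SocialAccount.DoesNotExist:
        pass
    try:
        return SocialAccount.objects.create(
            user=user, provider="fxa", uid=fxa_uid, extra_data=extra_data
        )
    except IntegrityError:
        # Lost the race: a colliding row appeared between the check and the
        # insert. Fetch the winner instead of failing.
        return SocialAccount.objects.get(provider="fxa", uid=fxa_uid)
```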

@jwhitlock (Member Author) left a review:

Thanks for the review @groovecoder! This one is still a work in progress. My notes from last year:

  • Add the token to IntrospectionResult / IntrospectionError. Either omit from the cache or require it to be the same.
  • Remove token processing from terms_accepted_user
  • Try permissions classes instead of permission checks in app
  • Try always returning anon user, use permission check to reject
  • Add timing for introspection API, profile API, and log it and stat it
  • Are we done?
  • Go back to main and implement as a parallel auth system, throw away this branch

So I do not think this will be the final PR. I'll answer some questions if I can before that final PR.

Additional research and tasks from this review:

  • How does the existing code handle expired tokens (if it does)?
  • What does FxA's introspect do for an expired token?
  • Test and handle cached but expired tokens
  • Why does POST /api/v1/relayaddresses use the cache, while POST to other endpoints skips it?


@jwhitlock (Member Author) commented:
The decision about the cache read on POST /api/v1/relayaddresses is on MPP-3156. There's no justification recorded, but I'm guessing the POST always comes after a GET? I'd be OK with removing this exception.

Remaining items:

  • Add timing stat for profile fetch
  • (Maybe) add tests for logs and stats
  • Add checks and tests for expired tokens
  • Try sending an expired token to FxA introspect
  • Throw away and restart on main

Reproduce the IntegrityError by creating a matching user and
SocialAccount after checking for a matching user by email. This may not
be the exact mechanism in production, but it does produce the same
traceback.

Lots of changes that could have been in multiple commits.

In authentication tests:

* Split the AuthenticationMiscellaneous TestCase into IntrospectTokenTests
  and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify
  setup.
* Convert _setup_fxa_response to setup_fxa_introspect. It now constructs
  the payload as well as mocking the response, and returns the mocked
  response and expected cached data.
* Use self.assertRaisesMessage for consistent exception checking.
* Assert on the mocked response call_count, not the URL-matched count.

In terms_accepted_user tests:

* Add _mock_fxa_profile_response
* Use the new setup_fxa_introspect
* Use _setup_client everywhere
* Add mocked response checks
* Add cache value checks, rename incorrect test titles

Create the SocialAccount in a new function outside of a try block, to
avoid nested exceptions.

Because the Mozilla Accounts profile fetch takes a while, this is the
likely time for a parallel SocialAccount to be created. Check again
before proceeding to create a new one.

The Django logout() command, since at least 1.10, also checks whether the
user was logged in, so our check is redundant (and has missing branch
coverage).

If a colliding SocialAccount is created by a separate request, catch the
IntegrityError and return 500.

Reimplement FxaTokenAuthentication on TokenAuthentication, to get the
DRF-provided parsing of token authentication headers. This changes the
status code for a header of 'Authorization: Bearer ' (token value
omitted) from a 400 (Bad Request) to a 401 (Unauthorized).

The result of hash(str) changes between Python instances, so the
previous version would lead to many cache misses.

Return the introspection results instead of the expected cache contents.
This may make the cache value change more obvious.

Instead of caching once with a 60-second TTL and again with the expiration
TTL, cache once with the proper TTL.

Move the token validation logic to `introspect_token`. This now either
returns an IntrospectionResponse, for a valid token for an active user,
or an IntrospectionError, if something went wrong. Both have a
`save_to_cache` method to save them to a cache, and a new function
`load_introspection_result_from_cache` can load them.

`get_fxa_uid_from_oauth_token` changes to `introspect_token_or_raise`.
It loads results from the cache, caches results, and raises exceptions.

A large change is that it caches results even when `use_cache` is False.
This means more introspect results are cached.

FxaTokenAuthentication.authenticate now returns an IntrospectionResponse
as the second parameter.

The terms_accepted_user endpoint's error returns have changed. The new
status codes:

* 503 - The introspection API is (temporarily) unavailable
* 401 - The Bearer token is invalid, or the user is inactive
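Stitching together the snippets and commit messages above, a hedged sketch of the `introspect_token_or_raise` flow. The function and class names (introspect_token, IntrospectionError, save_to_cache, load_introspection_result_from_cache) come from this PR; the exception classes, the `is_upstream_error` attribute, and the bodies are illustrative:

```python
from django.core.cache import cache
from rest_framework.exceptions import APIException, AuthenticationFailed

default_cache_timeout = 60  # 60-second figure from the discussion above

class IntrospectionUnavailable(APIException):  # hypothetical exception class
    status_code = 503
    default_detail = "The introspection API is temporarily unavailable."

def introspect_token_or_raise(token, use_cache=True):
    fxa_resp = None
    if use_cache:
        fxa_resp = load_introspection_result_from_cache(cache, token)
    if fxa_resp is None:
        # Fresh call; returns IntrospectionResponse or IntrospectionError.
        fxa_resp = introspect_token(token)
    if isinstance(fxa_resp, IntrospectionError):
        if not fxa_resp.from_cache:
            fxa_resp.save_to_cache(cache, token, default_cache_timeout)
        if fxa_resp.is_upstream_error:  # hypothetical attribute
            raise IntrospectionUnavailable()  # 503
        raise AuthenticationFailed()  # 401: invalid token or inactive user
    if not fxa_resp.from_cache:
        # Results are cached even when use_cache is False (the cache read is
        # skipped, the write still happens), per the commit message above.
        fxa_resp.save_to_cache(cache, token, default_cache_timeout)
    return fxa_resp
```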
@jwhitlock jwhitlock marked this pull request as ready for review January 17, 2025 23:51
@jwhitlock jwhitlock force-pushed the wip-terms-accepted-user-mpp-3505 branch from a86f701 to e17060f on January 18, 2025 00:12
@jwhitlock jwhitlock changed the title from "WIP MPP-3505: Rewrite Account Bearer Token Authentication" to "MPP-3505: Rewrite Account Bearer Token Authentication" on Jan 21, 2025
@jwhitlock (Member Author) commented:
@groovecoder this is ready for review. I was able to implement the flag without starting over from main.

This is the "big switch" version of this change. There are other ways to make these changes. For example, I could extract smaller bits, like checking for a social account again after fetching a profile. Let me know if you'd like a different strategy.

@groovecoder (Member) replied:
This is the "big switch" version of this change. There are other ways to make these changes. For example, I could extract smaller bits, like checking for a social account again after fetching a profile. Let me know if you'd like a different strategy.

We can keep it like this. Though, I'm in no hurry to throw any big Fx integration switches while we're working to expand the Fx integration 10-100x, unless we think we need something like the caching to handle the 10-100x load? But then, it seems like if we fix the caching, that's the change that could have the most unknown effect on the Fx integration flow?

@jwhitlock (Member Author) replied:
We can keep it like this. Though, I'm in no hurry to throw any big Fx integration switches while we're working to expand the Fx integration 10-100x, unless we think we need something like the caching to handle the 10-100x load? But then, it seems like if we fix the caching, that's the change that could have the most unknown effect on the Fx integration flow?

I agree, I'd want to keep the proven code running during the launch of the integration. The proven code is responsible for 50% of errors, but it is likely these do not impact the signup flow, since they haven't shown up in QA testing or customer service reports.

Ideally, I'd be able to run the new implementation in prod before the ship date, to collect some data in that configuration. The next best option is to run it in stage, assuming there is a way to test the Firefox integration in stage. That should allow collecting data without affecting Relay users.

Do you know if there is a way to test the Firefox integration in stage?

@groovecoder (Member) replied:
Do you know if there is a way to test the Firefox integration in stage?

Yes, there are two about:config prefs that determine which Relay URL Firefox uses: signon.firefoxRelay.base_url and signon.firefoxRelay.manage_url.
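For example, something like the following in about:config; the hostnames and paths are placeholders, and the exact value format each pref expects should be confirmed against the Firefox integration code:

```
# Point Firefox at the stage deployment instead of prod (placeholder values):
signon.firefoxRelay.base_url   = https://<relay-stage-host>/api/v1/
signon.firefoxRelay.manage_url = https://<relay-stage-host>/
```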
