
MPP-3505: Rewrite Account Bearer Token Authentication #5272

Open · wants to merge 68 commits into main from wip-terms-accepted-user-mpp-3505

Conversation

@jwhitlock (Member) commented on Dec 20, 2024:

This PR adds a new implementation of Bearer Token Authentication, for creating a new user or authorizing with a Mozilla account user logged into Firefox. By default, the existing authentication implementation is used. Setting the environment variable FXA_TOKEN_AUTH_VERSION=2025 picks the new implementation.

While working on MPP-3505 (an IntegrityError while setting up a Relay account for a Mozilla account user through Firefox), I found one important bug in the existing implementation. The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number within a given Python process, but a different number in a different process, such as a different pod or a different gunicorn worker on the same pod. This means these responses were almost never cached.

I'm reluctant to fix just this issue, because we don't know what will happen if we start using the cache reliably.
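For reference, a process-stable key can be built with a real hash function instead of the built-in hash(). A minimal sketch, assuming the token is a str; the helper name and key prefix are illustrative, not this PR's actual code:

```python
import hashlib

def get_cache_key(token: str) -> str:
    # sha256 is deterministic across processes, unlike the built-in hash(),
    # which CPython randomizes per process for str inputs.
    return "introspect:" + hashlib.sha256(token.encode()).hexdigest()
```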

The new implementation uses a cache key that is consistent across pods. It also has some potential improvements:

  • Accounts introspection API responses and errors are cached. In the existing implementation, 'write' requests like a POST call the API and do not cache the results.
  • /api/v1/terms-accepted-user more consistently returns a 401 or 403 response. The Firefox integration is looking for a 401 or a 403 response, but the existing implementation returns 404 in some instances.
  • The new code returns 503 Service Unavailable when the upstream Accounts API is unavailable. This could be tuned if the Firefox integration handles it poorly.
  • The new code tracks the time it takes to call the Accounts introspection API and profile API. These timers will let us determine when Accounts is having an issue that affects Relay.
  • The new FxaTokenAuthentication is based on Django REST Framework's TokenAuthentication, uses permission classes to add checks beyond the token check, and returns token details in request.auth. This makes the authentication look more like other DRF authentication, and allows better code sharing between /api/v1/terms-accepted-user and other endpoints the Firefox integration uses, like /api/v1/relayaddresses (see the sketch after this list).
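A hedged sketch of the shape that last bullet describes, using DRF's standard extension points. FxaTokenAuthentication and introspect_token_or_raise are names from this PR; lookup_user and HasValidFxaToken are hypothetical, and the bodies are illustrative rather than the PR's actual code:

```python
from rest_framework.authentication import TokenAuthentication
from rest_framework.permissions import BasePermission

class FxaTokenAuthentication(TokenAuthentication):
    # DRF's TokenAuthentication parses "Authorization: <keyword> <token>"
    # headers; overriding the keyword accepts "Authorization: Bearer ...".
    keyword = "Bearer"

    def authenticate_credentials(self, key):
        # introspect_token_or_raise (named later in this PR) returns an
        # IntrospectionResponse or raises an authentication error.
        fxa_resp = introspect_token_or_raise(key)
        user = lookup_user(fxa_resp)  # hypothetical helper
        # The second element becomes request.auth, so views and permission
        # classes can inspect the token details.
        return (user, fxa_resp)

class HasValidFxaToken(BasePermission):  # hypothetical permission class
    def has_permission(self, request, view):
        # Permission checks beyond the token check live here, not in the
        # authentication class.
        return request.auth is not None
```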

@jwhitlock jwhitlock marked this pull request as draft December 20, 2024 18:17
@groovecoder (Member) commented:
/api/v1/terms-accepted-user returns other error codes for some situations, like a 404 if the introspection result does not have a sub field. However, the Firefox integration is looking for a 401 or a 403 response. We should use a 5xx response, like 503 Service Unavailable, for these unexpected responses, to signal to the client to retry later.

Do we know how many 5xx responses we would return with this change? Those turn into ugly error messages to Firefox clients and end-users. I wonder if we should spend the time to prevent the errors completely rather than surface more up to end-users?

@groovecoder (Member) left a review:

I didn't look through the tests yet, but I left some first comments on the first draft of the functional code.

def get_cache_key(token):
    return hash(token)
@groovecoder (Member) commented:
TIL: "Python's built-in hash() function can produce different results between different Python processes"

and

The cache for the Accounts introspection response uses hash(token) as a cache key. This gives a consistent number within a given Python process, but a different number in a different process, such as a different pod or a different gunicorn worker on the same pod. This means these responses were almost never cached.
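A quick way to see the randomization (CPython randomizes str hashing per process unless PYTHONHASHSEED is pinned); this demo spawns two child interpreters so no specific values need to be assumed:

```python
import subprocess
import sys

# Each child interpreter computes a different hash for the same string,
# so hash(token) cannot work as a cross-process cache key.
for _ in range(2):
    print(subprocess.check_output(
        [sys.executable, "-c", 'print(hash("some-token"))']
    ).decode().strip())
```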

# If the response is an error, raise an exception
if isinstance(fxa_resp, IntrospectionError):
    if not fxa_resp.from_cache:
        fxa_resp.save_to_cache(cache, token, default_cache_timeout)
@groovecoder (Member) commented:
question (non-blocking): If I understand correctly, this means we will cache introspection errors for just 60s? I like that. When I first saw the cache + error code, I thought it might cache something like a timeout error for a long time. This looks like it will prevent a thundering herd AND recover from temporary errors in a reasonable amount of time.

@jwhitlock (Member Author) replied:
Yes, I believe the previous code also cached errors for 60 seconds. Except, because of hash(), it is possible that cached errors were never loaded again.
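A minimal sketch of the behavior being discussed, using Django's low-level cache API; the key and value shapes are illustrative, and the 60-second figure comes from this thread:

```python
from django.core.cache import cache

# Cache an introspection error with a short TTL. For ~60 seconds, repeated
# requests with the same token get the cached error instead of hammering the
# Accounts API (no thundering herd). After the TTL, cache.get() returns None
# and the next request re-introspects, recovering from transient errors.
cache.set("introspect:error:<cache-key>", {"status_code": 503}, timeout=60)
assert cache.get("introspect:error:<cache-key>") == {"status_code": 503}
```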


fxa_resp: IntrospectionResponse | IntrospectionError | None = None
if use_cache:
    fxa_resp = load_introspection_result_from_cache(cache, token)
@groovecoder (Member) commented:
question (non-blocking): What happens if the cached response is an expired token? It looks like we save the response with a cache expiration time to match the token expiration time, but does the cached = cache.get(cache_key) call return None in this scenario? I didn't see a test for that.

@jwhitlock (Member Author) replied:
There are a few systems involved, with a different answer for each: the cache, the FxA endpoint, and Relay.

The Redis backend checks the expire time on a GET, and if the key is expired, it deletes it and returns None. The tests exercise both cache misses and hits. I'm not sure what the Django in-memory cache, which is used in tests, does, but it probably doesn't matter.

If you attempt to introspect an expired token, I expect FxA to return an error. I'd need to research what that error is.

If a user presents an expired token that we've cached, we treat it as authorized. I don't think we check the expiration of cached tokens. There's a chance this has never happened in production, since using hash() meant every call was a cache miss. But this new code would be a cache hit, so we may want to check for expiration or near-future expiration.
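One way to add the near-future expiration check mentioned here, assuming the cached introspection payload carries an `exp` claim in seconds since the epoch (per RFC 7662; FxA's actual field name and units should be verified):

```python
import time

GRACE_SECONDS = 5  # illustrative safety margin for near-future expiration

def is_usable(cached_introspection: dict) -> bool:
    # Treat a cached result as a miss if the token is expired or about to
    # expire, forcing a fresh introspection call instead.
    exp = cached_introspection.get("exp")
    return exp is None or exp > time.time() + GRACE_SECONDS
```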

Comment on lines 374 to 376
self.use_cache = (method == "POST" and path == "/api/v1/relayaddresses/") or (
    method not in ("POST", "DELETE", "PUT")
)
@groovecoder (Member) commented:
suggestion (non-blocking): This logic seems harder to follow than nested if statements. Can we revert that? (I also don't remember why we force use_cache=True when POSTing to /api/v1/relayaddresses but I'm sure there's a reason.)

@jwhitlock (Member Author) replied:
Sure, I can revert to the nested if statements. I'll see if I can determine from the development PR why we use the cache for creating a new mask. Maybe the Firefox code assumes that if the token was good enough to get the list of masks, it will still be good when creating one, and has less error handling.
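For reference, a nested-if form equivalent to the boolean expression quoted above, in the same method context; this is a sketch of the suggested revert, not the final code:

```python
if method in ("POST", "DELETE", "PUT"):
    # Write requests skip the cache, except for mask creation.
    self.use_cache = method == "POST" and path == "/api/v1/relayaddresses/"
else:
    # Read requests always use the cache.
    self.use_cache = True
```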

if response:
    return response

# Since this takes time, see if another request created the SocialAccount
@groovecoder (Member) commented:
question (non-blocking): do you suspect this might be the cause of the SocialAccount.DoesNotExist errors? That separate requests are creating/retrieving the SocialAccount object?

@jwhitlock (Member Author) replied:
That is my best guess. But, it could be something weirder!
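A hedged sketch of the race and the re-check that later commits in this PR describe. django-allauth's SocialAccount model and fields are real; the helper itself is illustrative:

```python
from allauth.socialaccount.models import SocialAccount
from django.db import IntegrityError

def get_or_create_fxa_account(user, fxa_uid, extra_data):
    try:
        # The profile fetch is slow, so a parallel request may have created
        # the SocialAccount while this request was waiting on it.
        return SocialAccount.objects.get(provider="fxa", uid=fxa_uid)
    except SocialAccount.DoesNotExist:
        pass
    try:
        return SocialAccount.objects.create(
            user=user, provider="fxa", uid=fxa_uid, extra_data=extra_data
        )
    except IntegrityError:
        # Lost the race: a colliding row appeared between the check and the
        # insert. Fetch the winner instead of failing.
        return SocialAccount.objects.get(provider="fxa", uid=fxa_uid)
```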

@jwhitlock (Member Author) left a review:

Thanks for the review @groovecoder! This one is still a work in progress. My notes from last year:

  • Add the token to IntrospectionResult / IntrospectionError. Either omit from the cache or require it to be the same.
  • Remove token processing from terms_accepted_user
  • Try permissions classes instead of permission checks in app
  • Try always returning anon user, use permission check to reject
  • Add timing for introspection API, profile API, and log it and stat it
  • Are we done?
  • Go back to main and implement as a parallel auth system, throw away this branch

So I do not think this will be the final PR. I'll answer some questions if I can before that final PR.

Additional research and tasks from this review:

  • How does the existing code handle expired tokens (if it does)?
  • What does FxA's introspect do for an expired token?
  • Test and handle cached but expired tokens
  • Why does POST /api/v1/relayaddresses use the cache, while POST to other endpoints skips it?


@jwhitlock (Member Author) commented:
The decision about the cache read on POST /api/v1/relayaddresses is on MPP-3156. There's no justification recorded, but I'm guessing the POST always comes after a GET? I'd be OK with removing this exception.

Remaining items:

  • Add timing stat for profile fetch
  • (Maybe) add tests for logs and stats
  • Add checks and tests for expired tokens
  • Try sending an expired token to FxA introspect
  • Throw away and restart on main

Reproduce the IntegrityError by creating a matching user and
SocialAccount after checking for a matching user by email. This may not
be the exact mechanism in production, but it does produce the same
traceback.

Lots of changes that could have been in multiple commits.

In authentication tests:

* Split the AuthenticationMiscellaneous TestCase into IntrospectTokenTests
  and GetFxaUidFromOauthTokenTests, remove name prefixes, simplify
  setup.
* Convert _setup_fxa_response to setup_fxa_introspect. It now constructs
  the payload as well as mocking the response, and returns the mocked
  response and expected cached data.
* Use self.assertRaisesMessage for consistent exception checking.
* Assert on the mocked response call_count, not the URL-matched count.

In terms_accepted_user tests:

* Add _mock_fxa_profile_response
* Use the new setup_fxa_introspect
* Use _setup_client everywhere
* Add mocked response checks
* Add cache value checks, rename incorrect test titles

Create the SocialAccount in a new function outside of a try block, to
avoid nested exceptions.

Because the Mozilla Accounts profile fetch takes a while, this is the
likely time for a parallel SocialAccount to be created. Check again
before proceeding to create a new one.

The Django logout() command, since at least 1.10, also checks whether the
user was logged in, so our check is redundant (and has missing branch
coverage).

If a colliding SocialAccount is created by a separate request, catch the
IntegrityError and return 500.

Reimplement FxaTokenAuthentication on TokenAuthentication, to get the
DRF-provided parsing of token authentication headers. This changes the
status code for a header of 'Authorization: Bearer ' (token value
omitted) from a 400 (Bad Request) to a 401 (Unauthorized).

The result of hash(str) changes between Python instances, so the
previous version would lead to many cache misses.

Return the introspection results instead of the expected cache contents.
This may make the cache value change more obvious.

Instead of caching once with a 60-second TTL and again with the expiration
TTL, cache once with the proper TTL.

Move the token validation logic to `introspect_token`. This now either
returns an IntrospectionResponse, for a valid token for an active user,
or an IntrospectionError, if something went wrong. Both have a
`save_to_cache` method to save them to a cache, and a new function
`load_introspection_result_from_cache` can load them.

`get_fxa_uid_from_oauth_token` changes to `introspect_token_or_raise`.
It loads results from the cache, caches results, and raises exceptions.

A large change is that it caches results even when `use_cache` is False.
This means more introspect results are cached.

FxaTokenAuthentication.authenticate now returns an IntrospectionResponse
as the second parameter.

The terms_accepted_user endpoint's error returns have changed. The new
status codes:

* 503 - The introspection API is (temporarily) unavailable
* 401 - The Bearer token is invalid, or the user is inactive
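Stitching together the snippets and commit messages above, a hedged sketch of the `introspect_token_or_raise` flow. The function and class names (introspect_token, IntrospectionError, save_to_cache, load_introspection_result_from_cache) come from this PR; the exception classes, the `is_upstream_error` attribute, and the bodies are illustrative:

```python
from django.core.cache import cache
from rest_framework.exceptions import APIException, AuthenticationFailed

default_cache_timeout = 60  # 60-second figure from the discussion above

class IntrospectionUnavailable(APIException):  # hypothetical exception class
    status_code = 503
    default_detail = "The introspection API is temporarily unavailable."

def introspect_token_or_raise(token, use_cache=True):
    fxa_resp = None
    if use_cache:
        fxa_resp = load_introspection_result_from_cache(cache, token)
    if fxa_resp is None:
        # Fresh call; returns IntrospectionResponse or IntrospectionError.
        fxa_resp = introspect_token(token)
    if isinstance(fxa_resp, IntrospectionError):
        if not fxa_resp.from_cache:
            fxa_resp.save_to_cache(cache, token, default_cache_timeout)
        if fxa_resp.is_upstream_error:  # hypothetical attribute
            raise IntrospectionUnavailable()  # 503
        raise AuthenticationFailed()  # 401: invalid token or inactive user
    if not fxa_resp.from_cache:
        # Results are cached even when use_cache is False (the cache read is
        # skipped, the write still happens), per the commit message above.
        fxa_resp.save_to_cache(cache, token, default_cache_timeout)
    return fxa_resp
```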
@jwhitlock jwhitlock marked this pull request as ready for review January 17, 2025 23:51
@jwhitlock jwhitlock force-pushed the wip-terms-accepted-user-mpp-3505 branch from a86f701 to e17060f on January 18, 2025 00:12
@jwhitlock jwhitlock changed the title from "WIP MPP-3505: Rewrite Account Bearer Token Authentication" to "MPP-3505: Rewrite Account Bearer Token Authentication" on Jan 21, 2025
@jwhitlock (Member Author) commented:
@groovecoder this is ready for review. I was able to implement the flag without starting over from main.

This is the "big switch" version of this change. There are other ways to make these changes. For example, I could extract smaller bits, like checking for a social account again after fetching a profile. Let me know if you'd like a different strategy.

@groovecoder (Member) replied:
This is the "big switch" version of this change. There are other ways to make these changes. For example, I could extract smaller bits, like checking for a social account again after fetching a profile. Let me know if you'd like a different strategy.

We can keep it like this. Though, I'm in no hurry to throw any big Fx integration switches while we're working to expand the Fx integration 10-100x, unless we think we need something like the caching to handle the 10-100x load? But then, it seems like if we fix the caching, that's the change that could have the most unknown effect on the Fx integration flow?

@jwhitlock (Member Author) replied:
We can keep it like this. Though, I'm in no hurry to throw any big Fx integration switches while we're working to expand the Fx integration 10-100x, unless we think we need something like the caching to handle the 10-100x load? But then, it seems like if we fix the caching, that's the change that could have the most unknown effect on the Fx integration flow?

I agree, I'd want to keep the proven code running during the launch of the integration. The proven code is responsible for 50% of errors, but it is likely these do not impact the signup flow, since they haven't shown up in QA testing or customer service reports.

Ideally, I'd be able to run the new implementation in prod before the ship date, to collect some data in that configuration. The next best option is to run it in stage, assuming there is a way to test the Firefox integration in stage. That should allow collecting data without affecting Relay users.

Do you know if there is a way to test the Firefox integration in stage?

@groovecoder (Member) replied:
Do you know if there is a way to test the Firefox integration in stage?

Yes, there are two about:config prefs that determine which Relay URL Firefox uses: signon.firefoxRelay.base_url and signon.firefoxRelay.manage_url.
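For example, something like the following in about:config; the hostnames and paths are placeholders, and the exact value format each pref expects should be confirmed against the Firefox integration code:

```
# Point Firefox at the stage deployment instead of prod (placeholder values):
signon.firefoxRelay.base_url   = https://<relay-stage-host>/api/v1/
signon.firefoxRelay.manage_url = https://<relay-stage-host>/
```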
