Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullPointerExeption thrown whenever we archive a flag #278

Open
pal-tgiannakopoulos opened this issue Sep 19, 2024 · 12 comments
Open

NullPointerExeption thrown whenever we archive a flag #278

pal-tgiannakopoulos opened this issue Sep 19, 2024 · 12 comments

Comments

@pal-tgiannakopoulos
Copy link

pal-tgiannakopoulos commented Sep 19, 2024

Is this a support request?
No

Describe the bug

      Fatal Exception: java.lang.NullPointerException: Attempt to invoke virtual method 'int java.lang.Object.hashCode()' on a null object reference
   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
   at com.launchdarkly.sdk.android.ContextDataManager.notifyFlagListeners(ContextDataManager.java:317)
   at com.launchdarkly.sdk.android.ContextDataManager.initDataInternal(ContextDataManager.java:179)
   at com.launchdarkly.sdk.android.ContextDataManager.switchToContext(ContextDataManager.java:100)
   at com.launchdarkly.sdk.android.LDClient.identifyInternal(LDClient.java:427)
   at com.launchdarkly.sdk.android.LDClient.identifyInstances(LDClient.java:451)
   at com.launchdarkly.sdk.android.LDClient.identify(LDClient.java:399)

We are assuming that the following code inside ContextDataManager.java is causing a null element to end up in the updatedFlagKeys collections which is used as a parameter to notifyFlagListeners method.

        for (Flag newFlag: newData.values()) {
            Flag oldFlag = oldData.getFlag(newFlag.getKey());
            if (oldFlag == null || !oldFlag.getValue().equals(newFlag.getValue())) {
                // if the flag is new or the value has changed, notify.  This logic can be run if
                // the context changes, which can result in an evaluation change even if the version
                // of the flag stays the same.  You will notice this logic slightly differs from
                // upsert.  Upsert should only be calling to listeners if the value has changed,
                // but we left upsert alone out of fear of that we'd uncover bugs in customer code
                // if we added conditionals in upsert
                updatedFlagKeys.add(newFlag.getKey());
            }
        }

To reproduce
We haven't managed to reproduce this on demand but we are seeing this crash whenever we are archiving a flag.
In our app we allow logging out, which causes a call to identify with an anonymous context.
Our assumption based on user flow logs is that this action is causing the crash to happen probably after a flag is archived.

Expected behavior
To not crash

Logs
No other logs related other than the crashlog in the description

SDK version
Android SDK 5.3.0

Language version, developer tools
Kotlin 1.9.24
Android compile SDK version 34
Android target SDK version 34
Gradle plugin version 8.5.0
Gradle version 8.8

OS/platform
Android regardless of version
Our app runs from Android 8 to Android 14
All of them seem to have the same problem

Additional context
Add any other context about the problem here.

@tanderson-ld
Copy link
Contributor

Hi @pal-tgiannakopoulos, thank you for reporting this issue. Logging internally as SDK-573. We will investigate.

@tanderson-ld
Copy link
Contributor

@pal-tgiannakopoulos, do you have a count of occurrences from crash reporting tooling? What is that as an absolute number and as a percentage of prod customers?

@tanderson-ld
Copy link
Contributor

I did a preliminary investigation. I agree with your conclusion that a null flag key is ending up in that updatedFlagKeys collection, but after having traced the code, I don't see how that is happening.

Are you using polling or streaming? If you haven't specified in the SDK config, the default is streaming.

The SDK uses the flag key that is received from the server and we replace the local representation with a deletedItemPlaceholder, but the key is never nulled intentionally. Here is what is received by the SDK when archiving a flag.

event:delete
data:{"key":"test-archive","version":101}

@pal-tgiannakopoulos
Copy link
Author

hi @tanderson-ld
Thanks for taking a look at this

The crashes we are getting are not a lot. Each time we have archived a flag we got around 30-40 crashes.
We do have around 25K users daily, but the app does not promote logging out, so the combination of having a user that has the app loaded when the archiving happens and that user then tries to logout is very minimal, so we believe it explains the low crash numbers but also the sudden spike that happens.

We don't explicitly set polling or streaming, so I guess we use streaming

As for the actual problem, I don't think your problem is with what the server is sending, but with what is stored when that happens.
I see the line EnvironmentData storedData = getStoredData(currentContext); in ContextDataManager.switchToContext and its result is sent as the newData in the initDataInternal. What is "newData" then? It comes from the DataStore. When the update is received what is stored in the datastore? Is there a chance you have "deleted" the payload from the datastore and leaving nothing in, but the entry still exists? That could end up in a null key

@tanderson-ld
Copy link
Contributor

HI again @pal-tgiannakopoulos , I did some more testing with various combinations of init, identify, and archive at various stages and was unable to reproduce the issue. I also double checked the storage layer and it looks to be operating as expected.

Below is serialized flag data for the flag I was testing with.

Before archive:

"StringVarFlag":{"flagVersion":21,"key":"StringVarFlag","trackEvents":false,"value":"theRuleMatchedLoggedIn","variation":1,"version":116}

After archive but before restart:

"StringVarFlag":{"deleted":true,"key":"StringVarFlag","version":117}

The next app launch, the deleted flag gets wiped out by the incoming payload which does not contain the flag at all.

@tanderson-ld
Copy link
Contributor

tanderson-ld commented Oct 3, 2024

Do you by chance trigger any routines on flag updates being received?

Another thought: I need to verify, but the flag is only marked as deleted client side in the SDK if the phone was connected and streamed the DELETE update that happens when the flag is archived. What do you think are the odds that ~30-40 customers are connected when the flag is archived and then choose to logout? 30/25k is 0.12% which would be the product of the probability(being connected) X probability (logging out).

@pal-tgiannakopoulos
Copy link
Author

hi @tanderson-ld,

yes you are correct about the probabilities. This is my assumption as well, that's probably why we don't see hundreds or thousants of crashes.

And yes I trigger updates on flag updates being received.

I am exposing flags as Flows for my app to know about. These are typically collected from our viewModels.
The Flow is per feature flag.
I use a callbackFlow to wrap the logic of getting updates from LaunchDarkly where I am calling the registerFeatureFlagListener and keeping the coroutine alive with await for as long as the Flow is being collected.

simplified relevant code:

fun getFlagValue(flag: String, defaultValue: Boolean): Flow<Boolean> = callbackFlow {

        val listener = FeatureFlagChangeListener {
            if (isActive) {
                require(it == flag) { "Expected flag key(${flag}), but received update for flag key($it)" }
                trySend(client.boolVariation(flag, defaultValue))
            }
        }
        client.registerFeatureFlagListener(flag, listener)

        send(client.boolVariation(flag, defaultValue))

        awaitClose {
            client.unregisterFeatureFlagListener(flag, listener)
        }
    }

Please note, while I am observing only flags my app knows and cares about, I have seen the crashes happening when archiving flags that are not relevant to the app, so I don't know how much the above logic would be responsible for triggering the crash. This method also does not appear in the crashlog. The crash on my code derives from another Flow collection where the LoginState is observed and the client.identify(context) is being called as a result.

@pal-tgiannakopoulos
Copy link
Author

@tanderson-ld
Can you please try adding .evaluationReasons(true) in your config?
I am using it on mine. Is there a chance this is what is causing the issue for me and you cannot replicate without it?

@tanderson-ld
Copy link
Contributor

Just tried that too and still not able to reproduce the issue. Have you been able to reproduce it locally?

@pal-tgiannakopoulos
Copy link
Author

No I haven't tried.
I cannot pollute our project with test flags, I will need to create a new Launch Darkly app and re-implement.
I will update you if I find time to do this, currently there's limited resource I can spend on this.

@pal-tgiannakopoulos
Copy link
Author

I spent several hours yesterday to replicate this in a controlled environment and I couldn't, so sadly I cannot provide anything else other than the crashlogs in the issue description.

tanderson-ld added a commit that referenced this issue Jan 21, 2025
…itization of an incoming message from LD servers (#288)

After examining related code to a customer support case and #278,
noticed this modification during iteration. Theoretical issue is that
modification during iteration, even if not resulting in
ConcurrentModificationExceptions, could result in unpredictable
iteration and could possibly skip sanitizing certain incoming flags.

Through bench testing and instrumentation testing, I was unable to
reproduce the issue, but I suspect it may be dependent on the specific
platform's implementation of the Map type that GSON will depend on. Will
deploy this fix and work with customers to confirm if the occurrences of
the issue is eliminated.
@tanderson-ld
Copy link
Contributor

Hi again @pal-tgiannakopoulos. Another customer reported this issue, so I re-investigated and during scanning over the code, I noticed a possible culprit but have been unable to reproduce it.

The attempted fix has been merged in this PR and I will be releasing it later today. The description on that PR has the theoretical cause.

I tried to stress test it, but was unable to. If you have the time and desire, could you provide Phone model and OS versions of the devices that your crash reporting tools see the issue happening on? That could help us dig into which versions/source code of map implementations that may be involved on those devices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants