Spring Boot v3.4.0 causes our staging & production environment to hang and time out #43332

Closed
bjornharvold opened this issue Nov 30, 2024 · 6 comments
Labels
status: invalid An issue that we don't feel is valid

Comments

@bjornharvold

Our services work on all Spring Boot versions prior to v3.4.0 and have done so for years. v3.4.0 works in our dev environment, and we are unable to reproduce there what is occurring in staging and production.

We have 4 Spring Boot apps running on GCP

  • Spring Authorization Server (does not seem to be affected)
  • Spring Boot with WebFlux (does not seem to be affected)
  • Two Spring Boot Web MVC apps (affected)

Here is what happens once the deployment hits staging / production:

  • No errors in the startup log
  • Very quickly, if not immediately, after deployment, requested endpoints start to hang

GCP metrics (response times go to 300 seconds and requests time out after the release)
Screenshot 2024-11-30 at 15 15 56

MongoDB console (Activity decreases after release)
screencapture-cloud-mongodb-v2-590aeb37c0c6e35c7ce2d87f-2024-11-30-15_09_07

Cloud Run startup log showing Spring Boot starting without errors and then going straight to timeouts
Screenshot 2024-11-30 at 19 47 40

Maven dependency tree:
dependencies.txt.zip

The next step for us would be to turn logging up to a verbose level to see whether anything interesting shows up. I just spent a Saturday afternoon rolling back production and trying to figure out where the problem was coming from. At this moment, I am completely clueless.

Any help would be appreciated.

We deploy Spring Boot with mvn spring-boot:build-image and the plugin config looks like this:

<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <executions>
        <execution>
            <id>process-aot</id>
            <configuration>
                <profiles>local,staging,prod</profiles>
            </configuration>
        </execution>
    </executions>
    <configuration>
        <image>
            <env>
                <BPE_DELIM_JAVA_TOOL_OPTIONS xml:space="preserve"> </BPE_DELIM_JAVA_TOOL_OPTIONS>
                <BPE_APPEND_JAVA_TOOL_OPTIONS>--add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.math=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED -Dspring.profiles.active=${environment} -Djava.security.manager=allow</BPE_APPEND_JAVA_TOOL_OPTIONS>
                <BP_JVM_TYPE>JRE</BP_JVM_TYPE>
                <BP_JVM_VERSION>${java.version}</BP_JVM_VERSION>
            </env>
        </image>
    </configuration>
</plugin>

Development environment:
macOS Sequoia
Atlas CLI
MongoDB 8
Java 23

Providers:
MongoDB Atlas
MongoDB 8
Google Cloud Platform / Cloud Run
Java 23

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Nov 30, 2024
@bclozel
Member

bclozel commented Nov 30, 2024

I think the critical information here is the state of the JVM threads. This would let us know what's preventing the app from serving requests.

Can you capture this information and let us know?

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Nov 30, 2024
@bjornharvold
Author

Hi @bclozel,

Thank you for responding as quickly as you did.

I created an endpoint that would return a thread dump like this:

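import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.stream.Collectors;

// Returns a dump of all live threads, including locked monitors and synchronizers.
public String getThreadDump() {
    ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
    ThreadInfo[] threadInfos = threadMXBean.dumpAllThreads(true, true);

    return Arrays.stream(threadInfos)
            .map(ThreadInfo::toString)
            .collect(Collectors.joining("\n"));
}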

Immediately after the application started on Cloud Run, I was able to collect the first thread dump here:
thread-dump.txt
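
One caveat with an endpoint like the one above: in OpenJDK, ThreadInfo.toString() prints only the first few stack frames of each thread (eight at the time of writing), which can hide the frames that matter in a deep Spring call stack. The following is a minimal sketch of a formatter that walks the full stack trace instead. The class name FullThreadDump and the output layout are illustrative assumptions, not something from the original report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: formats every live thread with its complete stack trace.
final class FullThreadDump {

    private FullThreadDump() {
    }

    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details the running JVM says it can provide.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append('"')
                    .append(" id=").append(info.getThreadId())
                    .append(" state=").append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                out.append(" held by \"").append(info.getLockOwnerName()).append('"');
            }
            out.append('\n');
            // Unlike ThreadInfo.toString(), emit every frame.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Returning FullThreadDump.dump() from the same endpoint would keep the rest of the setup unchanged. It still depends on the application being able to serve a request, so it does not help once the server has stopped responding.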

I can hit the unauthenticated endpoint "/" an unlimited number of times without any issues.

After I authenticate with Spring Authorization Server 1.4.0 and try to hit a secured endpoint, the server becomes completely unresponsive. I can no longer capture a thread dump, and Cloud Run times out the connection after 300 seconds.

At this point, the unauthenticated endpoint "/" is no longer responsive, regardless of whether I hit it unauthenticated or authenticated.

It looks like this applies to our other Spring Boot instances as well... the moment I am authenticated with SAS and then try to call any endpoint... 💥☠️ Server hangs.

There is nothing in any of the instance logs, including SAS, that shows any errors, any looping, or anything out of the ordinary.

Here's a screenshot of what all the instance logs look like afterwards:
Screenshot 2024-12-01 at 12 31 13

I don't see anything out of the ordinary in the SAS v1.4.0 release notes compared with the SAS version that Spring Boot 3.3.6 depends on:
https://github.com/spring-projects/spring-authorization-server/releases/tag/1.4.0

I also don't see anything that should concern us in the latest Spring Data MongoDB release:
https://github.com/spring-projects/spring-data-mongodb/releases/tag/4.4.0

The app that is unresponsive still logs MongoDB pings, as you can see in this screenshot [as if everything is hunky-dory]:
Screenshot 2024-12-01 at 12 47 40

Let me know how else I can help here. It's not easy capturing a thread dump with Cloud Run when the app is in this, or any, state.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 1, 2024
@bjornharvold
Author

Just to follow up on memory: the instance that hangs runs on 2 GB of memory with 1 CPU [and has been doing so for years].

Here's a Cloud Run monitoring screenshot:
screencapture-console-cloud-google-monitoring-dashboards-integration-cloud-run-cloudrun-monitoring-duration-P1D-2024-12-01-12_53_25

@bclozel
Member

bclozel commented Dec 1, 2024

It looks like something is blocking threads, or there is a memory leak or infinite recursion.
I’m not a Cloud Run expert, but is there a way to send a "kill -3" signal to the container when it’s behaving badly? This should print a thread dump to the console.

The first thread capture doesn’t show anything related to Spring. Maybe you are using Java agents or instrumentation libraries that are not compatible with the latest Spring version?
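
As a rough way to check the Java agent part of that question from inside the application, the JVM's own input arguments can be scanned for agent flags. This is only a sketch: the class name AgentFlags and its method names are made up for illustration, and it only sees agents passed as -javaagent, -agentlib or -agentpath options at startup. Instrumentation that ships as an ordinary library on the classpath, or an agent attached after startup, will not appear here.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: lists agent-related flags among JVM arguments.
final class AgentFlags {

    private AgentFlags() {
    }

    // Keeps only the arguments that load a Java or native agent.
    static List<String> agentFlags(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:")
                        || arg.startsWith("-agentlib:")
                        || arg.startsWith("-agentpath:"))
                .collect(Collectors.toList());
    }

    // Applies the filter to the arguments the running JVM was started with.
    static List<String> currentJvmAgentFlags() {
        return agentFlags(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Logging AgentFlags.currentJvmAgentFlags() once at startup in both dev and staging would show whether the two environments start the JVM with different agents.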

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Dec 1, 2024
@bjornharvold
Author

There is unfortunately no way to do a kill -3 on a Cloud Run instance.
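
One possible workaround when no signal can be sent is to have the application write its own thread dumps to standard output on a timer, so that a dump taken during the hang ends up in the container log without needing a working HTTP endpoint. The sketch below assumes a plain daemon scheduler started once at application startup. The class name ThreadDumpWatchdog, the thread name and the output format are illustrative, and the approach only works if the JVM is still able to schedule the watchdog thread while requests are stuck.

```java
import java.io.PrintStream;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: periodically prints a thread dump without needing kill -3.
final class ThreadDumpWatchdog {

    private ThreadDumpWatchdog() {
    }

    // Writes one dump, preceded by the number of deadlocked threads found.
    static void writeDump(PrintStream out) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Both calls return null when no deadlock is detected.
        long[] deadlocked = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        out.println("=== thread dump, deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length) + " ===");
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                out.println("    at " + frame);
            }
        }
        out.flush();
    }

    // Starts a daemon thread that dumps every periodSeconds seconds.
    static ScheduledExecutorService start(long periodSeconds, PrintStream out) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "thread-dump-watchdog");
            t.setDaemon(true); // must not keep the JVM alive on shutdown
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> writeDump(out),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Calling ThreadDumpWatchdog.start(30, System.out) during startup would produce a dump every 30 seconds. The interval is a trade-off between log volume and the chance of catching the hang, and the dumps are only useful if they are compared across the moment the first authenticated request arrives.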

After your last remark I am leaning towards Sentry being the culprit.

spring-io/start.spring.io#1647
getsentry/sentry-java#3941

Will continue my investigation there. Close issue at will.

Cheers Brian 🍻

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 1, 2024
@bclozel
Member

bclozel commented Dec 2, 2024

I don't think we can track down the source of the problem without a snapshot of the Java threads taken while the app is having issues. This could come from any library on your classpath, any Java agent, or an incompatibility with a remote resource. I haven't seen anything so far pointing to Spring Boot causing issues; we can reopen this issue if we find new information.

I'm not familiar enough with Google Cloud Run, but not being able to connect to the JVM in any way is quite limiting. Maybe there is a way to configure the instance to open a port and connect a profiler to the running JVM?

Closing this issue for now.

@bclozel bclozel closed this as not planned Won't fix, can't repro, duplicate, stale Dec 2, 2024
@bclozel bclozel added status: invalid An issue that we don't feel is valid and removed status: waiting-for-triage An issue we've not yet triaged status: feedback-provided Feedback has been provided labels Dec 2, 2024