Investigate/Verify Increase in 500 errors #101926

Open
andrea-merrill opened this issue Jan 28, 2025 · 2 comments

andrea-merrill commented Jan 28, 2025

We're seeing an increase in 500 errors over the past 24 hours. We'd like someone to look at the 500 errors received on 1/27 and 1/28 and verify the following:

  • Are these related to the file not found error?
  • Are these the result of users resubmitting after being shown a 500 error?
    • If so, find uuid and email address and/or name
    • Find all submissions that have the same email address and/or name
    • Confirm that at least one submission was complete and successful
  • Can we confirm that a 500 error was shown to the user?
  • If a new error is occurring that we were unaware of, bring findings back to team to discuss and create new tickets to further investigate/solve
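The resubmission check in the list above could be sketched roughly as follows. This is only a minimal sketch in Python, assuming the submissions can be exported as records with `uuid`, `email`, and `status` fields. Those field names, the `"success"` status value, and the `resubmission_report` helper are all hypothetical and not taken from the actual database.

```python
from collections import defaultdict


def resubmission_report(submissions, failed_uuids):
    """For each email tied to a failed UUID, list every submission from that
    email and whether at least one of them succeeded.

    `submissions` is an iterable of dicts with hypothetical keys
    "uuid", "email", and "status".
    """
    # Group all submissions by normalized email address.
    by_email = defaultdict(list)
    for sub in submissions:
        by_email[sub["email"].strip().lower()].append(sub)

    failed = set(failed_uuids)
    report = {}
    for email, subs in by_email.items():
        # Only report emails that have at least one failed (500) submission.
        if any(s["uuid"] in failed for s in subs):
            report[email] = {
                "uuids": [s["uuid"] for s in subs],
                "has_success": any(s["status"] == "success" for s in subs),
            }
    return report
```

An email with `has_success` set to `False` would be one where the user never got a complete submission through, which is the case worth following up on first.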


cloudmagic80 commented Jan 30, 2025

Are these related to the file not found error? Yes, I suspect these errors are related to the fix we attempted for that error. Here is the [User Story.]
Are these the result of users resubmitting after being shown a 500 error? I'm not sure yet. I will need to run more tests.
Can we confirm that a 500 error was shown to the user? Yes, this error indicates that the form wasn't submitted to the S3 bucket.

cloudmagic80 commented Jan 31, 2025

What's causing the spike in 500 errors in Datadog this week?

We know that this PR was merged into main on January 21, 2025, at 2:56 PM CST and released to production on January 22, 2025, at 12:00 PM CST. However, the increase in 500 errors in Datadog did not begin until January 27 at 9:00 AM CST, nearly five days after the release, which suggests the spike is likely caused by another factor.

Breakdown of Datadog 500 Errors (Jan 27 - Jan 30):
[Image: breakdown of Datadog 500 errors, Jan 27 - Jan 30]

The majority of these errors are related to the PDF stamping process. In all 83 occurrences, the failures happened within their respective Kubernetes clusters, meaning this is unlikely to be caused by processes shifting between pods.

The earliest error timestamp was at 5:00 AM CST, and the latest was around 11:00 PM CST.
Additionally, I cross-checked the UUIDs from the Datadog 500 errors against the database we sent to the PEGA team. None of the failed UUIDs matched those that successfully sent confirmation emails to users. This strongly suggests that users encountered an error message upon form submission. Given this pattern, these errors are highly likely to be Non-Zero Silent Failures.
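For anyone who wants to repeat the cross-check, it amounts to a set comparison between the two UUID lists. Here is a minimal sketch in Python, assuming both lists have been exported as plain strings. The `cross_check` helper and its output keys are hypothetical names used only for illustration.

```python
def cross_check(failed_uuids, confirmed_uuids):
    """Compare UUIDs from the Datadog 500 errors against UUIDs that have a
    confirmation email on record."""
    # Normalize case and whitespace so formatting differences don't hide a match.
    failed = {u.strip().lower() for u in failed_uuids}
    confirmed = {u.strip().lower() for u in confirmed_uuids}
    return {
        # Failed in Datadog but still got a confirmation email.
        "failed_and_confirmed": sorted(failed & confirmed),
        # Failed in Datadog and never got a confirmation email.
        "failed_only": sorted(failed - confirmed),
    }
```

An empty `failed_and_confirmed` list corresponds to the result described above, where none of the failed UUIDs matched a sent confirmation email.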

Next Steps: Investigating the Cause of the Increase
Debug and Inspect client.put_object Return Value:
The most critical next step is determining exactly what client.put_object is returning.
We should add logging immediately after the client.put_object call in the upload method to capture this data.
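The logging idea could look something like the sketch below. It is written in Python and assumes a boto3-style client, where `put_object` takes `Bucket`, `Key`, and `Body` keyword arguments and returns a dict containing `ETag` and `ResponseMetadata`. The thread does not say which SDK the upload method actually uses, so the `upload` signature, the logger name, and the response shape here are assumptions to adapt, not the real implementation.

```python
import logging

logger = logging.getLogger("s3_upload")  # hypothetical logger name


def upload(client, bucket, key, body):
    """Call put_object and log what it returned or raised.

    `client` is assumed to behave like a boto3 S3 client.
    """
    try:
        response = client.put_object(Bucket=bucket, Key=key, Body=body)
    except Exception:
        # Log the full traceback, then re-raise so existing error handling
        # (and the 500 shown to the user) is unchanged.
        logger.exception("put_object raised for bucket=%s key=%s", bucket, key)
        raise

    # Assumed boto3-style response: a dict with ETag and ResponseMetadata.
    meta = response.get("ResponseMetadata", {})
    logger.info(
        "put_object returned for bucket=%s key=%s status=%s etag=%s",
        bucket,
        key,
        meta.get("HTTPStatusCode"),
        response.get("ETag"),
    )
    return response
```

Logging both the success path and the exception path should make it possible to tell from Datadog whether `put_object` is raising or returning an unexpected response.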
Let me know if you have any thoughts or additional insights!
