Investigate Spacewalk Kubernetes issue #479

Closed
bogdanS98 opened this issue Jan 23, 2024 · 1 comment

@bogdanS98
Contributor

bogdanS98 commented Jan 23, 2024

Context

Further investigate, within a timebox of at most 2 days, the Spacewalk Kubernetes issue in which the connection to the Stellar overlay network gets stuck.
A local Kubernetes deployment of both the runner and the standalone vault binary works without issues.

Requirements

  • Compare configuration JSON between EC2 and EKS deployments
  • Compare vault images/versions between EC2 and EKS deployments
  • Update tokio library to the latest version
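
The first comparison step can be automated rather than done by eye. What follows is only a minimal sketch, not the procedure used in this investigation: it walks two parsed JSON documents recursively and reports every path at which they differ. The function names `diff_json` and `diff_config_files`, and the file names in the usage note, are hypothetical and chosen for illustration.

```python
import json
from typing import Any, Iterator, List


def diff_json(left: Any, right: Any, path: str = "$") -> Iterator[str]:
    """Yield one readable line per difference between two parsed JSON values."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Visit the union of keys so entries missing on either side are reported.
        for key in sorted(left.keys() | right.keys()):
            child = f"{path}.{key}"
            if key not in left:
                yield f"{child}: only in right ({right[key]!r})"
            elif key not in right:
                yield f"{child}: only in left ({left[key]!r})"
            else:
                yield from diff_json(left[key], right[key], child)
    elif isinstance(left, list) and isinstance(right, list):
        if len(left) != len(right):
            yield f"{path}: list length {len(left)} != {len(right)}"
        # Compare the overlapping prefix element by element.
        for index, (l_item, r_item) in enumerate(zip(left, right)):
            yield from diff_json(l_item, r_item, f"{path}[{index}]")
    elif left != right or type(left) is not type(right):
        # The type check also flags 1 vs 1.0 and true vs 1 as differences.
        yield f"{path}: {left!r} != {right!r}"


def diff_config_files(left_path: str, right_path: str) -> List[str]:
    """Load two JSON files and return the list of differences between them."""
    with open(left_path, encoding="utf-8") as lf, open(right_path, encoding="utf-8") as rf:
        return list(diff_json(json.load(lf), json.load(rf)))
```

Assuming the two configurations were exported to hypothetical files `config-ec2.json` and `config-eks.json`, calling `diff_config_files("config-ec2.json", "config-eks.json")` returns an empty list when the files are semantically identical, regardless of key order or whitespace.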

Findings

The configuration JSON files and Docker images are the same in the EC2 and EKS deployments, yet the issue occurs only on EKS, so neither explains it.
Upgrading the tokio library to the latest version (1.35) did not resolve the issue either.

Extended testing

In addition, I tested a similar Kubernetes setup from scratch in both EKS and GKE (on free-tier accounts unrelated to our org).

Testing setup

The Kubernetes spec files used in the tests are defined here. Instead of building the runner from scratch, I used the same runner binary that is present in our production deployments on EC2 and EKS.

Tested/Checked the following:

  • Increasing/removing resource limits
  • Allowing all network traffic
  • Creating multiple Docker images for the runner with different dependencies and different distros as base image (ubuntu:focal, ubuntu:latest, alpine:latest)
  • Using the same Linux kernel version on the free-tier EKS and GKE clusters as in our production EC2 and EKS deployments

Results

The issue was still present in all of the tests mentioned above.
After code changes that replaced tokio with async-std, the issue persisted when using the runner, but everything worked fine when running the standalone vault binary in Kubernetes.

The conclusion is that the issue comes from the runner code, as described in this ticket.

@bogdanS98 bogdanS98 self-assigned this Jan 23, 2024
@ebma ebma transferred this issue from pendulum-chain/pendulum Jan 23, 2024
@ebma
Member

ebma commented Feb 23, 2024

Thanks for documenting your findings @bogdanS98 👍. I'll close this ticket now so that we can continue the investigations in https://github.com/pendulum-chain/tasks/issues/207.

@ebma ebma closed this as completed Feb 23, 2024