
[Ray debugger] Unable to use debugger on Ray Cluster on k8s #45541

Open

yx367563 opened this issue May 24, 2024 · 11 comments
Labels

dashboard (Issues specific to the Ray Dashboard), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@yx367563

What happened + What you expected to happen

I tried to use the debugger plug-in in VS Code following the guidance at https://www.anyscale.com/blog/ray-distributed-debugger, but when I click on a paused task to attach the VS Code debugger, I always get the error connect ECONNREFUSED $ip:port.
I tried enabling the plug-in locally and it worked normally.
I also tried adding the ray-debugger-external flag and confirmed that the Ray Cluster on k8s can use the native debugger normally.
I don't know how to use the debugger plug-in in VS Code against a Ray Cluster on k8s. Can you provide relevant guidance or help?

Versions / Dependencies

Ray 2.23.0
Python 3.10.12

Reproduction script

Sample code from the guidance linked above.
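
For reference, a minimal sketch of the kind of reproduction script the linked guidance describes (the exact blog sample may differ, and whether RAY_DEBUG needs to be set depends on the Ray version; both are assumptions here):

    import ray

    # Depending on the Ray version, the new distributed debugger may need to
    # be enabled explicitly (e.g. via the RAY_DEBUG environment variable);
    # see the linked guidance for the exact setup.
    ray.init()

    @ray.remote
    def my_task(x):
        breakpoint()  # pauses the task so the VS Code debugger can attach
        return x * 2

    print(ray.get(my_task.remote(21)))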

Issue Severity

High: It blocks me from completing my task.

yx367563 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on May 24, 2024
@yx367563 (Author)

Or do I need to configure launch.json in VS Code?

anyscalesam added the enhancement (Request for new feature and/or capability) label and removed the bug label on May 24, 2024
anyscalesam added the dashboard (Issues specific to the Ray Dashboard) label on Jun 3, 2024
brycehuang30 added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Jun 7, 2024
brycehuang30 self-assigned this on Jun 7, 2024
@rasmus-unity

I think the problem is that the Ray debugger uses a random port, so it's not possible to know ahead of time which port to open when running on Kubernetes.

From https://github.com/ray-project/ray/blob/master/python/ray/util/debugpy.py:

    def _ensure_debugger_port_open_thread_safe():
        (...)
        (host, port) = debugpy.listen(
            (ray._private.worker.global_worker.node_ip_address, 0)
        )

And from the definition of listen() in https://github.com/microsoft/debugpy/blob/main/src/debugpy/public_api:

This may be different from address if port was 0 in the latter, in which case the adapter will pick some unused ephemeral port to listen on.

In our case we're running ephemeral Ray clusters using the RayJob resource definition from KubeRay, so we could specify a single port. In the case of static Ray clusters, could a port range be a solution?
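
For illustration, a minimal standalone sketch (not Ray's actual code) of what port 0 means here; the 0.0.0.0 host is just an example:

    import debugpy

    # Port 0 asks the OS for an ephemeral port, which is what the Ray snippet
    # above does. The chosen port is only known after listen() returns, so it
    # cannot be declared ahead of time in a Kubernetes pod or Service spec.
    host, port = debugpy.listen(("0.0.0.0", 0))
    print(f"debugpy picked ephemeral port {port}")

Pinning a fixed, pre-opened port (or range) instead would avoid this, which is what the rest of the thread discusses.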

@anyscalesam (Contributor)

@brycehuang30 does the new distributed debugger have this capability? If it doesn't, I say we build forward and add this as a feature request there.

@brycehuang30 (Contributor)

The distributed debugger currently cannot customize the debugging ports. I think we could solve this in two steps:

  1. Let the debugger use only a fixed range of port numbers, e.g. 50000-51000. This allows users to open those ports in k8s.
  2. Enable a custom port range setting, so users can choose the port range.
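
As a rough, hypothetical sketch of step 1 (this is not Ray's implementation; the helper name and range are made up): pick the first free port in a fixed range before handing it to debugpy, so the whole range can be opened in the pod spec.

    import socket

    import debugpy

    def _first_free_port(host, start=50000, end=51000):
        # Probe the range for a bindable port. There is a small race window
        # between this check and debugpy.listen(), which a real implementation
        # would need to handle (e.g. by retrying on failure).
        for candidate in range(start, end):
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                try:
                    s.bind((host, candidate))
                    return candidate
                except OSError:
                    continue
        raise RuntimeError(f"no free debugger port in {start}-{end}")

    host = "0.0.0.0"
    debugpy.listen((host, _first_free_port(host)))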

@rasmus-unity

Thanks for looking into this. One more thing that needs to be considered when debugging a job running in Kubernetes is the IP address.

From Ray job log:

2024-10-03 11:07:18,021	INFO debugpy.py:66 -- Ray debugger is listening on 100.104.4.3:34983
2024-10-03 11:07:18,023	INFO debugpy.py:87 -- Waiting for debugger to attach...

That IP address 100.104.4.3 is internal to the Kubernetes cluster, so when trying to debug from VS Code I get a connection error.
(In this case 127.0.0.1:8265 is being port-forwarded from the Ray dashboard running in Kubernetes.)
[screenshot: VS Code connection error]

Possibly the VS Code debugger plugin should connect to the external IP address of the head node rather than the internal node IP address?
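
To illustrate the symptom, a small hypothetical reachability check from the developer machine (the address and port come from the log above; the port-forwarded localhost target is an assumption):

    import socket

    def reachable(host, port, timeout=3):
        # True if a plain TCP connection to host:port succeeds.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Pod-internal address from the worker log: not routable from a laptop.
    print(reachable("100.104.4.3", 34983))
    # The same debugger port forwarded to localhost (e.g. via kubectl
    # port-forward) would be reachable instead.
    print(reachable("127.0.0.1", 34983))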

@koenlek

koenlek commented Nov 30, 2024

I've run into an issue that seems very similar to this one. In fact, it might very well be the same issue.

I'm using Ray 2.30 and I get a connection refused error when I try to connect VS Code to the paused task. I noticed that debugpy on the task actually crashes soon after debugpy.listen(...) is called, so by the time I'm trying to connect VS Code, nothing is listening on the configured port anymore (the port printed in the Ray debugger is listening on <ip>:<port> log message).

  • I also tried Ray 2.39: same issue.
  • I tried patching Ray to make debugpy run on a fixed port, and/or on localhost/0.0.0.0 (combined with kubectl port forwarding): all to no avail. In all cases, the root issue seems to be that nothing is listening anymore on the port where debugpy is supposed to listen.
  • I tried running debugpy.listen on a k8s pod without Ray, and in that case it works fine. Using lsof I can see that something is listening on the configured listen port.
  • The underlying debugpy crash is hard to detect, apart from the fact that nothing is listening on the port. However, if you enable extra logging you can see it crash (BrokenPipeError) in the logs. I reported this issue in debugpy here (with details on how to find the crash message in debugpy.pydevd.NNNN.log): debugpy listen silently crashing microsoft/debugpy#1749
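
For anyone reproducing this, a minimal standalone sketch of that kind of check (the log directory and port are arbitrary examples, not Ray's defaults):

    import debugpy

    # Write verbose adapter/pydevd logs (debugpy.*.log, debugpy.pydevd.*.log)
    # to a known directory so a silent crash after listen() shows up there.
    debugpy.log_to("/tmp/debugpy_logs")

    host, port = debugpy.listen(("0.0.0.0", 5678))
    print(f"debugpy should now be listening on {host}:{port}")
    # On a healthy pod, `lsof -i :5678` shows a listener on this port; in the
    # failing case described above, the listener disappears shortly after
    # this point and the crash only shows up in the log files.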

pcmoritz added a commit that referenced this issue Dec 9, 2024
…#49116)

## Why are these changes needed?

This addresses #45541 and #49014

Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: angelinalg <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Dec 17, 2024
…ray-project#49116)
@rogerfydp

The distributed debugger currently cannot customize the debugging ports. I think we could solve this in two steps:

  1. Let the debugger use only a fixed range of port numbers, e.g. 50000-51000. This allows users to open those ports in k8s.
  2. Enable a custom port range setting, so users can choose the port range.

We are facing the same issue with our local Ray Cluster, but in our case behind docker-compose for local development/testing.

We were wondering whether the suggested solution (an optional parameter to specify debugpy ports) is still an option, or whether there is any other recommendation for overcoming the issue.

@rasmus-unity

We ended up deploying https://docs.linuxserver.io/images/docker-code-server/ inside the Kubernetes cluster, which can then access the necessary ports.

@Moonquakes

Moonquakes commented Jan 23, 2025

@rasmus-unity Thank you for sharing. Could you explain the specific setup steps?

I also noticed that there is a relevant PR (#49116). Can I assume that this requirement can be met by following that documentation? cc @brycehuang30

@rogerfydp

@rasmus-unity and @Moonquakes, thank you for your insights!

To test this, we created a Dockerfile based on the Ray images and installed the SSH server as mentioned in #49116, along with other necessary components. We were looking for a solution for agile local development and debugging, so we also ended up mounting the source code under development as volumes on the Ray head node and installing various tools, such as Devbox, that we need for development. This setup allowed us to develop directly on the Ray head node and use the Ray Distributed Debugger extension, but we believe it adds a lot of complexity, aside from installing otherwise unnecessary software on the ray-head node, that could potentially be avoided.

While this approach was useful and does the trick for us for the moment, we still believe that an out-of-the-box solution, without the need to install SSH servers and other dependencies, would be extremely valuable given the excellent Ray Distributed Debugger extension that is already provided. In our opinion, implementing a way to configure a range of ports for debugpy to listen on, as previously suggested by @brycehuang30, would greatly enhance the development experience.

@Moonquakes

Hi @rogerfydp, could you explain your setup steps and Dockerfile in more detail? I installed SSH according to the instructions in #49116 and opened port 22, but it seems there are other problems: KubeRay opens some ports by default when no ports are specified, but those defaults are not added if port 22 is added manually (https://github.com/ray-project/kuberay/blob/v1.2.2/ray-operator/controllers/ray/common/service.go#L409-L417).
