[Ray Distributed Debugger] Unable to use debugger on Ray Cluster on k8s: debugpy.listen(...) crashes silently briefly after being called #49014
Comments
Cc: @brycehuang30 |
Thanks for submitting this report -- do you have some more information about your environment? I tried to repro it with |
Thanks for trying to reproduce! When I run it directly on a k8s pod it works fine. When I run it in a ray task (on k8s), it breaks. Have you tried running it in a ray task? |
Thanks, I also have trouble reproducing it when running in a ray task:

(base) ray@raycluster-sample-head-nq2z8:~$ cat test.py
import ray

@ray.remote
def f():
    import os
    import socket
    import glob
    import debugpy
    import time

    DEBUGPY_LOGFILE_PATTERN = "/tmp/debugpy.*.log"

    def is_port_in_use(port: int) -> bool:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            return s.connect_ex(("localhost", port)) == 0

    def start_and_check_debugpy() -> None:
        clean_debugpy_log_files()  # clean any leftovers from previous test runs
        os.environ["DEBUGPY_LOG_DIR"] = "/tmp"
        PORT = 5678  # can be a random port too.
        print(f"before assert {PORT} is free")
        assert is_port_in_use(PORT) == False, f"Port {PORT} is in use, but should not."
        print(f"after assert {PORT} is free")
        print(f"before listen {PORT}")
        debugpy.listen(PORT)
        print(f"after listen {PORT}")
        # Seems like it takes some time for the listener to crash in our troubled remote case
        time.sleep(10)
        print_debugpy_log_files()
        print(f"before assert {PORT} is in use")
        assert is_port_in_use(PORT) == True, f"Port {PORT} is not in use, but should be."
        print(f"after assert {PORT} is in use")

    def clean_debugpy_log_files() -> None:
        files = glob.glob(DEBUGPY_LOGFILE_PATTERN)
        for file_path in files:
            os.remove(file_path)
            print(f"Deleted: {file_path}")

    def print_debugpy_log_files() -> None:
        files = glob.glob(DEBUGPY_LOGFILE_PATTERN)
        print(f"Printing contents of {DEBUGPY_LOGFILE_PATTERN} files")
        for file_path in files:
            with open(file_path) as file:
                contents = file.read()
                print(f"Filename: {file_path}")
                print("Contents:")
                print(contents)
                print("-" * 40)

    start_and_check_debugpy()

ray.get(f.remote())

and then
I also ran it with Ray 2.39 like you did -- if you are able to repro it in Same result when I do |
I got the debugger working end-to-end on KubeRay btw and documented the steps to run it here #49116 If the problem persists, please describe your environment in more detail :) |
Thanks, this is very helpful! I'll try if we can get it working with |
…#49116)
## Why are these changes needed?
This addresses #45541 and #49014
Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: angelinalg <[email protected]>
What happened + What you expected to happen
I've been trying to use Ray Distributed Debugger on a Ray Cluster on k8s, but so far without luck. After doing some digging I found out that the `debugpy.listen(...)` call in the task silently fails (note: `debugpy.listen` is injected as part of the task hitting its first `breakpoint()`, as implemented here).

The main consequence is that I'm unable to connect to the task that's waiting for a client to connect (`ECONNREFUSED`). When I ssh to the client and enable extra debugpy logs (see repro script), I can see that nothing is bound to the port that it's supposed to be listening on and that the debugpy logs contain a `BrokenPipeError`. From my investigation, it seems that `BrokenPipeError` is the easiest way to determine if things are healthy or broken.

Related tickets:
Versions / Dependencies
Ray 2.39.0, debugpy 1.8.8, Ubuntu 20.04.6
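Since the maintainers asked for more environment details, here is a minimal sketch for collecting the versions above from inside the task or pod where the failure occurs. The helper name `collect_versions` is hypothetical and not part of Ray or debugpy; it only uses the standard library and reports a package as "not installed" if its metadata can't be found.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("ray", "debugpy")):
    """Return interpreter, OS, and installed package versions as a dict."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package has no installed distribution in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running this inside a Ray task as well as directly on the pod would show whether the two environments actually resolve the same packages.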
Reproduction script
This demo doesn't need a `breakpoint()` or `RAY_DEBUG=1` or so. It just sets up debugpy listening and shows that it crashes silently with `BrokenPipeError`, leaving the listening port unbound. The same issue also occurs when using the actual Ray Distributed Debugger (random port, bound to external IP); this example is just more minimal/isolated.

How to use:
Call `start_and_check_debugpy()` in a single-task Ray job (single, as I set a fixed listen port, to avoid multiple tasks from trying to bind the same port).

I can see that nothing is bound to the port (the assert will fail) and that `BrokenPipeError` is in the opened `/tmp/debugpy.pydevd.*.log` file.

When I run the same code locally or on the k8s pod outside of Ray, things work fine.
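The health check described above (looking for `BrokenPipeError` in the debugpy logs) could be automated roughly as follows. This is a sketch, assuming the logs are written to `/tmp` via `DEBUGPY_LOG_DIR` as in the repro; the function name `find_broken_pipe_logs` is made up for illustration and is not part of debugpy or Ray.

```python
import glob


def find_broken_pipe_logs(pattern="/tmp/debugpy.*.log"):
    """Map each matching log file to its lines that mention BrokenPipeError.

    Files without any such line are left out, so an empty dict means
    no log file reported a broken pipe.
    """
    hits = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as fh:
            lines = [line.rstrip("\n") for line in fh if "BrokenPipeError" in line]
        if lines:
            hits[path] = lines
    return hits


if __name__ == "__main__":
    results = find_broken_pipe_logs()
    if not results:
        print("No BrokenPipeError found in debugpy logs.")
    for path, lines in results.items():
        print(f"{path}: {len(lines)} matching line(s)")
        for line in lines:
            print(f"  {line}")
```

Calling this after the `time.sleep(10)` in the repro, instead of printing whole log files, keeps the task output short while still showing whether the listener died.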
Issue Severity
High: It blocks me from completing my task.