Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BugFix: Abort on certain response codes and retry on others #387

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 13 additions & 1 deletion client/src/nv_ingest_client/client/client.py
Original file line number Diff line number Diff line change
Expand Up @@ -321,7 +321,19 @@ def _fetch_job_result(self, job_index: str, timeout: float = 100, data_only: boo
# Only pop once we know we've successfully decoded the response or errored out
_ = self._pop_job_state(job_index)
else:
raise TimeoutError(f"Timeout: No response within {timeout} seconds for job ID {job_index}")
# There are a plethora of codes that can be thrown. Some offer specific insights while
# others can be grouped into a general failure category. We check for specific codes here
# and then generally error on the others.
if response.response_code == 404:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we want to make the generic client aware of HTTP response codes. It's also used for Redis/SimpleBroker, and whatever else we may need to support in the future.

Maybe we just have recoverable vs non-recoverable response codes that it knows about and translate errors from the different client instances into those?

# job_id not found on serverside. This condition will not alleviate itself with a retry
raise RuntimeError(f"JobId: {job_state.job_id} not found - Reason: {response.response_reason}")
elif response.response_code == 500:
# properly propagated server side error
raise RuntimeError(f"Response: {response.response} - Reason: {response.response_reason}")
else:
# Generalized errors group. These errors are ones that could potentially be resolved and
# therefore should be retried. Existing logic works based on TimeoutErrors so we raise here.
raise TimeoutError(f"Response: {response.response} - Reason: {response.response_reason}")

except TimeoutError:
raise
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -209,7 +209,7 @@ def fetch_message(self, job_id: str, timeout: float = 10) -> ResponseSchema:
if response_code in _TERMINAL_RESPONSE_STATUSES:
# Terminal response code; return error ResponseSchema
return ResponseSchema(
response_code=1,
response_code=response_code,
response_reason=(
f"Terminal response code {response_code} received when fetching JobSpec: {job_id}"
),
Expand Down
Loading