Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-ingest: fix cleanup when a pipeline worker process fails #5549

Merged
merged 2 commits into from
Nov 10, 2023

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Nov 10, 2023

This PR is another piece of the workaround for #5518.

If the buffer for a subprocess on the server side becomes full and therefore a write fails, libsubprocess currently treats this as fatal error, kills the subprocess and returns FLUX_SUBPROCESS_FAILED as subprocess state.

If this occurs for a job-ingest worker, then any futures still in the queue will never be fulfilled and job submissions will appear to hang.

This PR just adds a call to worker_unexpected_exit() for failed worker subprocesses so that any pending futures are fulfilled with error.

Problem: Some functions with long parameter lists are broken using
a block format, but most of the Flux codebase breaks parameters one
per line.

Break long parameter lists with one parameter per line.
Problem: When a subprocess being used as a worker in the job-ingest
pipeline fails, pending futures may not be fulfilled since the
subprocess is immediately destroyed.

Call worker_unexpected_exit() when a worker subprocess fails so that
any pending requests are failed. This completely cleans up the worker
so that no request hang and a new worker will be started, hopefully
resolving the problem instead of hanging the ingest module.
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! 💡 It hadn't quite dawned on me that this was necessary. Good work.

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Merging #5549 (05e1568) into master (f8558d0) will increase coverage by 0.00%.
The diff coverage is 40.00%.

@@           Coverage Diff           @@
##           master    #5549   +/-   ##
=======================================
  Coverage   83.43%   83.43%           
=======================================
  Files         487      487           
  Lines       82286    82287    +1     
=======================================
+ Hits        68657    68660    +3     
+ Misses      13629    13627    -2     
Files Coverage Δ
src/modules/job-ingest/worker.c 76.41% <40.00%> (-0.37%) ⬇️

... and 15 files with indirect coverage changes

@grondo
Copy link
Contributor Author

grondo commented Nov 10, 2023

Thanks! I've set MWP.

@mergify mergify bot merged commit aeac0a1 into flux-framework:master Nov 10, 2023
@grondo grondo deleted the ingest-worker-cleanup branch November 10, 2023 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants