Commit

Deployed 1610cb0 with MkDocs version: 1.1.2
Unknown committed Aug 3, 2024
1 parent 251a1fb commit b11d06b
Showing 3 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion materials/troubleshooting/part1-ex2-job-retry/index.html
@@ -1685,7 +1685,7 @@ <h2 id="bad-job">Bad Job<a class="headerlink" href="#bad-job" title="Permanent l
<p>How many of the jobs succeeded? How many failed?</p>
<h2 id="retrying-failed-jobs">Retrying Failed Jobs<a class="headerlink" href="#retrying-failed-jobs" title="Permanent link">&para;</a></h2>
<p>Now let’s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption.</p>
<p>From the lecture materials, implement the <code>max_retries</code> feature to retry any job with a non-zero exit code up to 5 times, then resubmit the jobs. Did your change work?</p>
<p>HTCondor has a feature named <a href="https://htcondor.readthedocs.io/en/latest/users-manual/automatic-job-management.html#automatically-rerunning-a-failed-job">max_retries</a> that automatically reruns a job that exits with a non-zero exit code, up to a set number of times. Try implementing this feature so that failed jobs are retried up to 5 times, then resubmit the jobs. Did your change work?</p>
<p>After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the <code>NumJobStarts</code> attribute of a completed job with the <code>condor_history</code> command, in the same way you looked at the <code>ExitCode</code> attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their <code>ExitCode</code>; and what about the <code>ExitCode</code> from earlier execution attempts?</p>
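The retry behaviour described above can be sketched as a small simulation. This is an illustration only, not HTCondor's implementation: `run_with_retries` and `fail_first` are hypothetical names, and the sketch assumes that a limit of 5 retries means up to 5 reruns after the first attempt, so a job that always fails is started 6 times. The second return value plays the role of `NumJobStarts`.

```python
def run_with_retries(job, max_retries=5):
    """Call job() until it returns exit code 0 or the retries run out.

    Returns (final_exit_code, num_starts). Hypothetical helper that only
    mimics the retry idea; it does not talk to HTCondor.
    """
    num_starts = 0
    while True:
        exit_code = job()
        num_starts += 1
        # Stop on success, or once the first attempt plus max_retries
        # reruns have all been used up.
        if exit_code == 0 or num_starts > max_retries:
            return exit_code, num_starts


def fail_first(n):
    """Build a fake job that exits 1 on its first n runs, then exits 0."""
    state = {"runs": 0}

    def job():
        state["runs"] += 1
        return 1 if state["runs"] <= n else 0

    return job


if __name__ == "__main__":
    # A job that fails twice before succeeding needs three starts in total.
    print(run_with_retries(fail_first(2)))
    # A job that never succeeds keeps its non-zero exit code.
    print(run_with_retries(lambda: 1))
```

Under these assumptions, a retried job that eventually succeeds ends with exit code 0 and a start count greater than 1, which is the pattern to look for when comparing `NumJobStarts` and `ExitCode` in the `condor_history` output.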
<h2 id="a-too-long-running-job">A (Too) Long Running Job<a class="headerlink" href="#a-too-long-running-job" title="Permanent link">&para;</a></h2>
<p>Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file:</p>
2 changes: 1 addition & 1 deletion search/search_index.json

Large diffs are not rendered by default.

Binary file modified sitemap.xml.gz
Binary file not shown.

0 comments on commit b11d06b
