Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix enqueueing of Minion jobs breaking PARALLEL_ONE_HOST_ONLY=1 #6048

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Martchus
Copy link
Contributor

@Martchus Martchus commented Nov 6, 2024

When restarting openQA jobs, additional Minion jobs are enqueued (of
git_clone task, this is happening especially often when git_auto_clone
is enabled). So far this didn't happen in a transaction. So the scheduler
might see the openQA jobs but not the Minion jobs they are blocked by. This
is problematic because the scheduler might assign jobs too soon to a
worker. This is especially problematic if it does not happen consistently
across a parallel job cluster as it then breaks the
PARALLEL_ONE_HOST_ONLY=1 feature.

Related ticket: https://progress.opensuse.org/issues/169342

}

# create comments on original jobs
$result_source->schema->resultset('Comments')
->create_for_jobs(\@original_job_ids, $comment_text, $comment_user_id, $comments)
if defined $comment_text;

# enqueue Minion jobs to clone required Git repositories
OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would create a minion job for every job, only cluster jobs would be grouped together, right?
That would be way too many jobs in some cases.
The detection for identical git_clone tasks is not perfect, especially if they are created quickly after each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this leads to more Minion jobs attempted to be enqueued but I was hoping the code for de-duplicating Minion jobs you have recently introduced will ensure that we don't have too many after all.

Note that there will still only be one enqueuing attempt per job cluster. Only if one specified multiple jobs IDs explicitly (e.g. using #5971 when it gets merged) we would have more enqueuing attempts.

Copy link
Contributor Author

@Martchus Martchus Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise we really needed a "preparing" state.

I suppose it would work like this:

  1. We create jobs and now the initial state is "preparing" instead of "scheduled" keeping track of the job IDs.
  2. We enqueue Minion jobs.
  3. We set all jobs we haven't created Minion jobs for in step 2 to "scheduled" immediately.
  4. Before deleting GRU tasks we set related jobs to "scheduled". (This would still not fix the scheduling problem when there's a different set of e.g. download jobs within a parallel cluster - unless we take job dependencies into account here.)

Step 2 and 3 would happen in one transaction. Step 4 would also happen in one transaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I suppose I can fix the scheduler first while we think what's best here. With the scheduler fixed we still have this race condition but at least job clusters aren't torn apart.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for git_auto_update. Of course then git_auto_update would rely only on the periodic updates for restarted jobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand step 4. Maybe you can explain tomorrow after the daily?

OTOH we could try this PR and see how it works out in practice.

Theoretically we could find out if any new minion job is required by keeping track of the git directories in the %git_clones hash while iterating over the jobs to restart. I guess in most cases we will only have one CASEDIR/NEEDLES_DIR or DISTRI.
Just that we would need to pass an additional %git_clones_not_yet_enqueued hash through from the toplevel job_restart, which doesn't sound nice.

Copy link
Contributor

@perlpunk perlpunk Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for git_auto_update. Of course then git_auto_update would rely only on the periodic updates for restarted jobs.

That was a requirement though. We did have a complaint from someone who restarted a job and was wondering why it wasn't using the updated code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I guess this should be:

Suggested change
OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
OpenQA::App->singleton->gru->enqueue_git_clones(\%git_clones, \@clone_ids);

Probably the reason for the test failures

Copy link
Contributor Author

@Martchus Martchus Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed introducing a new initial state as alternative when estimating https://progress.opensuse.org/issues/169510.

I guess there would be another alternative: Restart all jobs in one transaction (and also create all the Minion jobs in one transaction). Since we'd just append to the job, job settings and minion jobs tables I don't think this would be problematic (so I don't think it would cause deadlocks/conflicts). Then this route would also get a nice "all or nothing" behavior.

Doing all in one transaction would probably be simpler than introducing a new state and it is essentially what we already do when scheduling products (where also many jobs are created). Of course the scheduling has an async mode where it runs as Minion job so it is a bit different.

Creating a new state means changing many places in the code. It is probably the less fragile approach, though (if implemented correctly / all bugs are ironed out).

Just merging this PR is of course also still on the table. I actually don't think it would be too problematic.

So I'm a bit torn here.

* Avoid having two times the same identical loop
* Use better variable names
When restarting openQA jobs, additional Minion jobs are enqueued (of
`git_clone` task, this is happening especially often when `git_auto_clone`
is enabled). So far this didn't happen in a transaction. So the scheduler
might see the openQA jobs but not the Minion jobs they are blocked by. This
is problematic because the scheduler might assign jobs too soon to a
worker. This is especially problematic if it does not happen consistently
across a parallel job cluster as it then breaks the
`PARALLEL_ONE_HOST_ONLY=1` feature.

Related ticket: https://progress.opensuse.org/issues/169342
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants