Fix enqueueing of Minion jobs breaking PARALLEL_ONE_HOST_ONLY=1 #6048

Open, wants to merge 2 commits into base: master

6 changes: 0 additions & 6 deletions lib/OpenQA/Resource/Jobs.pm
@@ -7,7 +7,6 @@ use Mojo::Base -strict, -signatures;

 use OpenQA::Jobs::Constants;
 use OpenQA::Schema;
-use OpenQA::Utils qw(create_git_clone_list);
 use Exporter 'import';

 our @EXPORT_OK = qw(job_restart);
@@ -50,8 +49,6 @@ sub job_restart ($jobids, %args) {
     my $jobs_rs = $schema->resultset('Jobs');
     my $jobs = $jobs_rs->search({id => $jobids, state => {'not in' => [PRISTINE_STATES]}});
     $duplication_args{no_directly_chained_parent} = 1 unless $force;
-    my %clones;
-    my @clone_ids;
     while (my $job = $jobs->next) {
         my $job_id = $job->id;
         my $missing_assets = $job->missing_assets;
@@ -76,18 +73,15 @@ sub job_restart ($jobids, %args) {

         my $cloned_job_or_error = $job->auto_duplicate(\%duplication_args);
         if (ref $cloned_job_or_error) {
-            create_git_clone_list($job->settings_hash, \%clones);
             push @duplicates, $cloned_job_or_error->{cluster_cloned};
             push @comments, @{$cloned_job_or_error->{comments_created}};
-            push @clone_ids, $cloned_job_or_error->{cluster_cloned}->{$job_id};
         }
         else {
             $res{enforceable} = 1 if index($cloned_job_or_error, 'Direct parent ') == 0;
             push @errors, ($cloned_job_or_error // "An internal error occurred when duplicating $job_id");
         }
         push @processed, $job_id;
     }
-    OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids) if keys %clones;

     # abort running jobs
     return \%res if $args{skip_aborting_jobs};

22 changes: 14 additions & 8 deletions lib/OpenQA/Schema/Result/Jobs.pm
@@ -11,7 +11,7 @@ use DateTime;
 use OpenQA::Constants qw(WORKER_COMMAND_ABORT WORKER_COMMAND_CANCEL);
 use OpenQA::Log qw(log_trace log_debug log_info log_warning log_error);
 use OpenQA::Utils (
-    qw(parse_assets_from_settings locate_asset),
+    qw(create_git_clone_list parse_assets_from_settings locate_asset),
     qw(resultdir assetdir read_test_modules find_bugref random_string),
     qw(run_cmd_with_log_return_error needledir testcasedir gitrepodir find_video_files)
 );
@@ -693,19 +693,25 @@ sub _create_clones ($self, $jobs, $comments, $comment_text, $comment_user_id, @c
         $res->register_assets_from_settings;
     }

-    # calculate blocked_by
-    $clones{$_}->calculate_blocked_by for @original_job_ids;
-
-    # add a reference to the clone within $jobs
-    for my $job (@original_job_ids) {
-        my $clone = $clones{$job};
-        $jobs->{$job}->{clone} = $clone->id if $clone;
+    my %git_clones;
+    my @clone_ids;
+    for my $original_job_id (@original_job_ids) {
+        my $cloned_job = $clones{$original_job_id};
+        # calculate blocked_by
+        $cloned_job->calculate_blocked_by;
+        # add a reference to the clone within $jobs
+        push @clone_ids, $jobs->{$original_job_id}->{clone} = $cloned_job->id;
+        # add Git repositories to clone
+        create_git_clone_list($cloned_job->settings_hash, \%git_clones);
     }

     # create comments on original jobs
     $result_source->schema->resultset('Comments')
       ->create_for_jobs(\@original_job_ids, $comment_text, $comment_user_id, $comments)
       if defined $comment_text;
+
+    # enqueue Minion jobs to clone required Git repositories
+    OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);

Contributor

This would create a Minion job for every job; only the jobs of a cluster would be grouped together, right?
That would be way too many jobs in some cases.
The detection of identical git_clone tasks is not perfect, especially if they are created in quick succession.

Contributor Author

I know this leads to more attempts to enqueue Minion jobs, but I was hoping that the de-duplication code you recently introduced would keep the actual number of Minion jobs from growing too large.

Note that there will still be only one enqueuing attempt per job cluster. We would only have more enqueuing attempts if multiple job IDs were specified explicitly (e.g. using #5971 once it is merged).

Contributor Author @Martchus, Nov 6, 2024

Otherwise we would really need a "preparing" state.

I suppose it would work like this:

  1. We create jobs with the initial state "preparing" instead of "scheduled", keeping track of the job IDs.
  2. We enqueue Minion jobs.
  3. We set all jobs we haven't created Minion jobs for in step 2 to "scheduled" immediately.
  4. Before deleting GRU tasks we set related jobs to "scheduled". (This would still not fix the scheduling problem when there's a different set of e.g. download jobs within a parallel cluster - unless we take job dependencies into account here.)

Steps 2 and 3 would happen in one transaction. Step 4 would also happen in one transaction.
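
The four steps above can be sketched with plain in-memory hashes. This is only an illustration of the proposed state flow, not openQA code: the "preparing" state does not exist yet, and `create_jobs`, `enqueue_and_schedule` and `finish_task` are hypothetical names. Transactions and job dependencies (the caveat in step 4) are left out.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the jobs table and the GRU tasks.
my %job_state;    # job ID  => state
my %task_jobs;    # task ID => [job IDs waiting for that task]
my $next_task_id = 1;

# Step 1: create jobs in the (hypothetical) "preparing" state.
sub create_jobs {
    my (@job_ids) = @_;
    $job_state{$_} = 'preparing' for @job_ids;
    return @job_ids;
}

# Steps 2 and 3 (one transaction in the proposal): enqueue one task for the
# jobs that need a Git clone and schedule all other jobs immediately.
sub enqueue_and_schedule {
    my ($job_ids, $needs_clone) = @_;
    my @waiting = grep { $needs_clone->{$_} } @$job_ids;
    my $task_id;
    if (@waiting) {
        $task_id = $next_task_id++;
        $task_jobs{$task_id} = [@waiting];
    }
    $job_state{$_} = 'scheduled' for grep { !$needs_clone->{$_} } @$job_ids;
    return $task_id;
}

# Step 4: before the task is deleted, release the jobs that waited for it.
sub finish_task {
    my ($task_id) = @_;
    my $job_ids = delete($task_jobs{$task_id}) // [];
    $job_state{$_} = 'scheduled' for @$job_ids;
    return scalar @$job_ids;
}

my @ids     = create_jobs(1, 2, 3);
my $task_id = enqueue_and_schedule(\@ids, {2 => 1});
print join(', ', map { "$_=$job_state{$_}" } @ids), "\n";    # job 2 is still "preparing"
finish_task($task_id);
print join(', ', map { "$_=$job_state{$_}" } @ids), "\n";    # all jobs are "scheduled"
```

In this sketch a job can only leave "preparing" through step 3 or step 4, which is the property the scheduler would rely on.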

Contributor Author

But I suppose I can fix the scheduler first while we think about what's best here. With the scheduler fixed we still have this race condition but at least job clusters aren't torn apart.

Contributor Author

For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for git_auto_update. Of course then git_auto_update would rely only on the periodic updates for restarted jobs.

Contributor

I don't understand step 4. Maybe you can explain tomorrow after the daily?

OTOH we could try this PR and see how it works out in practice.

Theoretically we could find out whether any new Minion job is required by keeping track of the Git directories in the %git_clones hash while iterating over the jobs to restart. I guess in most cases we will only have one CASEDIR/NEEDLES_DIR or DISTRI.
The catch is that we would need to pass an additional %git_clones_not_yet_enqueued hash down from the top-level job_restart, which doesn't sound nice.
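
A minimal sketch of that bookkeeping, assuming the clone list is a hash of path => URL as described in the comment in `enqueue_git_clones`. The helper name is borrowed from the hash name mentioned above; the function itself, the `%seen` hash and the example paths and URLs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given the path => URL hash collected for one job
# cluster and a hash of everything handed over so far, return only the
# entries that still need a Minion job and remember them in $seen.
sub git_clones_not_yet_enqueued {
    my ($git_clones, $seen) = @_;
    my %pending;
    for my $path (keys %$git_clones) {
        my $url = $git_clones->{$path};
        my $key = join "\0", $path, $url // '';    # tolerate a missing URL
        next if $seen->{$key}++;
        $pending{$path} = $url;
    }
    return \%pending;
}

# %seen would have to be passed down from the top-level job_restart.
my %seen;
my $first = git_clones_not_yet_enqueued({'/tests/distri' => 'https://example.com/distri.git'}, \%seen);
my $second = git_clones_not_yet_enqueued(
    {'/tests/distri' => 'https://example.com/distri.git', '/tests/needles' => 'https://example.com/needles.git'},
    \%seen);
print scalar(keys %$first), " new, then ", scalar(keys %$second), " new\n";
```

Only the second cluster's needles directory is new in the second call, so at most one additional Minion job would be needed there.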

Contributor @perlpunk, Nov 6, 2024

> For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for git_auto_update. Of course then git_auto_update would rely only on the periodic updates for restarted jobs.

That was a requirement though. We did have a complaint from someone who restarted a job and was wondering why it wasn't using the updated code.

Contributor

btw, I guess this should be:

Suggested change
-OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
+OpenQA::App->singleton->gru->enqueue_git_clones(\%git_clones, \@clone_ids);

Probably the reason for the test failures
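
To illustrate why the two hashes are not interchangeable: in `_create_clones`, `%clones` is indexed by original job ID, while `%git_clones` is what `create_git_clone_list` fills and what `enqueue_git_clones` documents as path => URL. The contents below are made up, and `is_clone_list` is a hypothetical check that assumes absolute paths as keys.

```perl
use strict;
use warnings;

# Stand-ins for the two hashes in _create_clones (contents are made up).
my %clones     = (1001 => {id => 2001}, 1002 => {id => 2002});             # original job ID => cloned job
my %git_clones = ('/path/to/tests' => 'https://example.com/tests.git');    # path => Git URL

# Hypothetical sanity check: every key must look like an absolute path.
sub is_clone_list {
    my ($hash) = @_;
    return !grep { !m{^/} } keys %$hash;
}

print 'clones: ',     (is_clone_list(\%clones)     ? 'ok' : 'not a path => URL hash'), "\n";
print 'git_clones: ', (is_clone_list(\%git_clones) ? 'ok' : 'not a path => URL hash'), "\n";
```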

Contributor Author @Martchus, Jan 22, 2025

We discussed introducing a new initial state as an alternative when estimating https://progress.opensuse.org/issues/169510.

I guess there would be another alternative: Restart all jobs in one transaction (and also create all the Minion jobs in one transaction). Since we'd just append to the job, job settings and minion jobs tables I don't think this would be problematic (so I don't think it would cause deadlocks/conflicts). Then this route would also get a nice "all or nothing" behavior.

Doing everything in one transaction would probably be simpler than introducing a new state, and it is essentially what we already do when scheduling products (where many jobs are created as well). Of course the scheduling has an async mode where it runs as a Minion job, so it is a bit different.

Creating a new state means changing many places in the code. It is probably the less fragile approach, though (if implemented correctly / all bugs are ironed out).

Just merging this PR is of course also still on the table. I actually don't think it would be too problematic.

So I'm a bit torn here.

 }

 # internal (recursive) function for duplicate - returns hash of all jobs in the

2 changes: 1 addition & 1 deletion lib/OpenQA/Shared/Plugin/Gru.pm
@@ -249,7 +249,7 @@ sub enqueue_git_update_all ($self) {
 }

 sub enqueue_git_clones ($self, $clones, $job_ids) {
-    return unless %$clones;
+    return unless keys %$clones;
     return unless OpenQA::App->singleton->config->{'scm git'}->{git_auto_clone} eq 'yes';
     # $clones is a hashref with paths as keys and git urls as values
     # $job_id is used to create entries in a related table (gru_dependencies)

5 changes: 3 additions & 2 deletions t/api/04-jobs.t
@@ -62,6 +62,7 @@ $ENV{MOJO_MAX_MESSAGE_SIZE} = 207741824;

 my $t = client(Test::Mojo->new('OpenQA::WebAPI'));
 my $cfg = $t->app->config;
+$cfg->{'scm git'}->{git_auto_clone} = 'no';
 $cfg->{'scm git'}->{git_auto_update} = 'no';
 is $cfg->{audit}->{blocklist}, 'job_grab', 'blocklist updated';

@@ -1614,7 +1615,7 @@ subtest 'handle FOO_URL' => sub {
 };

 subtest 'handle git_clone with CASEDIR' => sub {
-    OpenQA::App->singleton->config->{'scm git'}->{git_auto_clone} = 'yes';
+    $cfg->{'scm git'}->{git_auto_clone} = 'yes';
     $testsuites->create(
         {
             name => 'handle_foo_casedir',
@@ -1645,7 +1646,7 @@ subtest 'handle git_clone with CASEDIR' => sub {
 };

 subtest 'handle git_clone without CASEDIR' => sub {
-    OpenQA::App->singleton->config->{'scm git'}->{git_auto_update} = 'yes';
+    $cfg->{'scm git'}->{git_auto_update} = 'yes';
     $testsuites->create(
         {
             name => 'handle_git_clone',