-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace the use of a ReFrame template config file for a manually created one #850
Replace the use of a ReFrame template config file for a manually created one #850
Conversation
…ted one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app
Instance
|
Instance
|
@laraPPr I think you set
The only thing I would be curious about is if the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the CGROUP for this allocation), or 48. Maybe you can have a look at the generated topology file. Anyway, let me know :) |
Hmmm, so I tested this myself. I had the following config file:
Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.
In a way that's understandable: you don't know which socket you'll land on, so what should it put for the A way out is of course to define the full thing manually. It means we don't have the core layout - but that piece of information is unreliable anyway, since we don't know a-prioro on which core set our build job (which allocates 1/4 of a node) will land anyway. But, I could quite easily define:
|
The Pytorch test don't run when processor information is set in the config file |
And I'm affraid that we will be in the queue for ever waiting for a free node. |
I do already need this https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130 |
What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?
I'm confused how that's related to this change in the PR :D You mean your bot job doesn't get allocated because it is busy, i.e. you have trouble testing? |
Yes it takes very long to get an allocation but maybe in production we should just do a full node. But it could take 24 or more to get an allocation. Because now it starts quickly because I'm only asking for 1 GPU for half an hour. |
|
Can you paste the relevant part from your config describing the CPU topology? |
I think I tried setting
|
@casparvl I just discussed it with Kenneth and his idea was that we introduce another environment variable that would be preferred over what is set in the test_suite.sh script on this line: Line 220 in 274b2dd
Than a site could set different REFRAME_ARGS if they need?
|
That's actually not a bad idea either... But maybe we should do both. I.e. still go to a situation where we require system-specific ReFrame configs to be deployed when a bot is deployed, because there will always be things that are system specific. E.g. the fact that your partition has GPU support. Sure, we could do yet another form of 'autodetection' and say: if In that scenario, the manually created config file should describe the actual partition (i.e. you'd put To specifically tackle your issue: I think the num_cores_per_numa_node is an inferred quantity. I.e. it is probably computed from the
and
sections. I think there is also a
and
Since you have a quarter node, I guess you could try to set
And see if that's a manual config that would work. But honestly, Kenneth's idea might be better. It means we don't have to define these somewhat weird virtual partitions that match 1/4th of a node, and it means we can keep using CPU autodetection (which is something we recommend in our test suite manual as well, so probably good to stick to that ourselves too). |
…ot instance runs.
Crap, I realize one thing: if you set e.g. If we'd have a feature in the bot config that allows us to set additional environment variables per architecture, that would resolve it. Anyway, let's test this first. If it works, let's talk to @trz42 about such a feature, I don't think it should be too difficult. Note that this feature is different from the one implement by Sam Moors, which created support for specifying environment variables as part of the bot-build command (EESSI/eessi-bot-software-layer#281). That's not what we want here, as that would require the person commanding the bot to give the right scale, whereas this should just be a constant for a given bot instance - and thus be part of the bot configuration. |
This is if we would also start doing CPU builds at HPC sites? |
Yes indeed I would also keep the separate config. |
Well it's relevant for any case where your bot architectures do not all assign the same fraction of a node. If it's all 25% (GPU or CPU node, doesn't matter), we can grab that in one |
Anyway, you can test what I did in this PR, just set the |
Ok let's keep an eye on this one #842 (comment) |
@casparvl the test-suite failed with this error I think the cause it this pr.
update: found it I have to set |
test_suite.sh
Outdated
if [ ! -z "$EESSI_ACCELERATOR_TARGET" ]; then | ||
REFRAME_PARTITION_NAME=${REFRAME_PARTITION_NAME}_${EESSI_ACCELERATOR_TARGET//\//_} | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is failing at UGent the path was not added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@casparvl the last it failed with this error:
ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen3'
It did not add the gpu part to the GPU part to the system name that it is looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we gonna have to so something like here
Lines 158 to 159 in 5fdee78
export EESSI_ACCELERATOR_TARGET=$(cfg_get_value "architecture" "accelerator") | |
echo "bot/build.sh: EESSI_ACCELERATOR_TARGET='${EESSI_ACCELERATOR_TARGET}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing something now in #842
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it working by adding this in #842 https://github.com/laraPPr/software-layer/blob/bae19a2adc478471840e94040d7f9cf9a34eae51/test_suite.sh#L88-L107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's another change that was made to bot/build.sh
, but not propagated to bot/test.sh
. We didn't notice, because we weren't running the tests for GPUs :)
So, I'd say your solution is correct, but we should do it at the bot/test.sh
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: I now implemented that in this PR (but can't test myself since I don't have a bot :D)
test_suite.sh
Outdated
# Allow people deploying the bot to overrwide this | ||
if [ -z "$REFRAME_SCALE_TAG" ]; then | ||
REFRAME_SCALE_TAGS="--tag 1_node" | ||
fi | ||
if [ -z "$REFRAME_CI_TAG" ]; then | ||
REFRAME_CI_TAG="--tag CI" | ||
fi | ||
# Allow bot-deployers to add additional args through the environment | ||
if [ -z "$REFRAME_ADDITIONAL_ARGS" ]; then | ||
REFRAME_ADDITIONAL_ARGS="" | ||
fi | ||
export REFRAME_ARGS="${REFRAME_CI_TAG} ${REFRAME_SCALE_TAG} ${REFRAME_ADDITIONAL_ARGS} --nocolor ${REFRAME_NAME_ARGS}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems to have caused problems I'm now getting this error
Listing tests: reframe --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list
usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files]
[-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE]
[--report-junit FILE] [-s DIR] [--save-log-files]
[--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed]
[--gpu-only] [--maintainer PATTERN] [-n PATTERN] [-p PATTERN]
[-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR]
[--ci-generate FILE] [--describe] [-L [{C,T}]] [-l [{C,T}]]
[--list-tags] [-r] [--dry-run] [--disable-hook NAME]
[--duration TIMEOUT] [--exec-order ORDER]
[--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}]
[-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE]
[--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL]
[--skip-performance-check] [--skip-prgenv-check]
[--skip-sanity-check] [--skip-system-check] [-M MAPPING]
[-m MOD] [--module-mappings FILE] [--module-path PATH]
[--non-default-craype] [--purge-env] [-u MOD]
[--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...]
[--repeat N] [-C FILE] [--detect-host-topology [FILE]]
[--failure-stats] [--nocolor] [--performance-report]
[--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]
reframe: error: unrecognized arguments: --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS
^[[31mERROR: Failed to list ReFrame tests with command: reframe --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list^[[0m
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error is still their not sure what is causing it when I copy the command and test it I do not get the error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason all the indents were removed in the lines 219-223 testing now if that caused it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... I don't see those indents being removed? Also strange: it seems like a valid list of commands. If I just copy paste the whole thing, it lists the tests just fine. One thing that's missing is the scale tag, because of the typo I had earlier. That should be fixed now. But I don't see how that could cause this.
I have seen this earlier with things like e.g. reframe "${REFRAME_ARGS}"
because then the whole REFRAME_ARGS
is interpreted by the shell as a single argument (which then obviously doesn't exist). The error looks very similar though:
$ reframe "${REFRAME_ARGS}" --list
usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files] [-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE]
[--report-junit FILE] [-s DIR] [--save-log-files] [--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed] [--gpu-only]
[--maintainer PATTERN] [-n PATTERN] [-p PATTERN] [-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR] [--ci-generate FILE] [--describe]
[-L [{C,T}]] [-l [{C,T}]] [--list-tags] [-r] [--dry-run] [--disable-hook NAME] [--duration TIMEOUT] [--exec-order ORDER]
[--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}] [--flex-alloc-strict] [-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE]
[--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL] [--skip-performance-check] [--skip-prgenv-check] [--skip-sanity-check]
[--skip-system-check] [-M MAPPING] [-m MOD] [--module-mappings FILE] [--module-path PATH] [--non-default-craype] [--purge-env] [-u MOD]
[--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...] [--repeat N] [-C FILE] [--detect-host-topology [FILE]] [--failure-stats]
[--nocolor] [--performance-report] [--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]
reframe: error: unrecognized arguments: --tag CI --tag 1_node --nocolor -n OSU
I don't understand how you'd get that though, because ${REFRAME_ARGS}
isn't quoted in test_suite.sh
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, maybe retry now that the REFRAME_SCALE_TAGS
typo is corrected. See if that somehow helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you have this:
'features': [
FEATURES[GPU]
] + list(SCALES.keys()),
as features for your current partition? Or at least something that allows the 1_4_node
scale?
Because if I run
reframe --tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list
on Snellius with the standard ReFrame config file (i.e. the one under version control in the test-suite repo) I get the tests listed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you were able to recreate it locally, try with -vvvvv
, and see if you can figure out why things get filtered. To reduce the amount of output, you could limit the search path to the lammps test using the -c
argument (-c /kyukon/scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/tests/apps/lammps
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it works for me as well with
reframe --tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list
but if I do this than it doesn't and that is what is now done by the bot
reframe '--tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS' --list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna take a closer look at it next week and figure it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue with ReFrame because I have now idea what is happening and what might cause it to behave like this reframe-hpc/reframe#3369
Co-authored-by: Lara Ramona Peeters <[email protected]>
… as we need it to append the right path to the MODULEPATH for the test_suite.sh
New job on instance
|
Should I also add the gent config to |
Well, this PR supports both options, as I see no reason not to respect the environment variable if it is set. I personally have a preference for Up to you, I guess :) I would advice to keep the |
Ok than this one is ready for merging once the bot-configs one is merged? |
Once #850 (comment) is succesfull, I will remove the BCFTools from this PR - that was just a dummy to prove this thing works. Then, we can merge this PR and the PR to the bot-configs, in no particular order ;-) |
…nctionality of the test step
bot: show-config |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
bot: help |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
bot: show_config |
Updates by the bot instance
|
Updates by the bot instance
|
Updates by the bot instance
|
Instance
|
Instance
|
Instance
|
Ok, should be ready for a final review & merge now! |
@Neves-P Will this impact the dev.eessi.io? If so you might need to add a reframe config file to that bot as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks for the heads up! At the moment I think it should be fine because we aren't running the test command (most of the software to test isn't there yet). But from what I understand it's something to keep in mind when we do start running the test script on |
PR merged! Moved |
PR merged! Moved |
PR merged! Moved |
This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app.
@laraPPr I'll send you an example config file that should work with this PR. I'd be great if you can test it for me and let me know if this works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files and see if I can get it to work there...
WARNING: merging this PR will break any bot instance that has not set up a ReFrame config file manually and has set the
RFM_CONFIG_FILES
environment variable to point to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.