Replace the use of a ReFrame template config file for a manually created one #850

casparvl · 2025-01-13T17:06:01Z

This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app.

@laraPPr I'll send you an example config file that should work with this PR. It'd be great if you could test it for me and let me know whether it works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files and see if I can get it to work there...

WARNING: merging this PR will break any bot instance for which a ReFrame config file has not been set up manually, with the RFM_CONFIG_FILES environment variable pointing to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.
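A minimal pre-flight sketch of what "set up correctly" could mean for a bot deployer, assuming RFM_CONFIG_FILES holds a colon-separated list of config file paths. The helper name `missing_reframe_configs` is hypothetical and not part of the bot or of ReFrame.

```python
import os
from pathlib import Path


def missing_reframe_configs(env=None):
    """Return a list of problems with RFM_CONFIG_FILES; an empty list means it looks usable."""
    env = os.environ if env is None else env
    value = env.get("RFM_CONFIG_FILES", "")
    if not value:
        return ["RFM_CONFIG_FILES is not set"]
    # Assumption: multiple config files are separated by colons
    return [
        f"config file not found: {path}"
        for path in value.split(":")
        if path and not Path(path).is_file()
    ]


if __name__ == "__main__":
    for problem in missing_reframe_configs():
        print(problem)
```

Running this in the same session that starts the bot app (and, separately, inside a job) would show whether the variable is visible in both places.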

eessi-bot · 2025-01-13T17:06:05Z

Instance eessi-bot-mc-aws is configured to build for:

architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot · 2025-01-13T17:06:07Z

Instance eessi-bot-mc-azure is configured to build for:

architectures: x86_64/amd/zen4
repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

casparvl · 2025-01-13T17:20:11Z

@laraPPr I think if you set RFM_CONFIG_FILES to point to the file below (in the shell session running the bot app), this _should_ work for you:

# reframe_config_bot.py

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_amd_zen3_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'access': ['--export=None', '--nodes=1', '--cluster=accelgor', '--ntasks-per-node=12', '--gpus-per-node=1'],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': __MEM_PER_NODE__,  # template placeholder: replace with the value for your nodes
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}

The only thing I would be curious about is if the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the CGROUP for this allocation), or 48. Maybe you can have a look at the generated topology file.

Anyway, let me know :)

casparvl · 2025-01-14T12:19:12Z

Hmmm, so I tested this myself. I had the following config file:

$ cat example_reframe_config.py
# WARNING: this file is intended as template and the __X__ template variables need to be replaced
# before it can act as a configuration file
# Once replaced, this is a config file for running tests after the build phase, by the bot

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_intel_icelake_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    # Suppose that we have configured the bot with
                    # slurm_params = --hold --nodes=1 --export=None --time=0:30:0
                    # arch_target_map = {
                    #     "linux/x86_64/amd/zen3" : "--partition=gpu --ntasks-per-node=12 --gpus-per-node 1" }
                    # We would specify the relevant parameters as access flags:
                    'access': ['--export=None', '--nodes=1', '--partition=gpu_a100', '--ntasks-per-node=18', '--gpus-per-node=1' ],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': 491520,
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}

Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.

...
    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],
...
...
  "num_cpus": 72,
  "num_cpus_per_core": 1,
  "num_cpus_per_socket": 36,
  "num_sockets": 2
}

In a way that's understandable: you don't know which socket you'll land on, so what should it put for the sockets field: "0x000000000fffffffff" or "0xfffffffff000000000"? That would depend on which part of the node your job happens to land on.
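To make the two masks concrete, here is a small sketch that decodes them, assuming each set bit stands for one CPU id. The helper name `cpus_in_mask` is made up for illustration.

```python
def cpus_in_mask(mask):
    """Return the sorted CPU ids whose bits are set in a hex mask string."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


sockets = ["0x000000000fffffffff", "0xfffffffff000000000"]
for mask in sockets:
    cpus = cpus_in_mask(mask)
    print(f"{mask}: {len(cpus)} CPUs, ids {cpus[0]}-{cpus[-1]}")
# 0x000000000fffffffff: 36 CPUs, ids 0-35
# 0xfffffffff000000000: 36 CPUs, ids 36-71
```

Each mask covers one 36-core socket, which matches the num_cpus_per_socket and num_cpus values in the autodetected output.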

A way out is of course to define the full thing manually. It means we don't have the core layout, but that piece of information is unreliable anyway, since we don't know a priori on which core set our build job (which allocates 1/4 of a node) will land. But I could quite easily define:

{
    "num_cpus": 18,
    "num_cpus_per_core": 1,
    "num_cpus_per_socket": 18,
    "num_sockets": 1
}
manually. We'd have to check that the tests don't request any information beyond this, but I think (at least for now) they don't.

Anyway, unless your bot is allocating full nodes, we should probably turn off CPU autodetection and specify CPU topology manually in the ReFrame config file...
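As a rough sketch of where such a manual topology would go: the ReFrame configuration reference describes a partition-level 'processor' entry, so the quarter-node numbers above could be placed there. This is untested, and it assumes that ReFrame skips autodetection for a partition that already carries processor information.

```python
# Values copied from the quarter-node example above
manual_processor = {
    'num_cpus': 18,
    'num_cpus_per_core': 1,
    'num_cpus_per_socket': 18,
    'num_sockets': 1,
}

partition = {
    'name': 'x86_64_intel_icelake_nvidia_cc80',
    'scheduler': 'local',
    'launcher': 'mpirun',
    'processor': manual_processor,
    # ... remaining partition settings as in the config file above ...
}

# Basic consistency check on the hand-written numbers
assert manual_processor['num_cpus'] == (
    manual_processor['num_sockets'] * manual_processor['num_cpus_per_socket']
)
```

Tests that ask for derived quantities may need more fields than these four, so this fragment is a starting point rather than a complete description.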

laraPPr · 2025-01-14T12:22:46Z

The PyTorch tests don't run when processor information is set in the config file.

laraPPr · 2025-01-14T12:23:59Z

And I'm afraid that we will be in the queue forever waiting for a free node.

laraPPr · 2025-01-14T12:26:23Z

I do already need this https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130
So maybe we should do something similar for the reframe command?

casparvl · 2025-01-14T13:13:36Z

> The PyTorch tests don't run when processor information is set in the config file.

What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

> And I'm afraid that we will be in the queue forever waiting for a free node.

I'm confused how that's related to this change in the PR :D You mean your bot job doesn't get allocated because it is busy, i.e. you have trouble testing?

laraPPr · 2025-01-14T14:08:47Z

> I'm confused how that's related to this change in the PR :D You mean your bot job doesn't get allocated because it is busy, i.e. you have trouble testing?

Yes, it takes very long to get an allocation. Maybe in production we should just do a full node, but it could take 24 or more to get an allocation, whereas now it starts quickly because I'm only asking for 1 GPU for half an hour.

laraPPr · 2025-01-14T14:26:45Z

> What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

Reason: attribute error: ../../../../../../../scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).

    raise AttributeError(msg)

casparvl · 2025-01-15T09:56:18Z

Can you paste the relevant part from your config describing the CPU topology?

laraPPr · 2025-01-15T14:34:08Z

I think I tried setting num_cores_per_numa_node but I did not really know what to put there.

                        'num_cpus': 96,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 48,
                        'num_cpus_per_core': 1,
                        'arch': 'zen2',

laraPPr · 2025-01-15T15:53:58Z

@casparvl I just discussed it with Kenneth and his idea was that we introduce another environment variable that would be preferred over what is set in the test_suite.sh script on this line:

software-layer/test_suite.sh

Line 220 in 274b2dd

export REFRAME_ARGS="--tag CI --tag 1_node --nocolor ${REFRAME_NAME_ARGS}"

Then a site could set different REFRAME_ARGS if they need to?

casparvl · 2025-01-16T09:43:20Z

That's actually not a bad idea either... But maybe we should do both. I.e. still go to a situation where we require system-specific ReFrame configs to be deployed when a bot is deployed, because there will always be things that are system specific. E.g. the fact that your partition has GPU support. Sure, we could do yet another form of 'autodetection' and say: if nvidia-smi exists and returns some GPU, we should add the GPU feature to your config (with a replacement in the current template config file). But I think this just gets messy fast. What if you want to add the feature 'ALWAYS_REQUEST_GPUS', because that's needed on your system? Or if you need to do additional initialization commands?

In that scenario, the manually created config file should describe the actual partition (i.e. you'd put devices: 4 for a 4-GPU node, rather than the 1 that I tried). That way, you could still use CPU autodetection (which detects the topology for a full node).

To specifically tackle your issue: I think the num_cores_per_numa_node is an inferred quantity. I.e. it is probably computed from the

  "topology": {
    "numa_nodes": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],

and

  "num_cpus": 72,

sections. I think there is also a num_cores_per_socket, which is probably inferred from

    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],

and

   "num_cpus": 72,

Since you have a quarter node, I guess you could try to set

  "topology": {
    "numa_nodes": [
      "0x000000000000000000",
    ],
...
"num_cpus": 24  # in your case, I think this is a quarter node

And see if that's a manual config that would work.
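A small sketch of the inference described above, under the assumption that the cores-per-NUMA-node value is simply the number of physical cores divided by the number of NUMA-node masks listed. Both helper names are hypothetical. A mask with no bits set covers no CPUs, so a mask with the lowest 24 bits set may be closer to what a 24-CPU quarter node needs; whether ReFrame accepts either variant is untested here.

```python
def mask_for_first_n_cpus(n, width=18):
    """Hex mask with the lowest n bits set, zero-padded to `width` hex digits."""
    return f"0x{(1 << n) - 1:0{width}x}"


def cores_per_numa_node(num_cpus, num_cpus_per_core, numa_nodes):
    """Assumed derivation: physical cores spread evenly over the listed NUMA nodes."""
    if not numa_nodes:
        return None  # nothing to infer from, which would surface as "missing"
    return (num_cpus // num_cpus_per_core) // len(numa_nodes)


quarter_node = {
    "num_cpus": 24,
    "num_cpus_per_core": 1,
    "topology": {"numa_nodes": [mask_for_first_n_cpus(24)]},
}
print(quarter_node["topology"]["numa_nodes"])  # ['0x000000000000ffffff']
print(cores_per_numa_node(24, 1, quarter_node["topology"]["numa_nodes"]))  # 24
```

With no 'topology' entry at all, the second helper returns None, which would be consistent with the "num_cores_per_numa_node missing" error reported earlier in the thread.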

But honestly, Kenneth's idea might be better. It means we don't have to define these somewhat weird virtual partitions that match 1/4th of a node, and it means we can keep using CPU autodetection (which is something we recommend in our test suite manual as well, so probably good to stick to that ourselves too).

…ot instance runs.

casparvl · 2025-01-16T09:54:49Z

Crap, I realize one thing: if you set e.g. REFRAME_SCALE_TAG="--tag 1_4_node" in the session running the bot instance, this will apply to all of your jobs. What if you have a bot instance that assigns quarter GPU nodes, but full CPU nodes for building?

If we had a feature in the bot config that allows us to set additional environment variables per architecture, that would resolve it. Anyway, let's test this first. If it works, let's talk to @trz42 about such a feature; I don't think it should be too difficult. Note that this feature is different from the one implemented by Sam Moors, which added support for specifying environment variables as part of the bot-build command (EESSI/eessi-bot-software-layer#281). That's not what we want here, as that would require the person commanding the bot to give the right scale, whereas this should just be a constant for a given bot instance and thus be part of the bot configuration.

laraPPr · 2025-01-16T10:38:31Z

> Crap, I realize one thing: if you set e.g. REFRAME_SCALE_TAG="--tag 1_4_node" in the session running the bot instance, this will apply to all of your jobs. What if you have a bot instance that assigns quarter GPU nodes, but full CPU nodes for building?

This is if we would also start doing CPU builds at HPC sites?

laraPPr · 2025-01-16T10:41:00Z

> That's actually not a bad idea either... But maybe we should do both. I.e. still go to a situation where we require system-specific ReFrame configs to be deployed when a bot is deployed, because there will always be things that are system specific. E.g. the fact that your partition has GPU support. Sure, we could do yet another form of 'autodetection' and say: if nvidia-smi exists and returns some GPU, we should add the GPU feature to your config (with a replacement in the current template config file). But I think this just gets messy fast. What if you want to add the feature 'ALWAYS_REQUEST_GPUS', because that's needed on your system? Or if you need to do additional initialization commands?
>
> And see if that's a manual config that would work.
>
> But honestly, Kenneth's idea might be better. It means we don't have to define these somewhat weird virtual partitions that match 1/4th of a node, and it means we can keep using CPU autodetection (which is something we recommend in our test suite manual as well, so probably good to stick to that ourselves too).

Yes indeed I would also keep the separate config.

casparvl · 2025-01-16T13:12:23Z

> This is if we would also start doing CPU builds at HPC sites?

Well it's relevant for any case where your bot architectures do not all assign the same fraction of a node. If it's all 25% (GPU or CPU node, doesn't matter), we can grab that in one REFRAME_SCALE_TAG. But if it varies per architecture, we need to be able to set an architecture-specific REFRAME_SCALE_TAG.
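A sketch of what an architecture-specific scale tag could look like. This is purely illustrative: no such mapping exists in the bot configuration at this point, and the names `SCALE_TAG_PER_ARCH` and `scale_tag_for` as well as the architecture keys are hypothetical.

```python
DEFAULT_SCALE_TAG = "--tag 1_node"

# Hypothetical per-architecture overrides for a bot instance that hands out
# quarter nodes on one architecture and full nodes everywhere else
SCALE_TAG_PER_ARCH = {
    "x86_64/amd/zen3": "--tag 1_4_node",
}


def scale_tag_for(arch, overrides=SCALE_TAG_PER_ARCH):
    """Return the override for this architecture, or the full-node default."""
    return overrides.get(arch, DEFAULT_SCALE_TAG)


print(scale_tag_for("x86_64/amd/zen3"))  # --tag 1_4_node
print(scale_tag_for("x86_64/amd/zen2"))  # --tag 1_node
```

A single REFRAME_SCALE_TAG in the bot session corresponds to a mapping with one value for every architecture, which is why it cannot express the mixed case.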

casparvl · 2025-01-16T13:13:36Z

Anyway, you can test what I did in this PR: just set REFRAME_SCALE_TAG="--tag 1_4_node" and keep your local config (throw out the hardcoded CPU topology part and simply rely on the autodetection). Then give it a go and keep your fingers crossed ;-)

laraPPr · 2025-01-16T14:00:53Z

Ok let's keep an eye on this one #842 (comment)

laraPPr · 2025-01-17T11:37:49Z

@casparvl the test suite failed with this error, and I think the cause is this PR. RFM_CONFIG_FILES was set on the machine where the bot is running but not in the job. So is that the problem?

Succesfully found and imported eessi.testsuite
./test_suite.sh: line 139: err_msg: command not found
./test_suite.sh: line 140: err_msg: command not found
./test_suite.sh: line 141: err_msg: command not found
ERROR:

Update: found it. I have to set --export=NONE in the Slurm parameters on our system because otherwise it is one big mess. So I'll just set RFM_CONFIG_FILES in the .bashrc of the install user.

laraPPr · 2025-01-20T11:04:44Z

test_suite.sh

+if [ ! -z "$EESSI_ACCELERATOR_TARGET" ]; then
+    REFRAME_PARTITION_NAME=${REFRAME_PARTITION_NAME}_${EESSI_ACCELERATOR_TARGET//\//_}
+fi


This check is failing at UGent: the path was not added.

@casparvl the last time, it failed with this error:

ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen3'

It did not add the GPU part to the system/partition name that it is looking for.
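The name construction can be modelled as follows. This is a Python stand-in for the shell logic in the diff above, assuming the CPU target is `x86_64/amd/zen3` and the accelerator target has the form `nvidia/cc80`. Both values are inferred from the partition names in this thread, not read from the bot, and the function name is made up.

```python
def reframe_partition_name(cpu_target, accelerator_target=""):
    """Mimic the shell substitution: replace '/' by '_' and append the accelerator part if set."""
    name = cpu_target.replace("/", "_")
    if accelerator_target:
        name += "_" + accelerator_target.replace("/", "_")
    return name


# EESSI_ACCELERATOR_TARGET unset or empty: only the CPU part is used
print(reframe_partition_name("x86_64/amd/zen3"))  # x86_64_amd_zen3
# EESSI_ACCELERATOR_TARGET set: matches the partition name in the config
print(reframe_partition_name("x86_64/amd/zen3", "nvidia/cc80"))  # x86_64_amd_zen3_nvidia_cc80
```

The first result is exactly the partition name from the error message, which suggests the accelerator variable was empty when the test step ran.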

I think we're going to have to do something like here

software-layer/bot/build.sh

Lines 158 to 159 in 5fdee78

export EESSI_ACCELERATOR_TARGET=$(cfg_get_value "architecture" "accelerator")

echo "bot/build.sh: EESSI_ACCELERATOR_TARGET='${EESSI_ACCELERATOR_TARGET}'"

testing something now in #842

Got it working by adding this in #842 https://github.com/laraPPr/software-layer/blob/bae19a2adc478471840e94040d7f9cf9a34eae51/test_suite.sh#L88-L107

I guess that's another change that was made to bot/build.sh, but not propagated to bot/test.sh. We didn't notice, because we weren't running the tests for GPUs :)

So, I'd say your solution is correct, but we should do it at the bot/test.sh level.

FYI: I now implemented that in this PR (but can't test myself since I don't have a bot :D)

laraPPr · 2025-01-20T15:51:06Z

test_suite.sh

+# Allow people deploying the bot to overrwide this
+if [ -z "$REFRAME_SCALE_TAG" ]; then
+    REFRAME_SCALE_TAGS="--tag 1_node"
+fi
+if [ -z "$REFRAME_CI_TAG" ]; then
+    REFRAME_CI_TAG="--tag CI"
+fi
+# Allow bot-deployers to add additional args through the environment
+if [ -z "$REFRAME_ADDITIONAL_ARGS" ]; then
+    REFRAME_ADDITIONAL_ARGS=""
+fi
+export REFRAME_ARGS="${REFRAME_CI_TAG} ${REFRAME_SCALE_TAG} ${REFRAME_ADDITIONAL_ARGS} --nocolor ${REFRAME_NAME_ARGS}"


This change seems to have caused problems I'm now getting this error

Listing tests: reframe --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list
usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files] [-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE] [--report-junit FILE] [-s DIR] [--save-log-files] [--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed] [--gpu-only] [--maintainer PATTERN] [-n PATTERN] [-p PATTERN] [-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR] [--ci-generate FILE] [--describe] [-L [{C,T}]] [-l [{C,T}]] [--list-tags] [-r] [--dry-run] [--disable-hook NAME] [--duration TIMEOUT] [--exec-order ORDER] [--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}] [-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE] [--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL] [--skip-performance-check] [--skip-prgenv-check] [--skip-sanity-check] [--skip-system-check] [-M MAPPING] [-m MOD] [--module-mappings FILE] [--module-path PATH] [--non-default-craype] [--purge-env] [-u MOD] [--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...] [--repeat N] [-C FILE] [--detect-host-topology [FILE]] [--failure-stats] [--nocolor] [--performance-report] [--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]
reframe: error: unrecognized arguments: --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS
ERROR: Failed to list ReFrame tests with command: reframe --tag CI --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list

The error is still there, and I'm not sure what is causing it. When I copy the command and test it, I do not get the error.

For some reason all the indents were removed in lines 219-223; testing now whether that caused it.

Hm... I don't see those indents being removed? Also strange: it seems like a valid list of commands. If I just copy paste the whole thing, it lists the tests just fine. One thing that's missing is the scale tag, because of the typo I had earlier. That should be fixed now. But I don't see how that could cause this.
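To illustrate why the scale tag went missing, here is a Python model of the defaulting logic from the diff above. It is not the shell script itself: the function name and the `default_var` parameter are invented for the illustration, and `default_var` marks the variable in which the default ends up being stored.

```python
def build_reframe_args(env, default_var="REFRAME_SCALE_TAG"):
    """Model of the defaulting logic; empty values are treated like unset ones (as `-z` does)."""
    env = dict(env)
    if not env.get("REFRAME_SCALE_TAG"):
        env[default_var] = "--tag 1_node"
    if not env.get("REFRAME_CI_TAG"):
        env["REFRAME_CI_TAG"] = "--tag CI"
    parts = [
        env["REFRAME_CI_TAG"],
        env.get("REFRAME_SCALE_TAG", ""),
        env.get("REFRAME_ADDITIONAL_ARGS", ""),
        "--nocolor",
        env.get("REFRAME_NAME_ARGS", ""),
    ]
    # Empty parts are dropped, like shell word splitting drops the extra spaces
    return " ".join(part for part in parts if part)


names = {"REFRAME_NAME_ARGS": "-n EESSI_LAMMPS"}
# Default stored under the misspelled name: the scale tag never reaches the command
print(build_reframe_args(names, default_var="REFRAME_SCALE_TAGS"))
# --tag CI --nocolor -n EESSI_LAMMPS
# Default stored under the name that is actually read
print(build_reframe_args(names))
# --tag CI --tag 1_node --nocolor -n EESSI_LAMMPS
```

The first output has the same shape as the command in the error message, with a CI tag but no scale tag.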

I have seen this earlier with things like e.g. reframe "${REFRAME_ARGS}" because then the whole REFRAME_ARGS is interpreted by the shell as a single argument (which then obviously doesn't exist). The error looks very similar though:

$ reframe "${REFRAME_ARGS}" --list
usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files] [-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE] [--report-junit FILE] [-s DIR] [--save-log-files] [--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed] [--gpu-only] [--maintainer PATTERN] [-n PATTERN] [-p PATTERN] [-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR] [--ci-generate FILE] [--describe] [-L [{C,T}]] [-l [{C,T}]] [--list-tags] [-r] [--dry-run] [--disable-hook NAME] [--duration TIMEOUT] [--exec-order ORDER] [--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}] [--flex-alloc-strict] [-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE] [--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL] [--skip-performance-check] [--skip-prgenv-check] [--skip-sanity-check] [--skip-system-check] [-M MAPPING] [-m MOD] [--module-mappings FILE] [--module-path PATH] [--non-default-craype] [--purge-env] [-u MOD] [--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...] [--repeat N] [-C FILE] [--detect-host-topology [FILE]] [--failure-stats] [--nocolor] [--performance-report] [--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]
reframe: error: unrecognized arguments: --tag CI --tag 1_node --nocolor -n OSU

I don't understand how you'd get that though, because ${REFRAME_ARGS} isn't quoted in test_suite.sh...
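The single-argument effect can be reproduced with a toy parser. The sketch assumes ReFrame's command line is handled by Python's argparse (the usage message has that shape), and the four options defined here are only a stand-in for the real option set.

```python
import argparse
import shlex

parser = argparse.ArgumentParser(prog="reframe-like")
parser.add_argument("--tag", "-t", action="append", default=[])
parser.add_argument("--name", "-n", action="append", default=[])
parser.add_argument("--nocolor", action="store_true")
parser.add_argument("--list", "-l", action="store_true")

reframe_args = "--tag CI --tag 1_node --nocolor -n OSU"

# Unquoted ${REFRAME_ARGS}: the shell splits on whitespace, one argv entry per word
parsed, unknown = parser.parse_known_args(shlex.split(reframe_args) + ["--list"])
print(parsed.tag, unknown)  # ['CI', '1_node'] []

# Quoted "${REFRAME_ARGS}": the whole string arrives as a single argv entry
parsed, unknown = parser.parse_known_args([reframe_args, "--list"])
print(parsed.tag, unknown)  # [] ['--tag CI --tag 1_node --nocolor -n OSU']
```

In the second call the whole string ends up among the unrecognized arguments, which is what a plain `parse_args` call would report as an "unrecognized arguments" error.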

Anyway, maybe retry now that the REFRAME_SCALE_TAGS typo is corrected. See if that somehow helps.

And you have this:

'features': [ FEATURES[GPU] ] + list(SCALES.keys()),

as features for your current partition? Or at least something that allows the 1_4_node scale?

Because if I run

reframe --tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list

on Snellius with the standard ReFrame config file (i.e. the one under version control in the test-suite repo) I get the tests listed correctly.

If you were able to recreate it locally, try with -vvvvv, and see if you can figure out why things get filtered. To reduce the amount of output, you could limit the search path to the lammps test using the -c argument (-c /kyukon/scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/tests/apps/lammps)

Yes, it works for me as well with
reframe --tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list
but if I do this then it doesn't, and that is what the bot does now:
reframe '--tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS' --list

I'm going to take a closer look at it next week and figure it out.

I opened an issue with ReFrame because I have no idea what is happening and what might cause it to behave like this: reframe-hpc/reframe#3369

test_suite.sh

Co-authored-by: Lara Ramona Peeters

… as we need it to append the right path to the MODULEPATH for the test_suite.sh

eessi-bot · 2025-01-28T22:47:22Z

New job on instance eessi-bot-mc-azure for CPU micro-architecture x86_64-amd-zen4 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.01/pr_850/68

date	job status	comment
Jan 28 22:47:21 UTC 2025	submitted	job id `68` awaits release by job manager
Jan 28 22:47:57 UTC 2025	released	job awaits launch by Slurm scheduler
Jan 28 22:53:02 UTC 2025	running	job `68` is running
Jan 28 23:06:48 UTC 2025	finished	😁 SUCCESS
Details
✅ job output file `slurm-68.out`
✅ no message matching `FATAL:`
✅ no message matching `ERROR:`
✅ no message matching `FAILED:`
✅ no message matching `required modules missing:`
✅ found message(s) matching `No missing installations`
✅ found message matching `.tar.gz created!`
Artefacts
`eessi-2023.06-software-linux-x86_64-amd-zen4-1738105003.tar.gz`
size: 1 MiB (1247976 bytes)
entries: 69
modules under 2023.06/software/linux/x86_64/amd/zen4/modules/all
`BCFtools/1.19-GCC-13.2.0.lua`
software under 2023.06/software/linux/x86_64/amd/zen4/software
`BCFtools/1.19-GCC-13.2.0`
other under 2023.06/software/linux/x86_64/amd/zen4
no other files in tarball
Jan 28 23:06:48 UTC 2025	test result	😁 SUCCESS
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86_64_amd_zen4+default
    P: perf: 1799.795 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86_64_amd_zen4+default
    P: perf: 1783.286 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 4.35 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 4.01 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 10.9 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 13.11 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 0.55 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86_64_amd_zen4+default
    P: latency: 0.55 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86_64_amd_zen4+default
    P: bandwidth: 49779.97 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86_64_amd_zen4+default
    P: bandwidth: 49612.69 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file `slurm-68.out`
✅ no message matching `ERROR:`
✅ no message matching `[\sFAILED\s].Ran . test case`

laraPPr · 2025-01-28T22:50:16Z

Should I also add the gent config to shared_fs_path or is it ok to leave it in the bashrc?

casparvl · 2025-01-28T22:59:03Z

Well, this PR supports both options, as I see no reason not to respect the environment variable if it is set.

I personally have a preference for shared_fs_path because it is the one path that is guaranteed to be available to all bot jobs (by design). It also means that when deploying a new bot, you only have to fiddle in one directory (this directory also contains your host-injections).

Up to you, I guess :) I would advise keeping the reframe_config.py under version control, in the same repo as your bot config (as I have done for the AWS and Azure clusters, see the relevant PR there).

laraPPr · 2025-01-28T23:01:51Z

OK, then this one is ready for merging once the bot-configs one is merged?

casparvl · 2025-01-28T23:03:31Z

Once #850 (comment) is successful, I will remove BCFtools from this PR; that was just a dummy to prove this thing works. Then we can merge this PR and the PR to the bot-configs, in no particular order ;-)

…nctionality of the test step

laraPPr · 2025-01-28T23:07:26Z

bot: show-config

eessi-bot · 2025-01-28T23:07:29Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command show-config from laraPPr
- expanded format: show-config
handling command show-config failed with message
unknown command show-config; use bot: help for usage information

eessi-bot · 2025-01-28T23:07:29Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command show-config from laraPPr
- expanded format: show-config
handling command show-config failed with message
unknown command show-config; use bot: help for usage information

gpu-bot-ugent · 2025-01-28T23:07:29Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command show-config from laraPPr
- expanded format: show-config
handling command show-config failed with message
unknown command show-config; use bot: help for usage information

laraPPr · 2025-01-28T23:07:59Z

bot: help

gpu-bot-ugent · 2025-01-28T23:08:02Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot

eessi-bot · 2025-01-28T23:08:02Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot

eessi-bot · 2025-01-28T23:08:03Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot

laraPPr · 2025-01-28T23:08:20Z

bot: show_config

eessi-bot · 2025-01-28T23:08:23Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command show_config from laraPPr
- expanded format: show_config
handling command show_config resulted in:
- added comment Replace the use of a ReFrame template config file for a manually created one #850 (comment) to show configuration

eessi-bot · 2025-01-28T23:08:23Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command show_config from laraPPr
- expanded format: show_config
handling command show_config resulted in:
- added comment Replace the use of a ReFrame template config file for a manually created one #850 (comment) to show configuration

gpu-bot-ugent · 2025-01-28T23:08:23Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command show_config from laraPPr
- expanded format: show_config
handling command show_config resulted in:
- added comment Replace the use of a ReFrame template config file for a manually created one #850 (comment) to show configuration

eessi-bot · 2025-01-28T23:08:25Z

Instance eessi-bot-mc-aws is configured to build for:

architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot · 2025-01-28T23:08:26Z

Instance eessi-bot-mc-azure is configured to build for:

architectures: x86_64/amd/zen4
repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

gpu-bot-ugent · 2025-01-28T23:08:27Z

Instance eessi-bot-vsc-ugent is configured to build for:

architectures: x86_64/amd/zen3
repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat

casparvl · 2025-01-28T23:10:01Z

Ok, should be ready for a final review & merge now!

laraPPr · 2025-01-28T23:14:38Z

@Neves-P Will this impact dev.eessi.io? If so, you might need to add a ReFrame config file to that bot as well.

laraPPr

lgtm

Neves-P · 2025-01-29T08:31:48Z

> @Neves-P Will this impact dev.eessi.io? If so, you might need to add a ReFrame config file to that bot as well.

Thanks for the heads up! At the moment I think it should be fine, because we aren't running the test command (most of the software to test isn't there yet). But from what I understand, it's something to keep in mind when we do start running the test script on dev.eessi.io.
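
Note: this PR requires each bot instance to have `RFM_CONFIG_FILES` pointing at a manually created ReFrame config file. That could be verified before a test step runs. The following is a minimal sketch, not code from this PR. The helper name `check_reframe_config` is hypothetical, and the sketch assumes `RFM_CONFIG_FILES` holds one or more paths separated by colons.

```python
import os


def check_reframe_config(environ=None):
    """Return a list of problems with RFM_CONFIG_FILES; empty if none were found."""
    environ = os.environ if environ is None else environ
    value = environ.get("RFM_CONFIG_FILES", "")
    if not value:
        return ["RFM_CONFIG_FILES is not set"]
    problems = []
    # Assumption: multiple config files are separated by colons.
    for path in value.split(":"):
        if not os.path.isfile(path):
            problems.append(f"config file not found: {path}")
    return problems


if __name__ == "__main__":
    # Print any problems found in the current environment.
    for problem in check_reframe_config():
        print(problem)
```

Running this in the shell session that starts the bot app would show whether the variable is missing or points at a file that does not exist.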

eessi-bot · 2025-01-29T08:34:40Z

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.01/pr_850/43200', '/project/def-users/SHARED/jobs/2025.01/pr_850/43201', '/project/def-users/SHARED/jobs/2025.01/pr_850/43202', '/project/def-users/SHARED/jobs/2025.01/pr_850/43203', '/project/def-users/SHARED/jobs/2025.01/pr_850/43204', '/project/def-users/SHARED/jobs/2025.01/pr_850/43205', '/project/def-users/SHARED/jobs/2025.01/pr_850/43206', '/project/def-users/SHARED/jobs/2025.01/pr_850/43208', '/project/def-users/SHARED/jobs/2025.01/pr_850/43209', '/project/def-users/SHARED/jobs/2025.01/pr_850/43210', '/project/def-users/SHARED/jobs/2025.01/pr_850/43211', '/project/def-users/SHARED/jobs/2025.01/pr_850/43212', '/project/def-users/SHARED/jobs/2025.01/pr_850/43213', '/project/def-users/SHARED/jobs/2025.01/pr_850/43214', '/project/def-users/SHARED/jobs/2025.01/pr_850/43215', '/project/def-users/SHARED/jobs/2025.01/pr_850/43216', '/project/def-users/SHARED/jobs/2025.01/pr_850/43217', '/project/def-users/SHARED/jobs/2025.01/pr_850/43218'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.29

eessi-bot · 2025-01-29T08:34:40Z

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.01/pr_850/68'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.29

gpu-bot-ugent · 2025-01-29T08:34:41Z

PR merged! Moved [] to /scratch/gent/vo/002/gvo00211/SHARED/trash_bin/EESSI/software-layer/2025.01.29

Replace the use of a ReFrame template config file for a manually created one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app

65e4c36

Make the ReFrame args configurable through environment in which the bot instance runs.

d8794a1

laraPPr pushed a commit to laraPPr/software-layer that referenced this pull request Jan 16, 2025

merge with EESSI#850

8e5368a

laraPPr requested changes Jan 17, 2025

test_suite.sh (outdated, resolved)

laraPPr reviewed Jan 20, 2025

laraPPr requested changes Jan 20, 2025

test_suite.sh (outdated, resolved)

casparvl and others added 2 commits January 21, 2025 03:03

Apply suggestions from code review

5fc7332

Co-authored-by: Lara Ramona Peeters <[email protected]>

Make sure the EESSI_ACCELERATOR_TARGET is also set for the test step, as we need it to append the right path to the MODULEPATH for the test_suite.sh

ebe999f

Remove BCFtools from this PR, it was just meant to demonstrate the functionality of the test step

61d4617

Removed white line and comments

896f2d4

laraPPr approved these changes Jan 28, 2025

laraPPr merged commit 902a20e into EESSI:2023.06-software.eessi.io Jan 29, 2025
49 checks passed

	export EESSI_ACCELERATOR_TARGET=$(cfg_get_value "architecture" "accelerator")
	echo "bot/build.sh: EESSI_ACCELERATOR_TARGET='${EESSI_ACCELERATOR_TARGET}'"
