
Replace the use of a ReFrame template config file for a manually created one #850

Merged

Conversation

casparvl
Collaborator

@casparvl casparvl commented Jan 13, 2025

This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app.

@laraPPr I'll send you an example config file that should work with this PR. It'd be great if you could test it for me and let me know if it works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files and see if I can get it to work there...

WARNING: merging this PR will break any bot instance that has not manually created a ReFrame config file and set the RFM_CONFIG_FILES environment variable to point to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.

…ted one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app

eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@casparvl
Collaborator Author

casparvl commented Jan 13, 2025

@laraPPr I think if you set RFM_CONFIG_FILES to point to the file below (in the shell session running the bot app), this _should_ work for you:

# reframe_config_bot.py

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_amd_zen3_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'access': ['--export=None', '--nodes=1', '--cluster=accelgor', '--ntasks-per-node=12', '--gpus-per-node=1'],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': __MEM_PER_NODE__,
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1,
                },
            ],
        },
    ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
        },
    ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}
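With this PR, pointing the bot at such a file is just a matter of exporting RFM_CONFIG_FILES in the shell session that runs the bot app. A sketch (the path here is hypothetical; use wherever you store the file):

```shell
# Point ReFrame at the manually created config, in the shell session
# that runs the bot app (hypothetical path).
export RFM_CONFIG_FILES="${HOME}/bot-configs/reframe_config_bot.py"
echo "ReFrame config: ${RFM_CONFIG_FILES}"
```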

The only thing I'd be curious about is whether the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the cgroup for this allocation), or 48. Maybe you can have a look at the generated topology file.

Anyway, let me know :)

@casparvl
Collaborator Author

casparvl commented Jan 14, 2025

Hmmm, so I tested this myself. I had the following config file:

$ cat example_reframe_config.py
# WARNING: this file is intended as template and the __X__ template variables need to be replaced
# before it can act as a configuration file
# Once replaced, this is a config file for running tests after the build phase, by the bot

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_intel_icelake_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    # Suppose that we have configured the bot with
                    # slurm_params = --hold --nodes=1 --export=None --time=0:30:0
                    # arch_target_map = {
                    #     "linux/x86_64/amd/zen3" : "--partition=gpu --ntasks-per-node=12 --gpus-per-node 1" }
                    # We would specify the relevant parameters as access flags:
                    'access': ['--export=None', '--nodes=1', '--partition=gpu_a100', '--ntasks-per-node=18', '--gpus-per-node=1' ],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': 491520,
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1,
                },
            ],
        },
    ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
        },
    ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}

Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.

...
    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],
...
...
  "num_cpus": 72,
  "num_cpus_per_core": 1,
  "num_cpus_per_socket": 36,
  "num_sockets": 2
}

In a way that's understandable: you don't know which socket you'll land on, so what should it put for the sockets field: "0x000000000fffffffff" or "0xfffffffff000000000"? That would depend on which part of the node your job happens to land on.

A way out is of course to define the full thing manually. It means we don't have the core layout, but that piece of information is unreliable anyway, since we don't know a priori on which core set our build job (which allocates 1/4 of a node) will land. But I could quite easily define:

{
    "num_cpus": 18,
    "num_cpus_per_core": 1,
    "num_cpus_per_socket": 18,
    "num_sockets": 1
}
manually. We'd have to check that the tests don't request any information beyond this, but I think (at least for now) they don't.
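If I read the ReFrame config reference correctly, this would go into the partition as a 'processor' entry, which removes the need for autodetection of these fields. A sketch, using the quarter-node numbers above:

```python
# Sketch: manually specified processor info for the quarter-node 'partition'.
# If 'processor' is present in the partition config, ReFrame should not need
# to autodetect these fields.
partition_processor = {
    'processor': {
        'num_cpus': 18,
        'num_cpus_per_core': 1,
        'num_cpus_per_socket': 18,
        'num_sockets': 1,
    },
}

# Internal consistency check: total CPUs must equal sockets x CPUs per socket.
proc = partition_processor['processor']
assert proc['num_cpus'] == proc['num_sockets'] * proc['num_cpus_per_socket']
print('consistent')
```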

Anyway, unless your bot is allocating full nodes, we should probably turn off CPU autodetection and specify CPU topology manually in the ReFrame config file...

@laraPPr
Collaborator

laraPPr commented Jan 14, 2025

The PyTorch tests don't run when processor information is set in the config file.

@laraPPr
Collaborator

laraPPr commented Jan 14, 2025

And I'm afraid that we will be in the queue forever waiting for a free node.

@laraPPr
Collaborator

laraPPr commented Jan 14, 2025

I do already need this: https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130
So maybe we should do something similar for the reframe command?

@casparvl
Collaborator Author

The PyTorch tests don't run when processor information is set in the config file

What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

And I'm afraid that we will be in the queue forever waiting for a free node.

I'm confused how that's related to this change in the PR :D You mean your bot job doesn't get allocated because it is busy, i.e. you have trouble testing?

@laraPPr
Collaborator

laraPPr commented Jan 14, 2025

I'm confused how that's related to this change in the PR :D You mean your bot job doesn't get allocated because it is busy, i.e. you have trouble testing?

Yes, it takes very long to get an allocation, so I have trouble testing. Maybe in production we should just use a full node, but then it could take 24 hours or more to get an allocation. Right now it starts quickly, because I'm only asking for 1 GPU for half an hour.

@laraPPr
Collaborator

laraPPr commented Jan 14, 2025

What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

Reason: attribute error: ../../../../../../../scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).

    raise AttributeError(msg)

@casparvl
Collaborator Author

Can you paste the relevant part from your config describing the CPU topology?

@laraPPr
Collaborator

laraPPr commented Jan 15, 2025

I think I tried setting num_cores_per_numa_node, but I did not really know what to put there.

                        'num_cpus': 96,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 48,
                        'num_cpus_per_core': 1,
                        'arch': 'zen2',

@laraPPr
Collaborator

laraPPr commented Jan 15, 2025

@casparvl I just discussed it with Kenneth and his idea was that we introduce another environment variable that would be preferred over what is set in the test_suite.sh script on this line:

export REFRAME_ARGS="--tag CI --tag 1_node --nocolor ${REFRAME_NAME_ARGS}"
Then a site could set a different REFRAME_ARGS if they need to?
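A minimal sketch of that pattern (assuming a variable name like REFRAME_SCALE_TAG; the default mirrors what test_suite.sh currently hardcodes):

```shell
# Keep the site-provided value if it is already set in the environment;
# otherwise fall back to the current default from test_suite.sh.
REFRAME_SCALE_TAG="${REFRAME_SCALE_TAG:---tag 1_node}"
export REFRAME_ARGS="--tag CI ${REFRAME_SCALE_TAG} --nocolor"
echo "${REFRAME_ARGS}"
```

A site that builds on quarter nodes could then simply export REFRAME_SCALE_TAG="--tag 1_4_node" before starting the bot.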

@casparvl
Collaborator Author

That's actually not a bad idea either... But maybe we should do both. I.e. still go to a situation where we require system-specific ReFrame configs to be deployed when a bot is deployed, because there will always be things that are system specific. E.g. the fact that your partition has GPU support. Sure, we could do yet another form of 'autodetection' and say: if nvidia-smi exists and returns some GPU, we should add the GPU feature to your config (with a replacement in the current template config file). But I think this just gets messy fast. What if you want to add the feature 'ALWAYS_REQUEST_GPUS', because that's needed on your system? Or if you need to do additional initialization commands?

In that scenario, the manually created config file should describe the actual partition (i.e. you'd put devices: 4 for a 4-GPU node, rather than the 1 that I tried). That way, you could still use CPU autodetection (which detects the topology for a full node).

To specifically tackle your issue: I think the num_cores_per_numa_node is an inferred quantity. I.e. it is probably computed from the

  "topology": {
    "numa_nodes": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],

and

  "num_cpus": 72,

sections. I think there is also a num_cores_per_socket, which is probably inferred from

    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],

and

   "num_cpus": 72,

Since you have a quarter node, I guess you could try to set

  "topology": {
    "numa_nodes": [
      "0x000000000000000000",
    ],
...
"num_cpus": 24  # in your case, I think this is a quarter node

And see if that's a manual config that would work.
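As a sanity check on the "inferred quantity" reasoning: counting the set bits in each cpuset mask does reproduce the per-NUMA-node count. A sketch, using the autodetected masks quoted above:

```python
# Hex cpuset masks from the autodetected topology: one mask per NUMA node,
# each set bit marking a CPU that belongs to that node.
numa_nodes = [
    "0x000000000fffffffff",
    "0xfffffffff000000000",
]

def cpus_in_mask(mask: str) -> int:
    """Number of CPUs covered by a cpuset mask (= number of set bits)."""
    return bin(int(mask, 16)).count("1")

counts = [cpus_in_mask(m) for m in numa_nodes]
print(counts)       # [36, 36] -> num_cores_per_numa_node = 36
print(sum(counts))  # 72, matching the reported num_cpus
```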

But honestly, Kenneth's idea might be better. It means we don't have to define these somewhat weird virtual partitions that match 1/4th of a node, and it means we can keep using CPU autodetection (which is something we recommend in our test suite manual as well, so probably good to stick to that ourselves too).

@casparvl
Collaborator Author

Crap, I realize one thing: if you set e.g. REFRAME_SCALE_TAG="--tag 1_4_node" in the session running the bot instance, this will apply to all of your jobs. What if you have a bot instance that assigns quarter GPU nodes, but full CPU nodes for building?

If we'd have a feature in the bot config that allows us to set additional environment variables per architecture, that would resolve it. Anyway, let's test this first. If it works, let's talk to @trz42 about such a feature; I don't think it should be too difficult. Note that this feature is different from the one implemented by Sam Moors, which added support for specifying environment variables as part of the bot-build command (EESSI/eessi-bot-software-layer#281). That's not what we want here, as that would require the person commanding the bot to give the right scale, whereas this should just be a constant for a given bot instance, and thus be part of the bot configuration.

@laraPPr
Collaborator

laraPPr commented Jan 16, 2025

Crap, I realize one thing: if you set e.g. REFRAME_SCALE_TAG="--tag 1_4_node" in the session running the bot instance, this will apply to all of your jobs. What if you have a bot instance that assigns quarter GPU nodes, but full CPU nodes for building?

This is if we would also start doing CPU builds at HPC sites?

@laraPPr
Collaborator

laraPPr commented Jan 16, 2025

That's actually not a bad idea either... But maybe we should do both. I.e. still go to a situation where we require system-specific ReFrame configs to be deployed when a bot is deployed, because there will always be things that are system specific. E.g. the fact that your partition has GPU support. Sure, we could do yet another form of 'autodetection' and say: if nvidia-smi exists and returns some GPU, we should add the GPU feature to your config (with a replacement in the current template config file). But I think this just gets messy fast. What if you want to add the feature 'ALWAYS_REQUEST_GPUS', because that's needed on your system? Or if you need to do additional initialization commands?

And see if that's a manual config that would work.

But honestly, Kenneth's idea might be better. It means we don't have to define these somewhat weird virtual partitions that match 1/4th of a node, and it means we can keep using CPU autodetection (which is something we recommend in our test suite manual as well, so probably good to stick to that ourselves too).

Yes indeed I would also keep the separate config.

@casparvl
Collaborator Author

This is if we would also start doing CPU builds at HPC sites

Well it's relevant for any case where your bot architectures do not all assign the same fraction of a node. If it's all 25% (GPU or CPU node, doesn't matter), we can grab that in one REFRAME_SCALE_TAG. But if it varies per architecture, we need to be able to set an architecture-specific REFRAME_SCALE_TAG.

@casparvl
Collaborator Author

Anyway, you can test what I did in this PR: just set REFRAME_SCALE_TAG="--tag 1_4_node", keep your local config (throw out the hardcoded CPU topology part and simply rely on the autodetection), and then give it a go and keep your fingers crossed ;-)

laraPPr pushed a commit to laraPPr/software-layer that referenced this pull request Jan 16, 2025
@laraPPr
Collaborator

laraPPr commented Jan 16, 2025

Ok let's keep an eye on this one #842 (comment)

@laraPPr
Collaborator

laraPPr commented Jan 17, 2025

@casparvl the test suite failed with this error; I think the cause is this PR. RFM_CONFIG_FILES was set on the machine where the bot is running, but not in the job. So is that the problem?

^[[32mSuccesfully found and imported eessi.testsuite^[[0m

./test_suite.sh: line 139: err_msg: command not found

./test_suite.sh: line 140: err_msg: command not found

./test_suite.sh: line 141: err_msg: command not found

^[[31mERROR: ^[[0m

Update: found it. I have to set --export=NONE in the Slurm parameters on our system, because otherwise it is one big mess. So I'll just set RFM_CONFIG_FILES in the .bashrc of the install user.

test_suite.sh Outdated
Comment on lines 149 to 151
if [ ! -z "$EESSI_ACCELERATOR_TARGET" ]; then
REFRAME_PARTITION_NAME=${REFRAME_PARTITION_NAME}_${EESSI_ACCELERATOR_TARGET//\//_}
fi
Collaborator

This check is failing at UGent; the path was not added.

Collaborator

@casparvl the last time it failed with this error:

ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen3'

It did not add the GPU part to the system name that it is looking for.
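For reference, what the substitution in the snippet under review should produce, with hypothetical values mirroring what bot/build.sh exports:

```shell
# Hypothetical values; EESSI_ACCELERATOR_TARGET would come from the bot config.
EESSI_ACCELERATOR_TARGET="nvidia/cc80"
REFRAME_PARTITION_NAME="x86_64_amd_zen3"
if [ ! -z "$EESSI_ACCELERATOR_TARGET" ]; then
    # Replace every '/' with '_' before appending to the partition name.
    REFRAME_PARTITION_NAME=${REFRAME_PARTITION_NAME}_${EESSI_ACCELERATOR_TARGET//\//_}
fi
echo "$REFRAME_PARTITION_NAME"  # x86_64_amd_zen3_nvidia_cc80
```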

Collaborator

I think we're gonna have to do something like here:

software-layer/bot/build.sh

Lines 158 to 159 in 5fdee78

export EESSI_ACCELERATOR_TARGET=$(cfg_get_value "architecture" "accelerator")
echo "bot/build.sh: EESSI_ACCELERATOR_TARGET='${EESSI_ACCELERATOR_TARGET}'"

Collaborator

@laraPPr laraPPr Jan 20, 2025

testing something now in #842

Collaborator Author

I guess that's another change that was made to bot/build.sh, but not propagated to bot/test.sh. We didn't notice, because we weren't running the tests for GPUs :)

So, I'd say your solution is correct, but we should do it at the bot/test.sh level.

Collaborator Author

@casparvl casparvl Jan 21, 2025

FYI: I now implemented that in this PR (but can't test myself since I don't have a bot :D)

test_suite.sh Outdated
Comment on lines 237 to 248
# Allow people deploying the bot to overrwide this
if [ -z "$REFRAME_SCALE_TAG" ]; then
REFRAME_SCALE_TAGS="--tag 1_node"
fi
if [ -z "$REFRAME_CI_TAG" ]; then
REFRAME_CI_TAG="--tag CI"
fi
# Allow bot-deployers to add additional args through the environment
if [ -z "$REFRAME_ADDITIONAL_ARGS" ]; then
REFRAME_ADDITIONAL_ARGS=""
fi
export REFRAME_ARGS="${REFRAME_CI_TAG} ${REFRAME_SCALE_TAG} ${REFRAME_ADDITIONAL_ARGS} --nocolor ${REFRAME_NAME_ARGS}"
Collaborator

This change seems to have caused problems; I'm now getting this error:

Listing tests: reframe --tag CI   --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list

usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files]

               [-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE]

               [--report-junit FILE] [-s DIR] [--save-log-files]

               [--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed]

               [--gpu-only] [--maintainer PATTERN] [-n PATTERN] [-p PATTERN]

               [-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR]

               [--ci-generate FILE] [--describe] [-L [{C,T}]] [-l [{C,T}]]

               [--list-tags] [-r] [--dry-run] [--disable-hook NAME]

               [--duration TIMEOUT] [--exec-order ORDER]

               [--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}]

               [-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE]

               [--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL]

               [--skip-performance-check] [--skip-prgenv-check]

               [--skip-sanity-check] [--skip-system-check] [-M MAPPING]

               [-m MOD] [--module-mappings FILE] [--module-path PATH]

               [--non-default-craype] [--purge-env] [-u MOD]

               [--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...]

               [--repeat N] [-C FILE] [--detect-host-topology [FILE]]

               [--failure-stats] [--nocolor] [--performance-report]

               [--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]

reframe: error: unrecognized arguments: --tag CI   --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS

^[[31mERROR: Failed to list ReFrame tests with command: reframe --tag CI   --nocolor -n EESSI_OSU_Micro_Benchmarks -n EESSI_LAMMPS --list^[[0m

Collaborator

The error is still there; not sure what is causing it. When I copy the command and test it manually, I do not get the error.

Collaborator

For some reason all the indents were removed in lines 219-223; testing now whether that caused it.

Collaborator Author

Hm... I don't see those indents being removed? Also strange: it seems like a valid list of commands. If I just copy paste the whole thing, it lists the tests just fine. One thing that's missing is the scale tag, because of the typo I had earlier. That should be fixed now. But I don't see how that could cause this.

I have seen this earlier with things like e.g. reframe "${REFRAME_ARGS}" because then the whole REFRAME_ARGS is interpreted by the shell as a single argument (which then obviously doesn't exist). The error looks very similar though:

$ reframe "${REFRAME_ARGS}" --list
usage: reframe [-h] [--compress-report] [--dont-restage] [--keep-stage-files] [-o DIR] [--perflogdir DIR] [--prefix DIR] [--report-file FILE]
               [--report-junit FILE] [-s DIR] [--save-log-files] [--timestamp [TIMEFMT]] [-c PATH] [-R] [--cpu-only] [--failed] [--gpu-only]
               [--maintainer PATTERN] [-n PATTERN] [-p PATTERN] [-T PATTERN] [-t PATTERN] [-x PATTERN] [-E EXPR] [--ci-generate FILE] [--describe]
               [-L [{C,T}]] [-l [{C,T}]] [--list-tags] [-r] [--dry-run] [--disable-hook NAME] [--duration TIMEOUT] [--exec-order ORDER]
               [--exec-policy POLICY] [--flex-alloc-nodes {all|STATE|NUM}] [--flex-alloc-strict] [-J OPT] [--max-retries NUM] [--maxfail NUM] [--mode MODE]
               [--reruns N] [--restore-session [REPORT]] [-S [TEST.]VAR=VAL] [--skip-performance-check] [--skip-prgenv-check] [--skip-sanity-check]
               [--skip-system-check] [-M MAPPING] [-m MOD] [--module-mappings FILE] [--module-path PATH] [--non-default-craype] [--purge-env] [-u MOD]
               [--distribute [{all|avail|STATE}]] [-P VAR:VAL0,VAL1,...] [--repeat N] [-C FILE] [--detect-host-topology [FILE]] [--failure-stats]
               [--nocolor] [--performance-report] [--show-config [PARAM]] [--system SYSTEM] [-V] [-v] [-q]
reframe: error: unrecognized arguments: --tag CI --tag 1_node  --nocolor -n OSU

I don't understand how you'd get that though, because ${REFRAME_ARGS} isn't quoted in test_suite.sh...
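The difference is easy to demonstrate in isolation (a sketch):

```shell
count_args() { echo "$#"; }

REFRAME_ARGS="--tag CI --tag 1_node --nocolor"
count_args ${REFRAME_ARGS}    # unquoted: word-split into 5 separate arguments
count_args "${REFRAME_ARGS}"  # quoted: passed as 1 argument, which reframe rejects
```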

Collaborator Author

Anyway, maybe retry now that the REFRAME_SCALE_TAGS typo is corrected. See if that somehow helps.

Collaborator Author

@casparvl casparvl Jan 22, 2025

And you have this:

'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),

as features for your current partition? Or at least something that allows the 1_4_node scale?

Because if I run

reframe --tag=CI --tag=1_4_node  --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list

on Snellius with the standard ReFrame config file (i.e. the one under version control in the test-suite repo) I get the tests listed correctly.

Collaborator Author

If you were able to recreate it locally, try with -vvvvv, and see if you can figure out why things get filtered. To reduce the amount of output, you could limit the search path to the lammps test using the -c argument (-c /kyukon/scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/tests/apps/lammps)

Collaborator

@laraPPr laraPPr Jan 22, 2025

Yes, it works for me as well with
reframe --tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list
but if I do this, then it doesn't, and that is what is now done by the bot:
reframe '--tag=CI --tag=1_4_node --nocolor -n EESSI_OSU -n EESSI_LAMMPS' --list

Collaborator

I'm gonna take a closer look at it next week and figure it out

Collaborator

I opened an issue with ReFrame, because I have no idea what is happening and what might cause it to behave like this: reframe-hpc/reframe#3369

casparvl and others added 2 commits January 21, 2025 03:03
Co-authored-by: Lara Ramona Peeters <[email protected]>
… as we need it to append the right path to the MODULEPATH for the test_suite.sh

eessi-bot bot commented Jan 28, 2025

New job on instance eessi-bot-mc-azure for CPU micro-architecture x86_64-amd-zen4 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.01/pr_850/68

date | job status | comment
Jan 28 22:47:21 UTC 2025 | submitted | job id 68 awaits release by job manager
Jan 28 22:47:57 UTC 2025 | released | job awaits launch by Slurm scheduler
Jan 28 22:53:02 UTC 2025 | running | job 68 is running
Jan 28 23:06:48 UTC 2025 | finished |
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-68.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1738105003.tar.gz size: 1 MiB (1247976 bytes)
entries: 69
modules under 2023.06/software/linux/x86_64/amd/zen4/modules/all
BCFtools/1.19-GCC-13.2.0.lua
software under 2023.06/software/linux/x86_64/amd/zen4/software
BCFtools/1.19-GCC-13.2.0
other under 2023.06/software/linux/x86_64/amd/zen4
no other files in tarball
Jan 28 23:06:48 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86_64_amd_zen4+default
P: perf: 1799.795 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86_64_amd_zen4+default
P: perf: 1783.286 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86_64_amd_zen4+default
P: latency: 4.35 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86_64_amd_zen4+default
P: latency: 4.01 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86_64_amd_zen4+default
P: latency: 10.9 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86_64_amd_zen4+default
P: latency: 13.11 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86_64_amd_zen4+default
P: latency: 0.55 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86_64_amd_zen4+default
P: latency: 0.55 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86_64_amd_zen4+default
P: bandwidth: 49779.97 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86_64_amd_zen4+default
P: bandwidth: 49612.69 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
✅ job output file slurm-68.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

Should I also add the gent config to shared_fs_path or is it ok to leave it in the bashrc?

@casparvl
Collaborator Author

Well, this PR supports both options, as I see no reason not to respect the environment variable if it is set.

I personally have a preference for shared_fs_path because it is the one path that is guaranteed to be available to all bot jobs (by design). It also means that when deploying a new bot, you only have to fiddle in one directory (this directory also contains your host-injections).

Up to you, I guess :) I would advise keeping the reframe_config.py under version control, in the same repo as your bot config (as I have done for the AWS and Azure clusters; see the relevant PR there).
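To illustrate the setup discussed above, this is a minimal sketch of how a bot deployer could wire things up. All paths here are hypothetical placeholders, not the actual cluster paths; substitute your own bot's shared_fs_path:

```shell
# Hypothetical example paths; adjust to your own bot deployment.
# This mirrors the approach described above: keep the manually created
# ReFrame config under the bot's shared_fs_path (available to all bot
# jobs by design) and point RFM_CONFIG_FILES at it in the session
# running the bot app (e.g. in its ~/.bashrc).
SHARED_FS_PATH="/shared/eessi-bot"
export RFM_CONFIG_FILES="${SHARED_FS_PATH}/reframe_config.py"
echo "RFM_CONFIG_FILES=${RFM_CONFIG_FILES}"
```

Either location works with this PR, since the environment variable is respected if set; shared_fs_path just keeps everything for a deployment in one directory.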

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

Ok, then this one is ready for merging once the bot-configs one is merged?

@casparvl
Collaborator Author

Once #850 (comment) is successful, I will remove the BCFTools from this PR - that was just a dummy to prove this thing works. Then we can merge this PR and the PR to the bot-configs, in no particular order ;-)

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

bot: show-config


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command show-config from laraPPr

    • expanded format: show-config
  • handling command show-config failed with message
    unknown command show-config; use bot: help for usage information


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command show-config from laraPPr

    • expanded format: show-config
  • handling command show-config failed with message
    unknown command show-config; use bot: help for usage information

@gpu-bot-ugent

gpu-bot-ugent bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command show-config from laraPPr

    • expanded format: show-config
  • handling command show-config failed with message
    unknown command show-config; use bot: help for usage information

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

bot: help

@gpu-bot-ugent

gpu-bot-ugent bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot
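Putting the usage rules above together, a valid invocation is a fresh PR comment with one command per line, for example (both commands are taken from the supported list above):

```text
bot: help
bot: show_config
```

Note that show-config (with a hyphen) is rejected as an unknown command, as seen earlier in this thread; the underscore form show_config is the supported spelling.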


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

bot: show_config


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)


eessi-bot bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)

@gpu-bot-ugent

gpu-bot-ugent bot commented Jan 28, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)


eessi-bot bot commented Jan 28, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Jan 28, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@gpu-bot-ugent

gpu-bot-ugent bot commented Jan 28, 2025

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat

@casparvl
Collaborator Author

Ok, should be ready for a final review & merge now!

@laraPPr
Collaborator

laraPPr commented Jan 28, 2025

@Neves-P Will this impact the dev.eessi.io? If so you might need to add a reframe config file to that bot as well.

Collaborator

@laraPPr laraPPr left a comment


lgtm

@Neves-P
Member

Neves-P commented Jan 29, 2025

@Neves-P Will this impact the dev.eessi.io? If so you might need to add a reframe config file to that bot as well.

Thanks for the heads up! At the moment I think it should be fine, because we aren't running the test command (most of the software to test isn't there yet). But from what I understand, it's something to keep in mind when we do start running the test script on dev.eessi.io.

@laraPPr laraPPr merged commit 902a20e into EESSI:2023.06-software.eessi.io Jan 29, 2025
49 checks passed

eessi-bot bot commented Jan 29, 2025

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.01/pr_850/43200', '/project/def-users/SHARED/jobs/2025.01/pr_850/43201', '/project/def-users/SHARED/jobs/2025.01/pr_850/43202', '/project/def-users/SHARED/jobs/2025.01/pr_850/43203', '/project/def-users/SHARED/jobs/2025.01/pr_850/43204', '/project/def-users/SHARED/jobs/2025.01/pr_850/43205', '/project/def-users/SHARED/jobs/2025.01/pr_850/43206', '/project/def-users/SHARED/jobs/2025.01/pr_850/43208', '/project/def-users/SHARED/jobs/2025.01/pr_850/43209', '/project/def-users/SHARED/jobs/2025.01/pr_850/43210', '/project/def-users/SHARED/jobs/2025.01/pr_850/43211', '/project/def-users/SHARED/jobs/2025.01/pr_850/43212', '/project/def-users/SHARED/jobs/2025.01/pr_850/43213', '/project/def-users/SHARED/jobs/2025.01/pr_850/43214', '/project/def-users/SHARED/jobs/2025.01/pr_850/43215', '/project/def-users/SHARED/jobs/2025.01/pr_850/43216', '/project/def-users/SHARED/jobs/2025.01/pr_850/43217', '/project/def-users/SHARED/jobs/2025.01/pr_850/43218'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.29


eessi-bot bot commented Jan 29, 2025

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.01/pr_850/68'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.01.29

@gpu-bot-ugent

gpu-bot-ugent bot commented Jan 29, 2025

PR merged! Moved [] to /scratch/gent/vo/002/gvo00211/SHARED/trash_bin/EESSI/software-layer/2025.01.29
