From 2b2f22bd8bb3ec3cb74c53f998cbd3afb8dbd651 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:18:50 -0500 Subject: [PATCH 01/22] Correct osg-ex14 to osg-ex13 --- docs/materials/osg/part1-ex3-hardware-diffs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/materials/osg/part1-ex3-hardware-diffs.md b/docs/materials/osg/part1-ex3-hardware-diffs.md index 3dd670f..b247e79 100644 --- a/docs/materials/osg/part1-ex3-hardware-diffs.md +++ b/docs/materials/osg/part1-ex3-hardware-diffs.md @@ -66,7 +66,7 @@ we will create a **new** submit file with the queue…in syntax and change the value of our parameter (`request_memory`) for each batch of jobs. 1. Log in or switch back to `ap1.facility.path-cc.io` (yes, back to PATh!) -1. Create and change into a new subdirectory called `osg-ex14` +1. Create and change into a new subdirectory called `osg-ex13` 1. Create a submit file named `sleep.sub` that executes the command `/bin/sleep 300`. !!! note @@ -108,7 +108,7 @@ Now you will do essentially the same thing on the OSPool. 1. Log in or switch to `ap40.uw.osg-htc.org` -1. Copy the `osg-ex14` directory from the [section above](#checking-chtc-memory-availability) +1. Copy the `osg-ex13` directory from the [section above](#checking-chtc-memory-availability) from `ap1.facility.path-cc.io` to `ap40.uw.osg-htc.org` If you get stuck during the copying process, refer to [OSG exercise 1.1](part1-ex1-login-scp.md). 
From 23e24196888c90a5edbe072f068721a1de50d24c Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:19:04 -0500 Subject: [PATCH 02/22] Fix first-image.sif to py-cowsay.sif --- docs/materials/software/part1-ex4-apptainer-build.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/materials/software/part1-ex4-apptainer-build.md b/docs/materials/software/part1-ex4-apptainer-build.md index 8e233a7..fed1f7e 100644 --- a/docs/materials/software/part1-ex4-apptainer-build.md +++ b/docs/materials/software/part1-ex4-apptainer-build.md @@ -93,7 +93,7 @@ allow us to test our new container. 1. Try running: :::console - $ singularity shell first-image.sif + $ singularity shell py-cowsay.sif 1. Then try running the `hello-cow.py` script: From dfb39907878b812a53c4b60c7f7fa28416f9f680 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:23:15 -0500 Subject: [PATCH 03/22] Update Blast version to 2.15 --- docs/materials/software/part4-ex1-download.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/materials/software/part4-ex1-download.md b/docs/materials/software/part4-ex1-download.md index 622ffa7..5819f6e 100644 --- a/docs/materials/software/part4-ex1-download.md +++ b/docs/materials/software/part4-ex1-download.md @@ -63,8 +63,8 @@ it. If you want to do this all from the command line, the sequence will look like this (using `wget` as the download command.) :::console - user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-linux.tar.gz - user@login $ tar -xzf ncbi-blast-2.14.0+-x64-linux.tar.gz + user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.15.0+-x64-linux.tar.gz + user@login $ tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz 1. We're going to be using the `blastx` binary in our job. Where is it in the directory you just decompressed? @@ -124,7 +124,7 @@ directory of our downloaded BLAST directory. 
We need to use the `arguments` line in the submit file to express the rest of the command. :::file - executable = ncbi-blast-2.13.0+/bin/blastx + executable = ncbi-blast-2.15.0+/bin/blastx arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt * The BLAST program requires our input file and database, so they From aeb595e482b96d5f7c32c55329394b3767acb7c4 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:24:38 -0500 Subject: [PATCH 04/22] Update Blast version to 2.15 --- docs/materials/software/part4-ex2-wrapper.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/materials/software/part4-ex2-wrapper.md b/docs/materials/software/part4-ex2-wrapper.md index 3ed5386..c297ed4 100644 --- a/docs/materials/software/part4-ex2-wrapper.md +++ b/docs/materials/software/part4-ex2-wrapper.md @@ -34,7 +34,7 @@ Our wrapper script will be a bash script that runs several commands. :::bash #!/bin/bash - ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt + ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt Submit File Changes @@ -53,7 +53,7 @@ We now need to make some changes to our submit file. 1. Note that since the `blastx` program is no longer listed as the executable, it will be need to be included in `transfer_input_files`. Instead of transferring just that program, we will transfer the original downloaded `tar.gz` file. To achieve efficiency, we'll also transfer the pdbaa database as the original `tar.gz` file instead of as the unzipped folder: :::console - transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.12.0+-x64-linux.tar.gz + transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz 1. If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual "Usage" values in the log. 
@@ -73,10 +73,10 @@ Now that our database and BLAST software are being transferred to the job as `ta :::bash #!/bin/bash - tar -xzf ncbi-blast-2.12.0+-x64-linux.tar.gz + tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz - ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt + ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt 1. While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: From 65c5fa6360fba4f9a9291608c79920e85e29e3a5 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:25:55 -0500 Subject: [PATCH 05/22] Update Blast version to 2.15 --- docs/materials/software/part4-ex3-arguments.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/materials/software/part4-ex3-arguments.md b/docs/materials/software/part4-ex3-arguments.md index d412849..6d71fbd 100644 --- a/docs/materials/software/part4-ex3-arguments.md +++ b/docs/materials/software/part4-ex3-arguments.md @@ -50,7 +50,7 @@ and third arguments, respectively. Thus, in the main command of the script, replace the various names with these variables: :::bash - ncbi-blast-2.12.0+/bin/blastx -db $1/$1 -query $2 -out $3 + ncbi-blast-2.15.0+/bin/blastx -db $1/$1 -query $2 -out $3 > If your wrapper script is in a different language, you should use that language's syntax for reading in variables from the command line. @@ -71,12 +71,12 @@ One of the downsides of this approach, is that our command has become harder to read. 
The original script contains all the information at a glance: :::bash - ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt + ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt But our new version is more cryptic -- what is `$1`?: :::bash - ncbi-blast-2.10.1+/bin/blastx -db $1 -query $2 -out $3 + ncbi-blast-2.15.1+/bin/blastx -db $1 -query $2 -out $3 One way to overcome this is to create our own variable names inside the wrapper script and assign the argument values to them. Here is an example for our @@ -89,10 +89,10 @@ BLAST script: INFILE=$2 OUTFILE=$3 - tar -xzf ncbi-blast-2.10.1+-x64-linux.tar.gz + tar -xzf ncbi-blast-2.15.1+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz - ncbi-blast-2.10.1+/bin/blastx -db $DATABASE/$DATABASE -query $INFILE -out $OUTFILE + ncbi-blast-2.15.1+/bin/blastx -db $DATABASE/$DATABASE -query $INFILE -out $OUTFILE Here, we are assigning the input arguments (`$1`, `$2` and `$3`) to new variable names, and then using **those** names (`$DATABASE`, `$INFILE`, and `$OUTFILE`) in the command, From eea4f7111353310931e1fd4310023e91f96840f0 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:28:11 -0500 Subject: [PATCH 06/22] Update HMMER version to 3.4 --- docs/materials/software/part5-ex1-prepackaged.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/materials/software/part5-ex1-prepackaged.md b/docs/materials/software/part5-ex1-prepackaged.md index 7732292..afe8eb1 100644 --- a/docs/materials/software/part5-ex1-prepackaged.md +++ b/docs/materials/software/part5-ex1-prepackaged.md @@ -7,7 +7,7 @@ status: testing Software Exercise 5.1: Pre-package a Research Code ========================================== -**Objective**: Install software (HHMER) to a folder and run it in a job using a wrapper script. +**Objective**: Install software (HMMER) to a folder and run it in a job using a wrapper script. 
**Why learn this?**: If not using a container, this is a template for how to create a portable software installation using your own files, especially if the software @@ -45,7 +45,7 @@ for this example, we are going to compile directly on the Access Point :::console username@host $ tar -zxf hmmer.tar.gz - username@host $ cd hmmer-3.3.2 + username@host $ cd hmmer-3.4 1. Now we can follow the second set of installation instructions. For the prefix, we'll use the variable `$PWD` to capture the name of our current working directory and then a relative path to the `hmmer-build` directory we created in step 1: @@ -112,7 +112,7 @@ We're almost ready! We need two more pieces to run a HMMER job. run the job. You already have these files back in the directory where you unpacked the source code: :::console - username@ap1 $ ls hmmer-3.3.2/tutorial + username@ap1 $ ls hmmer-3.4/tutorial 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto @@ -124,7 +124,7 @@ run the job. You already have these files back in the directory where you unpack :::file executable = run_hmmer.sh - transfer_input_files = hmmer-build.tar.gz, hmmer-3.3.2/tutorial/ + transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/ A wrapper script will always be a job's `executable`. When using a wrapper script, you must also always remember to transfer the software/source code using From bcd9ecfe5331f6ffc8eedfc8eca5ab2cb4596ba3 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:31:06 -0500 Subject: [PATCH 07/22] Update Blast version to 2.15 --- docs/materials/data/part1-ex1-data-needs.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/materials/data/part1-ex1-data-needs.md b/docs/materials/data/part1-ex1-data-needs.md index 6290da0..30266f7 100644 --- a/docs/materials/data/part1-ex1-data-needs.md +++ b/docs/materials/data/part1-ex1-data-needs.md @@ -35,8 +35,8 @@ genome information. 
1. Copy the BLAST executables: :::console - user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.12.0+-x64-linux.tar.gz - user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz + user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.15.0+-x64-linux.tar.gz + user@ap40 $ tar -xzvf ncbi-blast-2.15.0+-x64-linux.tar.gz 1. Download these files to your current directory: @@ -55,7 +55,7 @@ Understanding BLAST Remember that `blastx` is executed in a command like the following: ``` console -user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db -query -out +user@ap40 $ ./ncbi-blast-2.15.0+/bin/blastx -db -query -out ``` In the above, the `` is the name of a file containing a number of genetic sequences (e.g. `mouse.fa`), and From b4b24f12ba12427089d6a030b6431e5898f12dc3 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Tue, 25 Jun 2024 17:32:12 -0500 Subject: [PATCH 08/22] Grammatical typo --- docs/materials/data/part1-ex2-file-transfer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/materials/data/part1-ex2-file-transfer.md b/docs/materials/data/part1-ex2-file-transfer.md index 9ce65e6..f3133f7 100644 --- a/docs/materials/data/part1-ex2-file-transfer.md +++ b/docs/materials/data/part1-ex2-file-transfer.md @@ -209,7 +209,7 @@ destination `science-results/mouse.fa.result`. Remember that the `transfer_output_remaps` value requires double quotes around it. -Submit the job, and wait for it to complete. Was there +Submit the job, and wait for it to complete. Are there any errors? Can you find mouse.fa.result? 
Conclusions From ebaaa35be21b0a5a980d47c108738aa845601b1e Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 27 Jun 2024 12:07:07 -0500 Subject: [PATCH 09/22] Update ap1 on PATh to ap40 on OSPool --- docs/materials/htcondor/part1-ex1-login.md | 38 +++++++++---------- docs/materials/htcondor/part1-ex2-commands.md | 31 ++++++++------- docs/materials/htcondor/part1-ex3-jobs.md | 16 ++++---- 3 files changed, 43 insertions(+), 42 deletions(-) diff --git a/docs/materials/htcondor/part1-ex1-login.md b/docs/materials/htcondor/part1-ex1-login.md index a1b0d1f..70c2c0b 100644 --- a/docs/materials/htcondor/part1-ex1-login.md +++ b/docs/materials/htcondor/part1-ex1-login.md @@ -12,32 +12,32 @@ Background There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. -Durring the OSG School, you will practice on two different HTC systems: the "[PATh Facility](https://path-cc.io/facility/)" and "[OSG's Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html)". This will help prepare you for working on a variety of different HTC systems. +Durring the OSG School, you will practice on the [OSG's Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html). We work with a variety of different HTC systems. -* PATh Facility: The PATh Facility provides researchers with **dedicated HTC resources and the ability to run larger and longer jobs**. The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. 
-* OSG's Open Science Pool: The OSPool provides researchers with **opportunitistic resources and the ability to run many smaller and shorter jobs silmnulatinously**. The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. +* [PATh Facility](https://path-cc.io/facility/): The PATh Facility provides researchers with **dedicated HTC resources and the ability to run larger and longer jobs**. The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. +* [OSG's Open Science Pool](https://osg-htc.org/services/open_science_pool.html): The OSPool provides researchers with **opportunitistic resources and the ability to run many smaller and shorter jobs simultaneously**. The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. Exercise Goal --- -The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. +The goal of this first exercise is to log in to the OSPool access point and look around a little bit, which will take only a few minutes. **If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises.** Logging In ---------- -Today, you will use a High Throughput Computing system known as the "[PATh Facility](https://path-cc.io/facility/)". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. +Today, you will use a High Throughput Computing system known as the [Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html). The Open Science Pool provides users with opportunistic resources and shorter runtimes than PATh Facility. -You will login to the access point of the PATh Facility, which is called `ap1.facility.path-cc.io` using the username you previously created. 
+You will login to the access point of the OSPool, which is called `ap40.uw.osg-htc.org` using the username you previously created. To log in, use a [Secure Shell](http://en.wikipedia.org/wiki/Secure_Shell) (SSH) client. - From a Mac or Linux computer, start the Terminal app and run the below `ssh` command, replacing with your username: ``` hl_lines="1" -$ ssh @ap1.facility.path-cc.io +$ ssh @ap40.uw.osg-htc.org ``` - On Windows, we recommend a free client called [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/), @@ -51,12 +51,12 @@ Running Commands In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: ``` console -username@ap1 $ hostname -ap1.facility.path-cc.io +username@ap40 $ hostname +ap40.uw.osg-htc.org ``` !!! note - In the first line of the example above, the `username@ap1 $` part is meant to show the Linux command-line prompt. + In the first line of the example above, the `username@ap40 $` part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters `hostname`. The second line of the example, without the prompt, shows the output of the command; you do not type this part, @@ -65,9 +65,9 @@ ap1.facility.path-cc.io Here are a few other commands that you can try (the examples below do not show the output from each command): ``` console -username@ap1 $ whoami -username@ap1 $ date -username@ap1 $ uname -a +username@ap40 $ whoami +username@ap40 $ date +username@ap40 $ uname -a ``` A suggestion for the day: try typing into the command line as many of the commands as you can. 
@@ -81,8 +81,8 @@ You will be doing many different exercises over the next few days, many of them For instance, for the rest of this exercise, you may wish to create and use a directory named `intro-1.1-login`, or something like that. ``` console -username@ap1 $ mkdir intro-1.1-login -username@ap1 $ cd intro-1.1-login +username@ap40 $ mkdir intro-1.1-login +username@ap40 $ cd intro-1.1-login ``` Showing the Version of HTCondor @@ -91,12 +91,12 @@ Showing the Version of HTCondor HTCondor is installed on this server. But what version? You can ask HTCondor itself: ``` console -username@ap1 $ condor_version -$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ -$CondorPlatform: x86_64_AlmaLinux8 $ +username@ap40 $ condor_version +$CondorVersion: 23.8.0 2024-05-27 BuildID: 735879 PackageID: 23.8.0-0.735879 GitSHA: 26d1081b RC $ +$CondorPlatform: x86_64_AlmaLinux9 $ ``` -As you can see from the output, we are using HTCondor 10.7.0. +As you can see from the output, we are using HTCondor 23.8.0. Reference Materials diff --git a/docs/materials/htcondor/part1-ex2-commands.md b/docs/materials/htcondor/part1-ex2-commands.md index 47dca37..ad862ae 100644 --- a/docs/materials/htcondor/part1-ex2-commands.md +++ b/docs/materials/htcondor/part1-ex2-commands.md @@ -23,7 +23,7 @@ As discussed in the lecture, the `condor_status` command is used to view the cur At its most basic, the command is: ``` console -username@ap1 $ condor_status +username@ap40 $ condor_status ``` When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. 
**TIP: You can widen your terminal window, which may help you to see all details of the output better.** @@ -31,17 +31,17 @@ When running this command, there is typically a lot of output printed to the scr *Here is some example output (what you see will be longer):* ``` console -slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 -slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 -slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 -slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 +slot1_37@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy +slot1_38@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy +slot1_39@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy +slot1_40@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy ``` This output consists of 8 columns: | Col | Example | Meaning | |:-----------|:-----------------------------|:------------------------------------------------------------------------------------------------------------------------| -| Name | `slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n` | Full slot name (including the hostname) | +| Name | `slot1_37@glidein_83184_146090973@z3011.hyak.local` | Full slot name (including the hostname) | | OpSys | `LINUX` | Operating system | | Arch | `X86_64` | Slot architecture (e.g., Intel 64 bit) | | State | `Claimed` | State of the slot (`Unclaimed` is available, `Owner` is being used by the machine owner, `Claimed` is matched to a job) | @@ -53,12 +53,11 @@ This output consists of 8 columns: At the end of the slot listing, there is a summary. 
Here is an example: ``` console - Machines Owner Claimed Unclaimed Matched Preempting Drain + Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle - X86_64/LINUX 10831 0 10194 631 0 0 6 - X86_64/WINDOWS 2 2 0 0 0 0 0 +X86_64/LINUX 36913 0 32340 4565 0 8 0 0 0 - Total 10833 2 10194 631 0 0 6 + Total 36913 0 32340 4565 0 8 0 0 0 ``` There is one row of summary for each machine (i.e. "slot") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool. @@ -74,7 +73,7 @@ There is one row of summary for each machine (i.e. "slot") architecture/operatin Also try out the `-compact` for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. ``` console -username@ap1 $ condor_status -compact +username@ap40 $ condor_status -compact ``` **How has the column information changed?** @@ -89,13 +88,13 @@ The `condor_q` command lists jobs that are on this access point machine and that The default behavior of the command lists only your jobs: ``` console -username@ap1 $ condor_q +username@ap40 $ condor_q ``` The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set ("batch") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: ``` console --- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 09:59:31 +-- Schedd: ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 06/26/24 16:41:08 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 ``` @@ -132,7 +131,7 @@ It shows total counts of jobs in the different possible states. By default, the `condor_q` command shows **your** jobs only. 
To see everyone’s jobs that are queued on the machine, add the `-all` option: ``` console -username@ap1 $ condor_q -all +username@ap40 $ condor_q -all ``` - How many jobs are queued in total (i.e., running or waiting to run)? @@ -143,13 +142,13 @@ username@ap1 $ condor_q -all The `condor_q` output, by default, groups "batches" of jobs together (if they were submitted with the same submit file or "jobbatchname"). To see more information for EVERY job on a separate line of output, use the `-nobatch` option to `condor_q`: ``` console -username@ap1 $ condor_q -all -nobatch +username@ap40 $ condor_q -all -nobatch ``` **How has the column information changed?** (Below is an example of the top of the output.) ``` console --- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 11:58:44 +-- Schedd: ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 06/26/24 16:41:08 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal diff --git a/docs/materials/htcondor/part1-ex3-jobs.md b/docs/materials/htcondor/part1-ex3-jobs.md index c2f138c..e72888f 100644 --- a/docs/materials/htcondor/part1-ex3-jobs.md +++ b/docs/materials/htcondor/part1-ex3-jobs.md @@ -10,7 +10,7 @@ HTC Exercise 1.3: Run Jobs! Exercise Goal ------------- -The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! +The goal of this exercise is to submit jobs to HTCondor and have them run on the OSPool. This is a huge step in learning to use an HTC system! **This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! 
It is very important that you know how to run jobs.** @@ -28,6 +28,8 @@ output = hostname.out error = hostname.err log = hostname.log +requirements = (OSGVO_OS_STRING == "UBUNTU 20" || OSGVO_OS_STRING == "UBUNTU 22") + request_cpus = 1 request_memory = 1GB request_disk = 1GB @@ -58,7 +60,7 @@ Note that we are not using the `arguments` or `transfer_input_files` lines that Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: ``` console -username@ap1 $ condor_submit hostname.sub +username@ap40 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. ``` @@ -76,7 +78,7 @@ You may not even catch the job in the `R` running state, because the `hostname` After the job finishes, check for the `hostname` output in `hostname.out`, which is where job information printed to the terminal screen will be printed for the job. ``` console -username@ap1 $ cat hostname.out +username@ap40 $ cat hostname.out e171.chtc.wisc.edu ``` @@ -87,13 +89,13 @@ The `hostname.err` file should be empty, unless there were issues running the `h Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: ``` console -username@ap1 $ sleep 60 +username@ap40 $ sleep 60 ``` In an HTCondor submit file, the program (or 'executable') name goes in the `executable` statement and **all remaining arguments** go into an `arguments` statement. For example, if the full command is: ``` console -username@ap1 $ sleep 60 +username@ap40 $ sleep 60 ``` Then in the submit file, we would put the location of the "sleep" program (you can find it with `which sleep`) as the job `executable`, and `60` as the job `arguments`: @@ -154,12 +156,12 @@ or perhaps a shell script of commands that you'd like to run within a job. In th 1. 
Add executable permissions to the file (so that it can be run as a program): :::console - username@ap1 $ chmod +x test-script.sh + username@ap40 $ chmod +x test-script.sh 1. Test your script from the command line: :::console - username@ap1 $ ./test-script.sh hello 42 + username@ap40 $ ./test-script.sh hello 42 Date: Mon Jul 17 10:02:20 CDT 2017 Host: learn.chtc.wisc.edu System: Linux x86_64 GNU/Linux From bd3ddfa9933e9392bb5185d4b241da04de611e17 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 28 Jun 2024 10:07:19 -0500 Subject: [PATCH 10/22] Revert "Update ap1 on PATh to ap40 on OSPool" This reverts commit ebaaa35be21b0a5a980d47c108738aa845601b1e. --- docs/materials/htcondor/part1-ex1-login.md | 38 +++++++++---------- docs/materials/htcondor/part1-ex2-commands.md | 31 +++++++-------- docs/materials/htcondor/part1-ex3-jobs.md | 16 ++++---- 3 files changed, 42 insertions(+), 43 deletions(-) diff --git a/docs/materials/htcondor/part1-ex1-login.md b/docs/materials/htcondor/part1-ex1-login.md index 70c2c0b..a1b0d1f 100644 --- a/docs/materials/htcondor/part1-ex1-login.md +++ b/docs/materials/htcondor/part1-ex1-login.md @@ -12,32 +12,32 @@ Background There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. -Durring the OSG School, you will practice on the [OSG's Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html). We work with a variety of different HTC systems. 
+Durring the OSG School, you will practice on two different HTC systems: the "[PATh Facility](https://path-cc.io/facility/)" and "[OSG's Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html)". This will help prepare you for working on a variety of different HTC systems. -* [PATh Facility](https://path-cc.io/facility/): The PATh Facility provides researchers with **dedicated HTC resources and the ability to run larger and longer jobs**. The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. -* [OSG's Open Science Pool](https://osg-htc.org/services/open_science_pool.html): The OSPool provides researchers with **opportunitistic resources and the ability to run many smaller and shorter jobs simultaneously**. The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. +* PATh Facility: The PATh Facility provides researchers with **dedicated HTC resources and the ability to run larger and longer jobs**. The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. +* OSG's Open Science Pool: The OSPool provides researchers with **opportunitistic resources and the ability to run many smaller and shorter jobs silmnulatinously**. The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. Exercise Goal --- -The goal of this first exercise is to log in to the OSPool access point and look around a little bit, which will take only a few minutes. +The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. **If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises.** Logging In ---------- -Today, you will use a High Throughput Computing system known as the [Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html). 
The Open Science Pool provides users with opportunistic resources and shorter runtimes than PATh Facility. +Today, you will use a High Throughput Computing system known as the "[PATh Facility](https://path-cc.io/facility/)". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. -You will login to the access point of the OSPool, which is called `ap40.uw.osg-htc.org` using the username you previously created. +You will login to the access point of the PATh Facility, which is called `ap1.facility.path-cc.io` using the username you previously created. To log in, use a [Secure Shell](http://en.wikipedia.org/wiki/Secure_Shell) (SSH) client. - From a Mac or Linux computer, start the Terminal app and run the below `ssh` command, replacing with your username: ``` hl_lines="1" -$ ssh @ap40.uw.osg-htc.org +$ ssh @ap1.facility.path-cc.io ``` - On Windows, we recommend a free client called [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/), @@ -51,12 +51,12 @@ Running Commands In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: ``` console -username@ap40 $ hostname -ap40.uw.osg-htc.org +username@ap1 $ hostname +ap1.facility.path-cc.io ``` !!! note - In the first line of the example above, the `username@ap40 $` part is meant to show the Linux command-line prompt. + In the first line of the example above, the `username@ap1 $` part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters `hostname`. 
The second line of the example, without the prompt, shows the output of the command; you do not type this part, @@ -65,9 +65,9 @@ ap40.uw.osg-htc.org Here are a few other commands that you can try (the examples below do not show the output from each command): ``` console -username@ap40 $ whoami -username@ap40 $ date -username@ap40 $ uname -a +username@ap1 $ whoami +username@ap1 $ date +username@ap1 $ uname -a ``` A suggestion for the day: try typing into the command line as many of the commands as you can. @@ -81,8 +81,8 @@ You will be doing many different exercises over the next few days, many of them For instance, for the rest of this exercise, you may wish to create and use a directory named `intro-1.1-login`, or something like that. ``` console -username@ap40 $ mkdir intro-1.1-login -username@ap40 $ cd intro-1.1-login +username@ap1 $ mkdir intro-1.1-login +username@ap1 $ cd intro-1.1-login ``` Showing the Version of HTCondor @@ -91,12 +91,12 @@ Showing the Version of HTCondor HTCondor is installed on this server. But what version? You can ask HTCondor itself: ``` console -username@ap40 $ condor_version -$CondorVersion: 23.8.0 2024-05-27 BuildID: 735879 PackageID: 23.8.0-0.735879 GitSHA: 26d1081b RC $ -$CondorPlatform: x86_64_AlmaLinux9 $ +username@ap1 $ condor_version +$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ +$CondorPlatform: x86_64_AlmaLinux8 $ ``` -As you can see from the output, we are using HTCondor 23.8.0. +As you can see from the output, we are using HTCondor 10.7.0. 
Reference Materials diff --git a/docs/materials/htcondor/part1-ex2-commands.md b/docs/materials/htcondor/part1-ex2-commands.md index ad862ae..47dca37 100644 --- a/docs/materials/htcondor/part1-ex2-commands.md +++ b/docs/materials/htcondor/part1-ex2-commands.md @@ -23,7 +23,7 @@ As discussed in the lecture, the `condor_status` command is used to view the cur At its most basic, the command is: ``` console -username@ap40 $ condor_status +username@ap1 $ condor_status ``` When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. **TIP: You can widen your terminal window, which may help you to see all details of the output better.** @@ -31,17 +31,17 @@ When running this command, there is typically a lot of output printed to the scr *Here is some example output (what you see will be longer):* ``` console -slot1_37@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy -slot1_38@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy -slot1_39@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy -slot1_40@glidein_83184_146090973@z3011.hyak.local LINUX X86_64 Claimed Busy +slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 +slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 +slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 +slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 ``` This output consists of 8 columns: | Col | Example | Meaning | |:-----------|:-----------------------------|:------------------------------------------------------------------------------------------------------------------------| -| Name | `slot1_37@glidein_83184_146090973@z3011.hyak.local` | Full slot name (including the hostname) | +| Name | 
`slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n` | Full slot name (including the hostname) | | OpSys | `LINUX` | Operating system | | Arch | `X86_64` | Slot architecture (e.g., Intel 64 bit) | | State | `Claimed` | State of the slot (`Unclaimed` is available, `Owner` is being used by the machine owner, `Claimed` is matched to a job) | @@ -53,11 +53,12 @@ This output consists of 8 columns: At the end of the slot listing, there is a summary. Here is an example: ``` console - Total Owner Claimed Unclaimed Matched Preempting Drain Backfill BkIdle + Machines Owner Claimed Unclaimed Matched Preempting Drain -X86_64/LINUX 36913 0 32340 4565 0 8 0 0 0 + X86_64/LINUX 10831 0 10194 631 0 0 6 + X86_64/WINDOWS 2 2 0 0 0 0 0 - Total 36913 0 32340 4565 0 8 0 0 0 + Total 10833 2 10194 631 0 0 6 ``` There is one row of summary for each machine (i.e. "slot") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool. @@ -73,7 +74,7 @@ There is one row of summary for each machine (i.e. "slot") architecture/operatin Also try out the `-compact` for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. ``` console -username@ap40 $ condor_status -compact +username@ap1 $ condor_status -compact ``` **How has the column information changed?** @@ -88,13 +89,13 @@ The `condor_q` command lists jobs that are on this access point machine and that The default behavior of the command lists only your jobs: ``` console -username@ap40 $ condor_q +username@ap1 $ condor_q ``` The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set ("batch") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: ``` console --- Schedd: ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 06/26/24 16:41:08 +-- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 09:59:31 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 ``` @@ -131,7 +132,7 @@ It shows total counts of jobs in the different possible states. By default, the `condor_q` command shows **your** jobs only. To see everyone’s jobs that are queued on the machine, add the `-all` option: ``` console -username@ap40 $ condor_q -all +username@ap1 $ condor_q -all ``` - How many jobs are queued in total (i.e., running or waiting to run)? @@ -142,13 +143,13 @@ username@ap40 $ condor_q -all The `condor_q` output, by default, groups "batches" of jobs together (if they were submitted with the same submit file or "jobbatchname"). To see more information for EVERY job on a separate line of output, use the `-nobatch` option to `condor_q`: ``` console -username@ap40 $ condor_q -all -nobatch +username@ap1 $ condor_q -all -nobatch ``` **How has the column information changed?** (Below is an example of the top of the output.) ``` console --- Schedd: ap40.uw.osg-htc.org : <128.105.68.62:9618?... @ 06/26/24 16:41:08 +-- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 11:58:44 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal diff --git a/docs/materials/htcondor/part1-ex3-jobs.md b/docs/materials/htcondor/part1-ex3-jobs.md index e72888f..c2f138c 100644 --- a/docs/materials/htcondor/part1-ex3-jobs.md +++ b/docs/materials/htcondor/part1-ex3-jobs.md @@ -10,7 +10,7 @@ HTC Exercise 1.3: Run Jobs! Exercise Goal ------------- -The goal of this exercise is to submit jobs to HTCondor and have them run on the OSPool. This is a huge step in learning to use an HTC system! +The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! **This exercise will take longer than the first two, short ones. 
If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs.** @@ -28,8 +28,6 @@ output = hostname.out error = hostname.err log = hostname.log -requirements = (OSGVO_OS_STRING == "UBUNTU 20" || OSGVO_OS_STRING == "UBUNTU 22") - request_cpus = 1 request_memory = 1GB request_disk = 1GB @@ -60,7 +58,7 @@ Note that we are not using the `arguments` or `transfer_input_files` lines that Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: ``` console -username@ap40 $ condor_submit hostname.sub +username@ap1 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. ``` @@ -78,7 +76,7 @@ You may not even catch the job in the `R` running state, because the `hostname` After the job finishes, check for the `hostname` output in `hostname.out`, which is where job information printed to the terminal screen will be printed for the job. ``` console -username@ap40 $ cat hostname.out +username@ap1 $ cat hostname.out e171.chtc.wisc.edu ``` @@ -89,13 +87,13 @@ The `hostname.err` file should be empty, unless there were issues running the `h Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: ``` console -username@ap40 $ sleep 60 +username@ap1 $ sleep 60 ``` In an HTCondor submit file, the program (or 'executable') name goes in the `executable` statement and **all remaining arguments** go into an `arguments` statement. For example, if the full command is: ``` console -username@ap40 $ sleep 60 +username@ap1 $ sleep 60 ``` Then in the submit file, we would put the location of the "sleep" program (you can find it with `which sleep`) as the job `executable`, and `60` as the job `arguments`: @@ -156,12 +154,12 @@ or perhaps a shell script of commands that you'd like to run within a job. In th 1. 
Add executable permissions to the file (so that it can be run as a program): :::console - username@ap40 $ chmod +x test-script.sh + username@ap1 $ chmod +x test-script.sh 1. Test your script from the command line: :::console - username@ap40 $ ./test-script.sh hello 42 + username@ap1 $ ./test-script.sh hello 42 Date: Mon Jul 17 10:02:20 CDT 2017 Host: learn.chtc.wisc.edu System: Linux x86_64 GNU/Linux From 3f7ba6a1db2981d7ae534581a072dbbc6acab7e4 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 13:58:10 -0500 Subject: [PATCH 11/22] Update returned statements from commands --- docs/materials/htcondor/part1-ex1-login.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/materials/htcondor/part1-ex1-login.md b/docs/materials/htcondor/part1-ex1-login.md index a1b0d1f..94a00b1 100644 --- a/docs/materials/htcondor/part1-ex1-login.md +++ b/docs/materials/htcondor/part1-ex1-login.md @@ -52,7 +52,7 @@ In the exercises, we will show commands that you are supposed to type or copy in ``` console username@ap1 $ hostname -ap1.facility.path-cc.io +path-ap2001 ``` !!! note @@ -92,7 +92,7 @@ HTCondor is installed on this server. But what version? 
You can ask HTCondor its ``` console username@ap1 $ condor_version -$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ +$CondorVersion: 23.9.0 2024-06-27 BuildID: 742143 PackageID: 23.9.0-0.742143 GitSHA: 68fde429 RC $ $CondorPlatform: x86_64_AlmaLinux8 $ ``` From 732f87a01325f0d2e6ea2003eac1f59937394f3f Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 14:11:36 -0500 Subject: [PATCH 12/22] Update returned statements from commands --- docs/materials/htcondor/part1-ex3-jobs.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/materials/htcondor/part1-ex3-jobs.md b/docs/materials/htcondor/part1-ex3-jobs.md index c2f138c..e6bbf28 100644 --- a/docs/materials/htcondor/part1-ex3-jobs.md +++ b/docs/materials/htcondor/part1-ex3-jobs.md @@ -159,13 +159,13 @@ or perhaps a shell script of commands that you'd like to run within a job. In th 1. Test your script from the command line: :::console - username@ap1 $ ./test-script.sh hello 42 - Date: Mon Jul 17 10:02:20 CDT 2017 - Host: learn.chtc.wisc.edu - System: Linux x86_64 GNU/Linux + username@ap1 $ ./test-script.sh hello 42 + Date: Mon Jul 1 14:03:56 CDT 2024 + Host: path-ap2001 + System: Linux x86_64 GNU/Linux Program: ./test-script.sh Args: hello 42 - ls: hostname.sub montage hostname.err hostname.log hostname.out test-script.sh + ls: hostname.err hostname.log hostname.out hostname.sub sleep.log sleep.sub test-script.sh This step is **really** important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. Further, debugging problems like this one is surprisingly difficult. 
From f990ea2325aa785533c11b0578ef2518bb272d47 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 14:40:07 -0500 Subject: [PATCH 13/22] Updated dates to 2024 --- docs/materials/htcondor/part1-ex4-logs.md | 28 +++++++++++------------ 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/materials/htcondor/part1-ex4-logs.md b/docs/materials/htcondor/part1-ex4-logs.md index d5a1bc7..d8dc8d7 100644 --- a/docs/materials/htcondor/part1-ex4-logs.md +++ b/docs/materials/htcondor/part1-ex4-logs.md @@ -23,15 +23,15 @@ For this exercise, we can examine a log file for any previous job that you have A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are **all** of the event headings from the `sleep` job log (detailed output in between headings has been omitted here): ``` file -000 (5739.000.000) 2023-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> -040 (5739.000.000) 2023-07-10 10:45:10 Started transferring input files -040 (5739.000.000) 2023-07-10 10:45:10 Finished transferring input files -001 (5739.000.000) 2023-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> -006 (5739.000.000) 2023-07-10 10:45:20 Image size of job updated: 72 -040 (5739.000.000) 2023-07-10 10:45:20 Started transferring output files -040 (5739.000.000) 2023-07-10 10:45:20 Finished transferring output files -006 (5739.000.000) 2023-07-10 10:46:11 Image size of job updated: 4072 -005 (5739.000.000) 2023-07-10 10:46:11 Job terminated. 
+000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> +040 (5739.000.000) 2024-07-10 10:45:10 Started transferring input files +040 (5739.000.000) 2024-07-10 10:45:10 Finished transferring input files +001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> +006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 +040 (5739.000.000) 2024-07-10 10:45:20 Started transferring output files +040 (5739.000.000) 2024-07-10 10:45:20 Finished transferring output files +006 (5739.000.000) 2024-07-10 10:46:11 Image size of job updated: 4072 +005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. ``` There is a lot of extra information in those lines, but you can see: @@ -43,7 +43,7 @@ There is a lot of extra information in those lines, but you can see: Some events provide no information in addition to the heading. For example: ``` file -000 (5739.000.000) 2020-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> +000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> ... ``` @@ -53,7 +53,7 @@ Some events provide no information in addition to the heading. For example: However, some lines have additional information to help you quickly understand where and how your job is running. 
For example: ``` file -001 (5739.000.000) 2020-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> +001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w CondorScratchDir = "/pilot/osgvo-pilot-2q71K9/execute/dir_9316" Cpus = 1 @@ -70,7 +70,7 @@ However, some lines have additional information to help you quickly understand w Another example of is the periodic update: ``` file -006 (5739.000.000) 2020-07-10 10:45:20 Image size of job updated: 72 +006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 1 - MemoryUsage of job (MB) 72 - ResidentSetSize of job (KB) ... @@ -81,7 +81,7 @@ These updates record the amount of memory that the job is using on the execute m The job termination event includes a lot of very useful information: ``` file -005 (5739.000.000) 2023-07-10 10:46:11 Job terminated. +005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage @@ -95,7 +95,7 @@ The job termination event includes a lot of very useful information: Cpus : 1 1 Disk (KB) : 40 30 4203309 Memory (MB) : 1 1 1 -Job terminated of its own accord at 2023-07-10 10:46:11 with exit-code 0. +Job terminated of its own accord at 2024-07-10 10:46:11 with exit-code 0. ... 
``` From 8cc17decc835e5c7bd762cfa9eecd5446019be9a Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 14:50:05 -0500 Subject: [PATCH 14/22] Update learn to ap1 --- docs/materials/htcondor/part1-ex5-request.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/materials/htcondor/part1-ex5-request.md b/docs/materials/htcondor/part1-ex5-request.md index 7a57bb2..84cd9e6 100644 --- a/docs/materials/htcondor/part1-ex5-request.md +++ b/docs/materials/htcondor/part1-ex5-request.md @@ -44,7 +44,7 @@ On Mac and Windows, for example, the "Activity Monitor" and "Task Manager" appli Using `ps`: ``` console -username@learn $ ps ux +username@ap1 $ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash From 2f508643a6f92319c367023a7e5cb02c0b25ca4e Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 16:11:19 -0500 Subject: [PATCH 15/22] Update learn to ap1 --- docs/materials/htcondor/part1-ex7-compile.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/materials/htcondor/part1-ex7-compile.md b/docs/materials/htcondor/part1-ex7-compile.md index 94a32c7..1bf0788 100644 --- a/docs/materials/htcondor/part1-ex7-compile.md +++ b/docs/materials/htcondor/part1-ex7-compile.md @@ -48,19 +48,19 @@ Save that code to a file, for example, `simple.c`. Compile the program with static linking: ``` console -username@learn $ gcc -static -o simple simple.c +username@ap1 $ gcc -static -o simple simple.c ``` As always, test that you can run your command from the command line first. 
First, without arguments to make sure it fails correctly: ``` console -username@learn $ ./simple +username@ap1 $ ./simple ``` and then with valid arguments: ``` console -username@learn $ ./simple 5 21 +username@ap1 $ ./simple 5 21 ``` Running a Compiled C Program From f831bde3d92fdc7b15b0df84ca637363f052119c Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 16:20:23 -0500 Subject: [PATCH 16/22] Update learn to ap1; added links to HTCondor docs --- docs/materials/htcondor/part1-ex8-queue.md | 26 +++++++++++----------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/materials/htcondor/part1-ex8-queue.md b/docs/materials/htcondor/part1-ex8-queue.md index 0cfc079..d6cea46 100644 --- a/docs/materials/htcondor/part1-ex8-queue.md +++ b/docs/materials/htcondor/part1-ex8-queue.md @@ -17,32 +17,32 @@ Selecting Jobs The `condor_q` program has many options for selecting which jobs are listed. You have already seen that the default mode is to show only your jobs in "batch" mode: ``` console -username@learn $ condor_q +username@ap1 $ condor_q ``` You've seen that you can view all jobs (all users) in the submit node's queue by using the `-all` argument: ``` console -username@learn $ condor_q -all +username@ap1 $ condor_q -all ``` And you've seen that you can view more details about queued jobs, with each separate job on a single line using the `-nobatch` option: ``` console -username@learn $ condor_q -nobatch -username@learn $ condor_q -all -nobatch +username@ap1 $ condor_q -nobatch +username@ap1 $ condor_q -all -nobatch ``` Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? 
``` console -username@learn $ condor_q +username@ap1 $ condor_q ``` To list just the jobs associated with a single cluster number: ``` console -username@learn $ condor_q +username@ap1 $ condor_q ``` For example, if you want to see the jobs in cluster 5678 (i.e., `5678.0`, `5678.1`, etc.), you use `condor_q 5678`. @@ -50,7 +50,7 @@ For example, if you want to see the jobs in cluster 5678 (i.e., `5678.0`, `5678. To list a specific job (i.e., cluster.process, as in 5678.0): ``` console -username@learn $ condor_q +username@ap1 $ condor_q ``` For example, to see job ID 5678.1, you use `condor_q 5678.1`. @@ -79,7 +79,7 @@ You may have wondered why it is useful to be able to list a single job ID using If you add the `-long` option to `condor_q` (or its short form, `-l`), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80–90 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: ``` console -username@learn $ condor_q -long +username@ap1 $ condor_q -long ``` The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from `condor_q` output (except with some line breaks added to the `Requirements` attribute): @@ -138,7 +138,7 @@ Sometimes, you submit a job and it just sits in the queue in Idle state, never r To ask HTCondor why your job is not running, add the `-better-analyze` option to `condor_q` for the specific job. For example, for job ID 2423.0, the command is: ``` console -username@learn $ condor_q -better-analyze 2423.0 +username@ap1 $ condor_q -better-analyze 2423.0 ``` Of course, replace the job ID with your own. 
@@ -166,7 +166,7 @@ There is a lot of output, but a few items are worth highlighting. Here is a samp ``` file --- Schedd: learn.chtc.wisc.edu : <128.104.100.148:9618?... +-- Schedd: ap1.facility.path-cc.io : <128.105.68.66:9618?... ... Job 98096.000 defines the following attributes: @@ -215,7 +215,7 @@ There is a way to select the specific job attributes you want `condor_q` to tell To use autoformatting, use the `-af` option followed by the attribute name, for each attribute that you want to output: ``` console -username@learn $ condor_q -all -af Owner ClusterId Cmd +username@ap1 $ condor_q -all -af Owner ClusterId Cmd moate 2418 /share/test.sh cat 2421 /bin/sleep cat 2422 /bin/sleep @@ -228,7 +228,7 @@ References As suggested above, if you want to learn more about `condor_q`, you can do some reading: -- Read the `condor_q` man page or HTCondor Manual section (same text) to learn about more options -- Read about ClassAd attributes in Appendix A of the HTCondor Manual +- Read the `condor_q` man page or [HTCondor Manual section](https://htcondor.readthedocs.io/en/latest/man-pages/condor_q.html) (same text) to learn about more options +- Read about [ClassAd attributes](https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html) in the HTCondor Manual From c2c6395ba8fd883734b17d51392592151d44d949 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 1 Jul 2024 16:36:45 -0500 Subject: [PATCH 17/22] Update learn to ap1; CHTC to PATh --- docs/materials/htcondor/part1-ex9-status.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/materials/htcondor/part1-ex9-status.md b/docs/materials/htcondor/part1-ex9-status.md index b5bd884..aaa5f93 100644 --- a/docs/materials/htcondor/part1-ex9-status.md +++ b/docs/materials/htcondor/part1-ex9-status.md @@ -19,7 +19,7 @@ The `condor_status` program has many options for selecting which slots are liste Another convenient option is to list only those slots that are 
available now: ``` console -username@learn $ condor_status -avail +username@ap1 $ condor_status -avail ``` Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all `condor_status` output, not just with the `-avail` option. @@ -27,25 +27,25 @@ Of course, the individual execute machines only report their slots to the collec Similar to `condor_q`, you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: ``` console -username@learn $ condor_status +username@ap1 $ condor_status ``` For example, if you want to see the slots on `e2337.chtc.wisc.edu` (in the CHTC pool): ``` console -username@learn $ condor_status e2337.chtc.wisc.edu +username@ap1 $ condor_status e2337.chtc.wisc.edu ``` To list a specific slot on a machine: ``` console -username@learn $ condor_status @ +username@ap1 $ condor_status @ ``` For example, to see the “first” slot on the machine above: ``` console -username@learn $ condor_status slot1@e2337.chtc.wisc.edu +username@ap1 $ condor_status slot1@e2337.chtc.wisc.edu ``` !!! note @@ -68,7 +68,7 @@ Viewing a Slot ClassAd Just as with `condor_q`, you can use `condor_status` to view the complete ClassAd for a given slot (often confusingly called the “machine” ad): ``` console -username@learn $ condor_status -long @ +username@ap1 $ condor_status -long @ ``` Because slot ClassAds may have 150–200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. @@ -91,7 +91,7 @@ Memory = 1024 As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name “machine ad”) and about the slot in particular. -Go ahead and examine a machine ClassAd now. I suggest looking at one of the slots on, say, `e2337.chtc.wisc.edu` because of its relatively simple configuration. 
+Go ahead and examine a machine ClassAd now. Viewing Slots by ClassAd Expression ----------------------------------- @@ -101,7 +101,7 @@ Often, it is helpful to view slots that meet some particular criteria. For examp For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: ``` console -username@learn $ condor_status -constraint 'OpSysAndVer == "CentOS7" && Memory >= 16000' +username@ap1 $ condor_status -constraint 'OpSysAndVer == "CentOS7" && Memory >= 16000' ``` !!! note @@ -109,7 +109,7 @@ username@learn $ condor_status -constraint 'OpSysAndVer == "CentOS7" && Memory > In the example above, the single quotes (`'`) are for the shell, so that the entire expression is passed to `condor_status` untouched, and the double quotes (`"`) surround a string value within the expression itself. -Currently on CHTC, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). +Currently on PATh, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the `condor_status -constraint` command. 
@@ -125,7 +125,7 @@ The `condor_status` command accepts the same `-autoformat` (`-af`) options that
For example, I was curious about the host name and operating system of the slots with more than 32GB of memory:
``` console
-username@learn $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000'
+username@ap1 $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000'
```
If you like, spend a few minutes now or later experimenting with `condor_status` formatting.
From 320f7919e97ba4325e7ca7ca07d695a79be92036 Mon Sep 17 00:00:00 2001
From: Amber Lim
Date: Fri, 5 Jul 2024 10:51:43 -0500
Subject: [PATCH 18/22] Fix TOC
---
docs/materials/software/part3-ex1-apptainer-recipes.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/materials/software/part3-ex1-apptainer-recipes.md b/docs/materials/software/part3-ex1-apptainer-recipes.md
index 2ed1600..8b64afe 100644
--- a/docs/materials/software/part3-ex1-apptainer-recipes.md
+++ b/docs/materials/software/part3-ex1-apptainer-recipes.md
@@ -16,8 +16,8 @@ the basic options and syntax of the "build" or definition file.
|---------|
| [Bootstrap/From](#where-to-start) |
| [%files](#files-needed-for-building-or-running) |
-| [%files](#commnds-to-install) |
-| [%files](#environment) |
+| [%post](#commands-to-install) |
+| [%environment](#environment) |
Where to start
From 840cf5fb9eb92238aeedfb89984b315e67263434 Mon Sep 17 00:00:00 2001
From: Amber Lim
Date: Fri, 5 Jul 2024 17:21:14 -0500
Subject: [PATCH 19/22] Update tar.gz address
---
docs/materials/scaling/part1-ex1-organization.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/materials/scaling/part1-ex1-organization.md b/docs/materials/scaling/part1-ex1-organization.md
index 6acd606..81438c2 100644
--- a/docs/materials/scaling/part1-ex1-organization.md
+++ b/docs/materials/scaling/part1-ex1-organization.md
@@ -16,7 +16,7 @@ Make sure you are logged into `ap40.uw.osg-htc.org`.
To get the files for this exercise: -1. Type `wget https://github.com/osg-htc/user-school-2023/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz` to download the tarball. +1. Type `wget https://github.com/osg-htc/school-2024/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz` to download the tarball. 1. As you learned earlier, expand this tarball file; it will create a `organizing-files` directory. 1. Change to that directory, or create a separate one for this exercise and copy the files in. From ceea6be7f3397154f9727820358e3cb61ed483a2 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 5 Jul 2024 17:23:59 -0500 Subject: [PATCH 20/22] Update 2023 to 2024 --- .../workflows/part1-ex1-simple-dag.md | 240 +++++++++--------- 1 file changed, 120 insertions(+), 120 deletions(-) diff --git a/docs/materials/workflows/part1-ex1-simple-dag.md b/docs/materials/workflows/part1-ex1-simple-dag.md index e65710e..8458893 100644 --- a/docs/materials/workflows/part1-ex1-simple-dag.md +++ b/docs/materials/workflows/part1-ex1-simple-dag.md @@ -82,153 +82,153 @@ In the third window, watch what DAGMan does (what you see may be slightly differ ``` console username@ap40 $ tail -f --lines=500 simple.dag.dagman.out -08/02/23 15:44:57 ****************************************************** -08/02/23 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP -08/02/23 15:44:57 ** /usr/bin/condor_dagman -08/02/23 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) -08/02/23 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT -08/02/23 15:44:57 ** $CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ -08/02/23 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ -08/02/23 15:44:57 ** PID = 2340103 -08/02/23 15:44:57 ** Log last touched time unavailable (No such file or directory) -08/02/23 15:44:57 ****************************************************** -08/02/23 15:44:57 
Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS -08/02/23 15:44:57 DaemonCore: No command port requested. -08/02/23 15:44:57 DAGMAN_USE_STRICT setting: 1 -08/02/23 15:44:57 DAGMAN_VERBOSITY setting: 3 -08/02/23 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 -08/02/23 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False -08/02/23 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 -08/02/23 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 -08/02/23 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False -08/02/23 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 -08/02/23 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False -08/02/23 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 -08/02/23 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 -08/02/23 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 -08/02/23 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True -08/02/23 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 -08/02/23 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True -08/02/23 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False -08/02/23 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 -08/02/23 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 -08/02/23 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 -08/02/23 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 -08/02/23 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 -08/02/23 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True -08/02/23 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False -08/02/23 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False -08/02/23 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False -08/02/23 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit -08/02/23 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True -08/02/23 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False -08/02/23 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True -08/02/23 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True -08/02/23 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 -08/02/23 15:44:57 DAGMAN_AUTO_RESCUE setting: True -08/02/23 15:44:57 
DAGMAN_MAX_RESCUE_NUM setting: 100 -08/02/23 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True -08/02/23 15:44:57 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log -08/02/23 15:44:57 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True -08/02/23 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 -08/02/23 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 -08/02/23 15:44:57 ALL_DEBUG setting: -08/02/23 15:44:57 DAGMAN_DEBUG setting: -08/02/23 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False -08/02/23 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True -08/02/23 15:44:57 DAGMAN will adjust edges after parsing -08/02/23 15:44:57 argv[0] == "condor_scheduniv_exec.271100.0" -08/02/23 15:44:57 argv[1] == "-Lockfile" -08/02/23 15:44:57 argv[2] == "simple.dag.lock" -08/02/23 15:44:57 argv[3] == "-AutoRescue" -08/02/23 15:44:57 argv[4] == "1" -08/02/23 15:44:57 argv[5] == "-DoRescueFrom" -08/02/23 15:44:57 argv[6] == "0" -08/02/23 15:44:57 argv[7] == "-Dag" -08/02/23 15:44:57 argv[8] == "simple.dag" -08/02/23 15:44:57 argv[9] == "-Suppress_notification" -08/02/23 15:44:57 argv[10] == "-CsdVersion" -08/02/23 15:44:57 argv[11] == "$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $" -08/02/23 15:44:57 argv[12] == "-Dagman" -08/02/23 15:44:57 argv[13] == "/usr/bin/condor_dagman" -08/02/23 15:44:57 Default node log file is: -08/02/23 15:44:57 DAG Lockfile will be written to simple.dag.lock -08/02/23 15:44:57 DAG Input file is simple.dag -08/02/23 15:44:57 Parsing 1 dagfiles -08/02/23 15:44:57 Parsing simple.dag ... -08/02/23 15:44:57 Adjusting edges -08/02/23 15:44:57 Dag contains 1 total jobs -08/02/23 15:44:57 Bootstrapping... 
-08/02/23 15:44:57 Number of pre-completed nodes: 0 -08/02/23 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log -08/02/23 15:44:57 DAG status: 0 (DAG_STATUS_OK) -08/02/23 15:44:57 Of 1 nodes total: -08/02/23 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile -08/02/23 15:44:57 === === === === === === === === -08/02/23 15:44:57 0 0 0 0 1 0 0 0 -08/02/23 15:44:57 0 job proc(s) currently held -08/02/23 15:44:57 Registering condor_event_timer... -08/02/23 15:44:58 Submitting HTCondor Node Simple job(s)... +08/02/24 15:44:57 ****************************************************** +08/02/24 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP +08/02/24 15:44:57 ** /usr/bin/condor_dagman +08/02/24 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) +08/02/24 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT +08/02/24 15:44:57 ** $CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ +08/02/24 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ +08/02/24 15:44:57 ** PID = 2340103 +08/02/24 15:44:57 ** Log last touched time unavailable (No such file or directory) +08/02/24 15:44:57 ****************************************************** +08/02/24 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS +08/02/24 15:44:57 DaemonCore: No command port requested. 
+08/02/24 15:44:57 DAGMAN_USE_STRICT setting: 1 +08/02/24 15:44:57 DAGMAN_VERBOSITY setting: 3 +08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 +08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False +08/02/24 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 +08/02/24 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 +08/02/24 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False +08/02/24 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 +08/02/24 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False +08/02/24 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 +08/02/24 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 +08/02/24 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 +08/02/24 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True +08/02/24 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 +08/02/24 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True +08/02/24 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False +08/02/24 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 +08/02/24 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 +08/02/24 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 +08/02/24 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 +08/02/24 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 +08/02/24 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True +08/02/24 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False +08/02/24 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False +08/02/24 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False +08/02/24 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit +08/02/24 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True +08/02/24 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False +08/02/24 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True +08/02/24 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True +08/02/24 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 +08/02/24 15:44:57 DAGMAN_AUTO_RESCUE setting: True +08/02/24 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 +08/02/24 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True +08/02/24 15:44:57 
DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log +08/02/24 15:44:57 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True +08/02/24 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 +08/02/24 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 +08/02/24 15:44:57 ALL_DEBUG setting: +08/02/24 15:44:57 DAGMAN_DEBUG setting: +08/02/24 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False +08/02/24 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True +08/02/24 15:44:57 DAGMAN will adjust edges after parsing +08/02/24 15:44:57 argv[0] == "condor_scheduniv_exec.271100.0" +08/02/24 15:44:57 argv[1] == "-Lockfile" +08/02/24 15:44:57 argv[2] == "simple.dag.lock" +08/02/24 15:44:57 argv[3] == "-AutoRescue" +08/02/24 15:44:57 argv[4] == "1" +08/02/24 15:44:57 argv[5] == "-DoRescueFrom" +08/02/24 15:44:57 argv[6] == "0" +08/02/24 15:44:57 argv[7] == "-Dag" +08/02/24 15:44:57 argv[8] == "simple.dag" +08/02/24 15:44:57 argv[9] == "-Suppress_notification" +08/02/24 15:44:57 argv[10] == "-CsdVersion" +08/02/24 15:44:57 argv[11] == "$CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $" +08/02/24 15:44:57 argv[12] == "-Dagman" +08/02/24 15:44:57 argv[13] == "/usr/bin/condor_dagman" +08/02/24 15:44:57 Default node log file is: +08/02/24 15:44:57 DAG Lockfile will be written to simple.dag.lock +08/02/24 15:44:57 DAG Input file is simple.dag +08/02/24 15:44:57 Parsing 1 dagfiles +08/02/24 15:44:57 Parsing simple.dag ... +08/02/24 15:44:57 Adjusting edges +08/02/24 15:44:57 Dag contains 1 total jobs +08/02/24 15:44:57 Bootstrapping... 
+08/02/24 15:44:57 Number of pre-completed nodes: 0 +08/02/24 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log +08/02/24 15:44:57 DAG status: 0 (DAG_STATUS_OK) +08/02/24 15:44:57 Of 1 nodes total: +08/02/24 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile +08/02/24 15:44:57 === === === === === === === === +08/02/24 15:44:57 0 0 0 0 1 0 0 0 +08/02/24 15:44:57 0 job proc(s) currently held +08/02/24 15:44:57 Registering condor_event_timer... +08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... ``` **Here's where the job is submitted** ```file -08/02/23 15:44:58 Submitting HTCondor Node Simple job(s)... -08/02/23 15:44:58 Submitting node Simple from file job.sub using direct job submission -08/02/23 15:44:58 assigned HTCondor ID (271101.0.0) -08/02/23 15:44:58 Just submitted 1 job this cycle... +08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... +08/02/24 15:44:58 Submitting node Simple from file job.sub using direct job submission +08/02/24 15:44:58 assigned HTCondor ID (271101.0.0) +08/02/24 15:44:58 Just submitted 1 job this cycle... ``` **Here's where DAGMan noticed that the job is running** ```file -08/02/23 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/23 15:45:14} -08/02/23 15:45:18 Number of idle job procs: 0 +08/02/24 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:14} +08/02/24 15:45:18 Number of idle job procs: 0 ``` **Here's where DAGMan noticed that the job finished.** ```file -08/02/23 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/23 15:45:19} -08/02/23 15:45:23 Number of idle job procs: 0 -08/02/23 15:45:23 Node Simple job proc (271101.0.0) completed successfully. 
-08/02/23 15:45:23 Node Simple job completed -08/02/23 15:45:23 DAG status: 0 (DAG_STATUS_OK) -08/02/23 15:45:23 Of 1 nodes total: -08/02/23 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile -08/02/23 15:45:23 === === === === === === === === -08/02/23 15:45:23 1 0 0 0 0 0 0 0 +08/02/24 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:19} +08/02/24 15:45:23 Number of idle job procs: 0 +08/02/24 15:45:23 Node Simple job proc (271101.0.0) completed successfully. +08/02/24 15:45:23 Node Simple job completed +08/02/24 15:45:23 DAG status: 0 (DAG_STATUS_OK) +08/02/24 15:45:23 Of 1 nodes total: +08/02/24 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile +08/02/24 15:45:23 === === === === === === === === +08/02/24 15:45:23 1 0 0 0 0 0 0 0 ``` **Here's where DAGMan noticed that all the work is done.** ```file -08/02/23 15:45:23 All jobs Completed! -08/02/23 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) -08/02/23 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) -08/02/23 15:45:23 Note: 0 total job deferrals because of node category throttles -08/02/23 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER -08/02/23 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER -08/02/23 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER +08/02/24 15:45:23 All jobs Completed! 
+08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) +08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) +08/02/24 15:45:23 Note: 0 total job deferrals because of node category throttles +08/02/24 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER +08/02/24 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER +08/02/24 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER ``` Now verify your results: ``` console username@ap40 $ cat simple.log -000 (271101.000.000) 2023-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> +000 (271101.000.000) 2024-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> DAG Node: Simple ... -040 (271101.000.000) 2023-08-02 15:45:13 Started transferring input files +040 (271101.000.000) 2024-08-02 15:45:13 Started transferring input files Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> ... -040 (271101.000.000) 2023-08-02 15:45:13 Finished transferring input files +040 (271101.000.000) 2024-08-02 15:45:13 Finished transferring input files ... 
-021 (271101.000.000) 2023-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: +021 (271101.000.000) 2024-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 ... -001 (271101.000.000) 2023-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> +001 (271101.000.000) 2024-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu CondorScratchDir = "/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113" Cpus = 1 @@ -236,15 +236,15 @@ username@ap40 $ cat simple.log GLIDEIN_ResourceName = "PSU-LIGO" Memory = 1024 ... -006 (271101.000.000) 2023-08-02 15:45:19 Image size of job updated: 2296464 +006 (271101.000.000) 2024-08-02 15:45:19 Image size of job updated: 2296464 47 - MemoryUsage of job (MB) 47684 - ResidentSetSize of job (KB) ... 
-040 (271101.000.000) 2023-08-02 15:45:19 Started transferring output files +040 (271101.000.000) 2024-08-02 15:45:19 Started transferring output files ... -040 (271101.000.000) 2023-08-02 15:45:19 Finished transferring output files +040 (271101.000.000) 2024-08-02 15:45:19 Finished transferring output files ... -005 (271101.000.000) 2023-08-02 15:45:19 Job terminated. +005 (271101.000.000) 2024-08-02 15:45:19 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage @@ -259,7 +259,7 @@ username@ap40 $ cat simple.log Disk (KB) : 149 1048576 2699079 Memory (MB) : 47 1024 1024 - Job terminated of its own accord at 2023-08-02T20:45:19Z with exit-code 0. + Job terminated of its own accord at 2024-08-02T20:45:19Z with exit-code 0. ... ``` @@ -287,7 +287,7 @@ remove_kill_sig = SIGUSR1 # is killed (e.g., during a reboot). on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False -arguments = "-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2023-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman" +arguments = "-p 0 -f -l . 
-Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2024-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman" environment = "_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad" queue ``` From 955664e15866d30b2342a7c3d2dc8f55211eef78 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 5 Jul 2024 17:25:39 -0500 Subject: [PATCH 21/22] Update 2021 to 2024 --- docs/materials/workflows/part1-ex3-complex-dag.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/materials/workflows/part1-ex3-complex-dag.md b/docs/materials/workflows/part1-ex3-complex-dag.md index 08df972..fb8a310 100644 --- a/docs/materials/workflows/part1-ex3-complex-dag.md +++ b/docs/materials/workflows/part1-ex3-complex-dag.md @@ -146,7 +146,7 @@ Let’s follow the progress of the whole DAG: Total: 1 jobs; 1 running - Updated at 2021-07-28 13:52:57 + Updated at 2024-07-28 13:52:57 **DAGMan has submitted the goatbrot jobs, but they haven't started running yet** @@ -158,7 +158,7 @@ Let’s follow the progress of the whole DAG: Total: 5 jobs; 4 idle, 1 running - Updated at 2021-07-28 13:53:53 + Updated at 2024-07-28 13:53:53 **They're running** @@ -170,7 +170,7 @@ Let’s follow the progress of the whole DAG: Total: 5 jobs; 5 running - Updated at 2021-07-28 13:54:33 + Updated at 2024-07-28 13:54:33 **They finished, but DAGMan hasn't noticed yet. It only checks periodically:** @@ -182,7 +182,7 @@ Let’s follow the progress of the whole DAG: Total: 5 jobs; 4 completed, 1 running - Updated at 2021-07-28 13:55:13 + Updated at 2024-07-28 13:55:13 Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. 
From 1cb7d1b3e93c01a4abbeacc1983f4838b0d0b4e2 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 5 Jul 2024 17:29:28 -0500 Subject: [PATCH 22/22] learn->ap40, dates->2024, update condor_version --- .../workflows/part1-ex4-failed-dag.md | 202 +++++++++--------- 1 file changed, 101 insertions(+), 101 deletions(-) diff --git a/docs/materials/workflows/part1-ex4-failed-dag.md b/docs/materials/workflows/part1-ex4-failed-dag.md index 0011d12..d2e2c04 100644 --- a/docs/materials/workflows/part1-ex4-failed-dag.md +++ b/docs/materials/workflows/part1-ex4-failed-dag.md @@ -33,7 +33,7 @@ queue Submit the DAG again: ``` console -username@learn $ condor_submit_dag goatbrot.dag +username@ap40 $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out @@ -49,50 +49,50 @@ Submitting job(s). Use watch to watch the jobs until they finish. In a separate window, use `tail --lines=500 -f goatbrot.dag.dagman.out` to watch what DAGMan does. ``` console -06/22/12 17:57:41 Setting maximum accepts per cycle 8. 
-06/22/12 17:57:41 ****************************************************** -06/22/12 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP -06/22/12 17:57:41 ** /usr/bin/condor_dagman -06/22/12 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) -06/22/12 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON -06/22/12 17:57:41 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $ -06/22/12 17:57:41 ** $CondorPlatform: x86_64_rhap_5.7 $ -06/22/12 17:57:41 ** PID = 26867 -06/22/12 17:57:41 ** Log last touched time unavailable (No such file or directory) -06/22/12 17:57:41 ****************************************************** -06/22/12 17:57:41 Using config source: /etc/condor/condor_config -06/22/12 17:57:41 Using local config sources: -06/22/12 17:57:41 /etc/condor/config.d/00-chtc-global.conf -06/22/12 17:57:41 /etc/condor/config.d/01-chtc-submit.conf -06/22/12 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf -06/22/12 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf -06/22/12 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf -06/22/12 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf -06/22/12 17:57:41 /etc/condor/config.d/99-roy-extras.conf -06/22/12 17:57:41 /etc/condor/condor_config.local +06/22/24 17:57:41 Setting maximum accepts per cycle 8. 
+06/22/24 17:57:41 ****************************************************** +06/22/24 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP +06/22/24 17:57:41 ** /usr/bin/condor_dagman +06/22/24 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) +06/22/24 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON +06/22/24 17:57:41 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ +06/22/24 17:57:41 ** $CondorPlatform: x86_64_AlmaLinux9 $ +06/22/24 17:57:41 ** PID = 26867 +06/22/24 17:57:41 ** Log last touched time unavailable (No such file or directory) +06/22/24 17:57:41 ****************************************************** +06/22/24 17:57:41 Using config source: /etc/condor/condor_config +06/22/24 17:57:41 Using local config sources: +06/22/24 17:57:41 /etc/condor/config.d/00-chtc-global.conf +06/22/24 17:57:41 /etc/condor/config.d/01-chtc-submit.conf +06/22/24 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf +06/22/24 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf +06/22/24 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf +06/22/24 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf +06/22/24 17:57:41 /etc/condor/config.d/99-roy-extras.conf +06/22/24 17:57:41 /etc/condor/condor_config.local ``` Below is where DAGMan realizes that the montage node failed: ```console -06/22/12 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) -06/22/12 18:08:42 Number of idle job procs: 0 -06/22/12 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) -06/22/12 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) -06/22/12 18:08:42 Node montage job proc (82.0.0) failed with status 1. 
-06/22/12 18:08:42 Number of idle job procs: 0 -06/22/12 18:08:42 Of 5 nodes total: -06/22/12 18:08:42 Done Pre Queued Post Ready Un-Ready Failed -06/22/12 18:08:42 === === === === === === === -06/22/12 18:08:42 4 0 0 0 0 0 1 -06/22/12 18:08:42 0 job proc(s) currently held -06/22/12 18:08:42 Aborting DAG... -06/22/12 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... -06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) -06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) -06/22/12 18:08:42 Note: 0 total job deferrals because of node category throttles -06/22/12 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) -06/22/12 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) -06/22/12 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 +06/22/24 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) +06/22/24 18:08:42 Number of idle job procs: 0 +06/22/24 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) +06/22/24 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) +06/22/24 18:08:42 Node montage job proc (82.0.0) failed with status 1. +06/22/24 18:08:42 Number of idle job procs: 0 +06/22/24 18:08:42 Of 5 nodes total: +06/22/24 18:08:42 Done Pre Queued Post Ready Un-Ready Failed +06/22/24 18:08:42 === === === === === === === +06/22/24 18:08:42 4 0 0 0 0 0 1 +06/22/24 18:08:42 0 job proc(s) currently held +06/22/24 18:08:42 Aborting DAG... +06/22/24 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 
+06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) +06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) +06/22/24 18:08:42 Note: 0 total job deferrals because of node category throttles +06/22/24 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) +06/22/24 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) +06/22/24 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 ``` DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? @@ -100,10 +100,10 @@ DAGMan notices that one of the jobs failed because its exit code was non-zero. D Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. ``` console -username@learn $ cat goatbrot.dag.rescue001 +username@ap40 $ cat goatbrot.dag.rescue001 # Rescue DAG file, created after running # the goatbrot.dag DAG file -# Created 6/22/2012 23:08:42 UTC +# Created 6/22/2024 23:08:42 UTC # Rescue DAG version: 2.0.1 (partial) # # Total number of Nodes: 5 @@ -135,7 +135,7 @@ queue Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. ``` console -username@learn $ condor_submit_dag goatbrot.dag +username@ap40 $ condor_submit_dag goatbrot.dag Running rescue DAG 1 ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub @@ -148,47 +148,47 @@ Submitting job(s). 1 job(s) submitted to cluster 83. 
----------------------------------------------------------------------- -username@learn $ tail -f goatbrot.dag.dagman.out -06/23/12 11:30:53 ****************************************************** -06/23/12 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP -06/23/12 11:30:53 ** /usr/bin/condor_dagman -06/23/12 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) -06/23/12 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON -06/23/12 11:30:53 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $ -06/23/12 11:30:53 ** $CondorPlatform: x86_64_rhap_5.7 $ -06/23/12 11:30:53 ** PID = 28576 -06/23/12 11:30:53 ** Log last touched 6/22 18:08:42 -06/23/12 11:30:53 ****************************************************** -06/23/12 11:30:53 Using config source: /etc/condor/condor_config +username@ap40 $ tail -f goatbrot.dag.dagman.out +06/23/24 11:30:53 ****************************************************** +06/23/24 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP +06/23/24 11:30:53 ** /usr/bin/condor_dagman +06/23/24 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) +06/23/24 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON +06/23/24 11:30:53 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ +06/23/24 11:30:53 ** $CondorPlatform: x86_64_AlmaLinux9 $ +06/23/24 11:30:53 ** PID = 28576 +06/23/24 11:30:53 ** Log last touched 6/22 18:08:42 +06/23/24 11:30:53 ****************************************************** +06/23/24 11:30:53 Using config source: /etc/condor/condor_config ... ``` **Here is where DAGMAN notices that there is a rescue DAG** ```hl_lines="3" -06/23/12 11:30:53 Parsing 1 dagfiles -06/23/12 11:30:53 Parsing goatbrot.dag ... 
-06/23/12 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file -06/23/12 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -06/23/12 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 -06/23/12 11:30:53 Dag contains 5 total jobs +06/23/24 11:30:53 Parsing 1 dagfiles +06/23/24 11:30:53 Parsing goatbrot.dag ... +06/23/24 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file +06/23/24 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +06/23/24 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 +06/23/24 11:30:53 Dag contains 5 total jobs ``` **Shortly thereafter it sees that four jobs have already finished.** ```console -06/23/12 11:31:05 Bootstrapping... -06/23/12 11:31:05 Number of pre-completed nodes: 4 -06/23/12 11:31:05 Registering condor_event_timer... -06/23/12 11:31:06 Sleeping for one second for log file consistency -06/23/12 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log +06/23/24 11:31:05 Bootstrapping... +06/23/24 11:31:05 Number of pre-completed nodes: 4 +06/23/24 11:31:05 Registering condor_event_timer... +06/23/24 11:31:06 Sleeping for one second for log file consistency +06/23/24 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log ``` **Here is where DAGMan resubmits the montage job and waits for it to complete.** ```console -06/23/12 11:31:07 Submitting Condor Node montage job(s)... -06/23/12 11:31:07 submitting: condor_submit +06/23/24 11:31:07 Submitting Condor Node montage job(s)... +06/23/24 11:31:07 submitting: condor_submit -a dag_node_name' '=' 'montage -a +DAGManJobId' '=' '83 -a DAGManJobId' '=' '83 @@ -197,48 +197,48 @@ username@learn $ tail -f goatbrot.dag.dagman.out -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"g1,g2,g3,g4" montage.sub -06/23/12 11:31:07 From submit: Submitting job(s). 
-06/23/12 11:31:07 From submit: 1 job(s) submitted to cluster 84. -06/23/12 11:31:07 assigned Condor ID (84.0.0) -06/23/12 11:31:07 Just submitted 1 job this cycle... -06/23/12 11:31:07 Currently monitoring 1 Condor log file(s) -06/23/12 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) -06/23/12 11:31:07 Number of idle job procs: 1 -06/23/12 11:31:07 Of 5 nodes total: -06/23/12 11:31:07 Done Pre Queued Post Ready Un-Ready Failed -06/23/12 11:31:07 === === === === === === === -06/23/12 11:31:07 4 0 1 0 0 0 0 -06/23/12 11:31:07 0 job proc(s) currently held -06/23/12 11:40:22 Currently monitoring 1 Condor log file(s) -06/23/12 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) -06/23/12 11:40:22 Number of idle job procs: 0 -06/23/12 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) -06/23/12 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) +06/23/24 11:31:07 From submit: Submitting job(s). +06/23/24 11:31:07 From submit: 1 job(s) submitted to cluster 84. +06/23/24 11:31:07 assigned Condor ID (84.0.0) +06/23/24 11:31:07 Just submitted 1 job this cycle... +06/23/24 11:31:07 Currently monitoring 1 Condor log file(s) +06/23/24 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) +06/23/24 11:31:07 Number of idle job procs: 1 +06/23/24 11:31:07 Of 5 nodes total: +06/23/24 11:31:07 Done Pre Queued Post Ready Un-Ready Failed +06/23/24 11:31:07 === === === === === === === +06/23/24 11:31:07 4 0 1 0 0 0 0 +06/23/24 11:31:07 0 job proc(s) currently held +06/23/24 11:40:22 Currently monitoring 1 Condor log file(s) +06/23/24 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) +06/23/24 11:40:22 Number of idle job procs: 0 +06/23/24 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) +06/23/24 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) ``` **This is where the montage finished.** ```console -06/23/12 11:40:22 Node montage job proc (84.0.0) completed successfully. 
-06/23/12 11:40:22 Node montage job completed -06/23/12 11:40:22 Number of idle job procs: 0 -06/23/12 11:40:22 Of 5 nodes total: -06/23/12 11:40:22 Done Pre Queued Post Ready Un-Ready Failed -06/23/12 11:40:22 === === === === === === === -06/23/12 11:40:22 5 0 0 0 0 0 0 -06/23/12 11:40:22 0 job proc(s) currently held +06/23/24 11:40:22 Node montage job proc (84.0.0) completed successfully. +06/23/24 11:40:22 Node montage job completed +06/23/24 11:40:22 Number of idle job procs: 0 +06/23/24 11:40:22 Of 5 nodes total: +06/23/24 11:40:22 Done Pre Queued Post Ready Un-Ready Failed +06/23/24 11:40:22 === === === === === === === +06/23/24 11:40:22 5 0 0 0 0 0 0 +06/23/24 11:40:22 0 job proc(s) currently held ``` **And here DAGMan decides that the work is all done.** ```console -06/23/12 11:40:22 All jobs Completed! -06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) -06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) -06/23/12 11:40:22 Note: 0 total job deferrals because of node category throttles -06/23/12 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) -06/23/12 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) -06/23/12 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 +06/23/24 11:40:22 All jobs Completed! +06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) +06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) +06/23/24 11:40:22 Note: 0 total job deferrals because of node category throttles +06/23/24 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) +06/23/24 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) +06/23/24 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 ``` Success! Now go ahead and clean up.