diff --git a/docs/materials/checkpoint/files/.empty b/docs/materials/checkpoint/files/.empty new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/docs/materials/checkpoint/files/.empty @@ -0,0 +1 @@ + diff --git a/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pdf b/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pdf new file mode 100644 index 0000000..16b2624 Binary files /dev/null and b/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pdf differ diff --git a/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pptx b/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pptx new file mode 100644 index 0000000..38feece Binary files /dev/null and b/docs/materials/checkpoint/files/OSGUS2023_checkpointing.pptx differ diff --git a/docs/materials/checkpoint/part1-ex1-checkpointing.md b/docs/materials/checkpoint/part1-ex1-checkpointing.md new file mode 100644 index 0000000..6e96681 --- /dev/null +++ b/docs/materials/checkpoint/part1-ex1-checkpointing.md @@ -0,0 +1,145 @@ +--- +status: testing +--- + +# Self-Checkpointing Exercise 1.1: Trying It Out + +The goal of this exercise is to practice writing a submit file for self-checkpointing, +and to see the process in action. + +## Calculating Fibonacci numbers … slowly + +The sample code for this exercise calculates +[the Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number) +resulting from a given set of iterations. +Because this is a trival computation, +the code includes a delay in each iteration through the main loop; +this simulates a more intensive computation. + +To get set up: + +1. Log in to `ap40.uw.osg-htc.org` + (`ap1` is fine, too) + +1. Create and change into a new directory for this exercise + +1. Download the Python script that is the main executable for this exercise: + + :::console + user@server $ wget https://raw.githubusercontent.com/osg-htc/user-school-2022/main/src/checkpointing/fibonacci.py + +1. If you want to run the script directly, make it executable first: + + :::console + user@server $ chmod 0755 fibonacci.py + +Take a look at the code, if you like. +It is not very elegant, but it gets the job done. + +A few notes: + +* The script takes a single argument, the number of iterations to run. + To minimize computing time while leaving time to explore, `10` is a good number of iterations. + +* The script checkpoints every other iteration through the main loop. + The exit status code for a checkpoint is 85. + +* It prints some output to standard out along the way, to let you know what is going on. + +* The final result is written to a separate file named `fibonacci.result`. + This file does not exist until the very end of the complete run. + +* It is safe to run from the command line on an access point: + + :::console + user@server $ ./fibonacci.py 10 + + If you run it, what happens? (Due to the 30-second delay, be patient.) + Can you explain its behavior? + What happens if you run it again, without changing any files in between? Why? + +## Preparing to run + +Now you have an executable and you know how to run it. +It is time to prepare it for submission to HTCondor! + +Using what you know about the script (above), +and using information in the slides from today, +try writing a submit file that runs this software and +implements exit-driven self-checkpointing. +The Python code itself is ready and should not need any changes. + +Just use a plain `queue` statement, one job is enough to experiment on. + +**Before you submit,** read the next section first! 
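+
+If you get stuck, here is a minimal sketch of what such a submit file might look like (try writing your own before peeking; this is one possible answer, not the only one).
+The key line for exit-driven self-checkpointing is `checkpoint_exit_code`, which tells HTCondor that an exit code of 85 means "a checkpoint was taken, transfer the job's files and restart it" rather than "the job finished."
+The resource requests and the output/error/log filenames below are placeholder choices, and depending on how `fibonacci.py` names its checkpoint file, you may also want a `transfer_checkpoint_files` line:
+
+``` file
+# Sketch only -- one reasonable answer for this exercise
+executable           = fibonacci.py
+arguments            = 10
+
+# The script exits with code 85 after writing a checkpoint (see the notes above)
+checkpoint_exit_code = 85
+
+output = fibonacci.out
+error  = fibonacci.err
+log    = fibonacci.log
+
+request_cpus   = 1
+request_memory = 1GB
+request_disk   = 1GB
+
+queue
+```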
+ +## Running and monitoring + +With the 30-second delay per iteration in the code and the suggested 10 iterations, +once the script starts running you have about 5 minutes of runtime in which to see what is going on. +So it may help to read through this section *and then* return here and submit your job. + +If your job has problems or finishes before you have the chance to do all the steps below, +just remove the extra files (besides the Python script and your submit file) and try again! + +### Submission and first checkpoint + +1. Submit the job +1. Look at the contents of the submit directory — what changed? +1. Start watching the log file: `tail -n 100 -f YOUR-LOG-FILENAME.log` + +Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically. +Thus, nothing much will happen until HTCondor starts running your job. +When it does, you will see three sets of messages in the log file quickly: + +* `Started transferring input files` +* `Finished transferring input files` +* `Job executing on host:` + +(Of course, each message will contain a lot of other characters!) + +Now wait about 1 minute, and you should see two more messages appear: + +* `Started transferring output files` +* `Finished transferring output files` + +That is the first checkpoint happening! + +### Forcing your job to stop running + +Now, assuming that your job is still running (check `condor_q` again), +you can force HTCondor to remove (*evict*) your job before it finishes: + +1. Run `condor_q` to get the job ID of the running job +1. Run `condor_vacate_job JOB_ID`, where you replace `JOB_ID` with your job ID from above +1. Monitor the action again by running `tail -n 100 -f YOUR-LOG-FILENAME.log` + +### Finishing the job and wrap-up + +Be patient again! +You removed your running job, and so HTCondor put it back in the queue as idle. +If you wait a minute or two, you should see that HTCondor starts running the job again. + +1. In the log file, look carefully for the two `Job executing on host:` messages. + Does it seem like you ran on the same computer again or on a different one? + Both are possible! + +1. Let your job finish running this time. + There should be a `Job terminated of its own accord` message near the end. + +1. Did you get results? Go through all the files and see what they contain. + The log and output files are probably the most interesting. + But did you get a result file, too? + +Did the output file — +that is, whatever file you named in the `output` line of your submit file — +contain *everything* that you expected it to? + +## Conclusion + +This has been a brief and simple tour of self-checkpointing. +If you would like to learn more, please read +[the Self-Checkpointing Applications section](https://htcondor.readthedocs.io/en/latest/users-manual/self-checkpointing-applications.html) +of the HTCondor Manual. +Or talk to School staff about it. +Or contact support@osg-htc.org for further help at any time. 
diff --git a/docs/materials/data/files/osgus18-day4-part2-ex2-data-transfer.jpg b/docs/materials/data/files/osgus18-day4-part2-ex2-data-transfer.jpg
new file mode 100644
index 0000000..91121c2
Binary files /dev/null and b/docs/materials/data/files/osgus18-day4-part2-ex2-data-transfer.jpg differ
diff --git a/docs/materials/data/files/osgus19-day4-part2-CacheLocations.png b/docs/materials/data/files/osgus19-day4-part2-CacheLocations.png
new file mode 100644
index 0000000..3fd18ec
Binary files /dev/null and b/docs/materials/data/files/osgus19-day4-part2-CacheLocations.png differ
diff --git a/docs/materials/data/files/osgus23-data.pdf b/docs/materials/data/files/osgus23-data.pdf
new file mode 100644
index 0000000..2285283
Binary files /dev/null and b/docs/materials/data/files/osgus23-data.pdf differ
diff --git a/docs/materials/data/files/osgus23-data.pptx b/docs/materials/data/files/osgus23-data.pptx
new file mode 100644
index 0000000..22812df
Binary files /dev/null and b/docs/materials/data/files/osgus23-data.pptx differ
diff --git a/docs/materials/data/part1-ex1-data-needs.md b/docs/materials/data/part1-ex1-data-needs.md
new file mode 100644
index 0000000..6290da0
--- /dev/null
+++ b/docs/materials/data/part1-ex1-data-needs.md
@@ -0,0 +1,171 @@
+---
+status: testing
+---
+
+Data Exercise 1.1: Understanding Data Requirements
+===============================
+
+Exercise Goal
+-------------
+
+This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a
+large batch of jobs or using tools for delivering large data to jobs.
+In this exercise we will attempt to understand the input and output of the bioinformatics application
+[BLAST](http://blast.ncbi.nlm.nih.gov/).
+
+Setup
+-----
+
+For this exercise, we will use the `ap40.uw.osg-htc.org` access point. Log in:
+
+``` hl_lines="1"
+$ ssh <USERNAME>@ap40.uw.osg-htc.org
+```
+
+Create a directory for this exercise named `blast-data` and change into it.
+
+### Copy the Input Files ###
+
+To run BLAST, we need the executable, input file, and reference
+database. For this example, we'll use the "pdbaa" database, which
+contains sequences for the protein structures from the Protein Data Bank.
+For our input file, we'll use an abbreviated fasta file with mouse
+genome information.
+
+1. Copy the BLAST executables:
+
+        :::console
+        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.12.0+-x64-linux.tar.gz
+        user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz
+
+1. Download these files to your current directory:
+
+        :::console
+        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/pdbaa.tar.gz
+        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse.fa
+
+1. Untar the `pdbaa` database:
+
+        :::console
+        user@ap40 $ tar -xzvf pdbaa.tar.gz
+
+Understanding BLAST
+-------------------
+
+Remember that `blastx` is executed in a command like the following:
+
+``` console
+user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db <database> -query <input file> -out <output file>
+```
+
+In the above, the `<input file>` is the name of a file containing a number of genetic sequences (e.g. `mouse.fa`), and
+the database that these are compared against is made up of several files that begin with the same `<database>` name
+(e.g. `pdbaa/pdbaa`).
+The output from this analysis will be printed to the `<output file>` that is also indicated in the command.
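+
+For example, with the files you just unpacked, a concrete version of that command might look like the following
+(you do not need to run it for this exercise; it is shown only to make the placeholders concrete, and the
+output filename `mouse.fa.result` is just one reasonable choice):
+
+``` console
+user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out mouse.fa.result
+```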
+ +Calculating Data Needs +---------------------- + +Using the files that you prepared in `blast-data`, we will calculate how much disk space is needed if we were to +run a hypothetical BLAST job with a wrapper script, where the job: + +- Transfers all of its input files (including the executable) as tarballs +- Untars the input files tarballs on the execute host +- Runs `blastx` using the untarred input files + +Here are some commands that will be useful for calculating your job's storage needs: + +- List the size of a specific file: + + :::console + user@ap40 $ ls -lh + +- List the sizes of all files in the current directory: + + :::console + user@ap40 $ ls -lh + +- Sum the size of all files in a specific directory: + + :::console + user@ap40 $ du -sh + +### Input requirements + +Total up the amount of data in all of the files necessary to run the `blastx` wrapper job, including the executable itself. +Write down this number. +Also take note of how much total data is in the `pdbaa` directory. + +!!! note "Compressed Files" + Remember, `blastx` reads the un-compressed `pdbaa` files. + +### Output requirements + +The output that we care about from `blastx` is saved in the file whose name is indicated after the `-out` argument to +`blastx`. +Also, remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. +Are there any other files? +Total all of these together, as well. + + + + + +Up next! +-------- + +Next you will create a HTCondor submit script to transfer the Blast input files in order to run Blast on a worker nodes. +[Next Exercise](../part1-ex2-file-transfer) diff --git a/docs/materials/data/part1-ex2-file-transfer.md b/docs/materials/data/part1-ex2-file-transfer.md new file mode 100644 index 0000000..9ce65e6 --- /dev/null +++ b/docs/materials/data/part1-ex2-file-transfer.md @@ -0,0 +1,227 @@ +--- +status: testing +--- + +Data Exercise 1.2: transfer\_input\_files, transfer\_output\_files, and remaps +================================================== + +Exercise Goal +------------- + +The objective of this exercise is to refresh yourself on HTCondor file +transfer, to implement file compression, and to begin examining the +memory and disk space used by your jobs in order to plan larger batches. +We will also explore ways to deal with output data. + +Setup +----- + +The executable we'll use in this exercise and later today is the same +`blastx` executable from previous exercises. Log in to ap40: + +``` hl_lines="1" +$ ssh @ap40.uw.osg-htc.org +``` + +Then change into the `blast-data` folder that you created in the +previous exercise. + +### Review: HTCondor File Transfer + +![OSG data transfer](../files/osgus18-day4-part2-ex2-data-transfer.jpg) + +Recall that OSG does **NOT** have a shared filesystem! Instead, +HTCondor *transfers* your executable and input files (specified with +the `executable` and `transfer_input_files` submit file directives, +respectively) to a working directory on the execute node, regardless of +how these files were arranged on the submit node. In this exercise we'll +use the same `blastx` example job that we used previously, but modify +the submit file and test how much memory and disk space it uses on the +execute node. + +Start with a test submit file +----------------------------- + +We've started a submit file for you, below, which you'll add to in the remaining steps. 
+ +``` file +executable = +transfer_input_files = +output = test.out +error = test.err +log = test.log +request_memory = +request_disk = +request_cpus = 1 +requirements = (OSGVO_OS_STRING == "RHEL 8") +queue +``` + +### Implement file compression + +In our first blast job from the Software exercises ([1.1](../../software/part1-ex1-download)), the database files in the `pdbaa` directory were all transferred, as is, but we +could instead transfer them as a single, compressed file using `tar`. +For this version of the job, let's compress our blast database files to send them to the submit node as a single +`tar.gz` file (otherwise known as a tarball), by following the below steps: + +1. Change into the `pdbaa` directory and compress the database files into a single file called `pdbaa_files.tar.gz` + using the `tar` command. + Note that this file will be different from the `pdbaa.tar.gz` file that you used earlier, because it will only + contain the `pdbaa` files, and not the `pdbaa` directory, itself.) + + Remember, a typical command for creating a tar file is: + + :::console + user@ap40 $ tar -cvzf + + + Replacing `` with the name of the tarball that you would like to create and + `` with a space-separated list of files and/or directories that you want inside pdbaa_files.tar.gz. + Move the resulting tarball to the `blast-data` directory. + +2. Create a wrapper script that will first decompress the `pdbaa_files.tar.gz` file, and then run blast. + + Because this file will now be our `executable` in the submit file, we'll also end up transferring the `blastx` executable + with `transfer_input_files`. + In the `blast-data` directory, create a new file, called `blast_wrapper.sh`, with the following contents: + + :::file + #!/bin/bash + + tar -xzvf pdbaa_files.tar.gz + + ./blastx -db pdbaa -query mouse.fa -out mouse.fa.result + + rm pdbaa.* + + Also remember to make the script executable: `chmod +x blast_wrapper.sh` + + !!! warning "Extra Files!" + The last line removes the resulting database files that came from `pdbaa_files.tar.gz`, as these files would + otherwise be copied back to the submit server as perceived output since they're "new" files that HTCondor + didn't transfer over as input. + +### List the executable and input files + +Make sure to update the submit file with the following: + +- Add the new `executable` (the wrapper script you created above) +- In `transfer_input_files`, list the `blastx` binary, the `pdbaa_files.tar.gz` file, and the input query file. + +!!! note "Commas, commas everywhere!" + Remember that `transfer_input_files` accepts a comma separated list of files, and that you need to list the full + location of the `blastx` executable (`blastx`). + There will be no arguments, since the arguments to the `blastx` command are now captured in the wrapper script. + +### Predict memory and disk requests from your data + +Also, think about how much memory and disk to request for this job. +It's good to start with values that are a little higher than you think a test job will need, but think about: + +- How much memory `blastx` would use if it loaded all of the database files *and* the query input file into memory. +- How much disk space will be necessary on the execute server for the executable, all input files, and all output + files (hint: the log file only exists on the submit node). 
+- Whether you'd like to request some extra memory or disk space, just in case
+
+Look at the `log` file for your `blastx` job from Software exercise ([1.1](../../software/part1-ex1-download)), and compare the memory and disk "Usage" to what you predicted
+from the files.
+Make sure to update the submit file with more accurate memory and disk requests. You may still want to request slightly
+more than the job actually used.
+
+Run the test job
+----------------
+
+Once you have finished editing the submit file, go ahead and submit the job.
+It should take a few minutes to complete, and then you can check to make sure that no unwanted files (especially the
+`pdbaa` database files) were copied back at the end of the job.
+
+Run a **`du -sh`** on the directory with this job's input.
+How does it compare to the directory from Software exercise ([1.1](../../software/part1-ex1-download)), and why?
+
+transfer\_output\_files
+-----------------------
+
+So far, we have relied on HTCondor's detection of new files to transfer back
+the newly created output. An alternative is to be explicit, using the
+`transfer_output_files` attribute in the submit file. The upside to this
+approach is that you can choose to transfer back only a subset of the
+created files. The downside is that you have to know which files are
+created.
+
+The first exercise is to modify the submit file from the previous
+example, and add a line like this (remember, before the `queue`):
+
+    :::file
+    transfer_output_files = mouse.fa.result
+
+You may also remove the last line in `blast_wrapper.sh`, the
+`rm pdbaa.*`, as extra files are no longer an issue: those files
+will be ignored because we used `transfer_output_files`.
+
+Submit the job, and make sure everything works. Did you get
+any `pdbaa.*` files back?
+
+The next thing we should try is to see what happens if the
+file we specify does not exist. Modify your submit file,
+and change the `transfer_output_files` to:
+
+    :::file
+    transfer_output_files = elephant.fa.result
+
+Submit the job and see how it behaves. Did it finish successfully?
+
+transfer\_output\_remaps
+------------------------
+
+Related to `transfer_output_files` is `transfer_output_remaps`,
+which allows us to rename outputs, or map the outputs to
+a different storage system (which will be explored in the next
+module).
+
+The format of the `transfer_output_remaps` attribute is a
+list of remaps, each remap taking the form of `src=dst`.
+The destination can be a local path, or a URL. For example:
+
+    :::file
+    transfer_output_remaps = "myresults.dat = s3://destination-server.com/myresults.dat"
+
+If you have more than one remap, you can separate them with
+`;`.
+
+By now, your `blast-data` directory is probably starting
+to look messy with a mix of submit files, input data,
+log files, and output data all intermingled. One improvement
+could be to map our outputs to a separate directory. Create
+a new directory named `science-results`.
+
+Add a `transfer_output_remaps` line to the submit file.
+It is common to place this line right after the
+`transfer_output_files` line. Change the
+`transfer_output_files` back to `mouse.fa.result`.
+Example:
+
+    :::file
+    transfer_output_files = mouse.fa.result
+    transfer_output_remaps =
+
+Fill out the remap line, mapping `mouse.fa.result` to the
+destination `science-results/mouse.fa.result`. Remember
+that the `transfer_output_remaps` value requires double
+quotes around it.
+
+Submit the job, and wait for it to complete. Were there
+any errors? Can you find `mouse.fa.result`?
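+
+If you want to check your work, the pair of lines in your submit file might end up looking something like this
+(a sketch; the rest of your submit file stays as before):
+
+    :::file
+    transfer_output_files  = mouse.fa.result
+    transfer_output_remaps = "mouse.fa.result = science-results/mouse.fa.result"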
+ +Conclusions +----------- + +In this exercise, you: + +- Used your data requirements knowledge from the [previous exercise](../part1-ex1-data-needs) to write a job. +- Executed the job on a remote worker node and took note of the data usage. +- Used `transfer_input_files` to transfer inputs +- Used `transfer_output_files` to transfer outputs +- Used `transfer_output_remaps` to map outputs to a different destination + +When you've completed the above, continue with the [next exercise](../part1-ex3-blast-split). + diff --git a/docs/materials/data/part1-ex3-blast-split.md b/docs/materials/data/part1-ex3-blast-split.md new file mode 100644 index 0000000..bb7a724 --- /dev/null +++ b/docs/materials/data/part1-ex3-blast-split.md @@ -0,0 +1,134 @@ +--- +status: testing +--- + +Data Exercise 1.3: Splitting Large Input for Better Throughput +================================================================== + + +The objective of this exercise is to prepare for _blasting_ a much larger input query file by splitting the input for +greater throughput and lower memory and disk requirements. +Splitting the input will also mean that we don't have to rely on additional large-data measures for the input query files. + +Setup +----- + +1. Log in to `ap40.uw.osg-htc.org` + +1. Create a directory for this exercise named `blast-split` and change into it. +1. Copy over the following files from the [previous exercise](../part1-ex2-file-transfer): + - Your submit file + - `blastx` + - `pdbaa_files.tar.gz` + - `blast_wrapper.sh` +1. Remember to modify the submit file for the new locations of the above files. + +### Obtain the large input + +We've previously used `blastx` to analyze a relatively small input file of test data, `mouse.fa`, but let's imagine that +you now need to blast a much larger dataset for your research. +This dataset can be downloaded with the following command: + +``` console +user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse_rna.tar.gz +``` + +After un-tar'ing (`tar xzf mouse_rna.tar.gz`) the file, you should be able to confirm that it's size is roughly 100 MB. +Not only is this near the size cutoff for HTCondor file transfer, it would take hours to complete a single `blastx` +analysis for it and the resulting output file would be huge. + +### Split the input file + +For `blast`, it's scientifically valid to split up the input query file, analyze the pieces, and then put the results +back together at the end! +On the other hand, BLAST databases should not be split, because the `blast` output includes a score value for each +sequence that is calculated relative to the entire length of the database. + +Because genetic sequence data is used heavily across the life sciences, there are also tools for splitting up the data +into smaller files. +One of these is called [genome tools](http://genometools.org/), and you can download a package of precompiled binaries +(just like BLAST) using the following command: + +``` console +user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/gt-1.5.10-Linux_x86_64-64bit-complete.tar.gz +``` + +Un-tar the gt package (`tar -xzvf ...`), then run its sequence file splitter as follows, with the target file size of 1MB: + +``` console +user@ap40 $ ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 1 mouse_rna.fa +``` + +You'll notice that the result is a set of 100 files, all about the size of 1 MB, and numbered 1 through 100. 
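+
+A quick way to confirm this (optional, and assuming the split files follow the `mouse_rna.fa.NUMBER` naming used in the rest of these exercises) is to count them:
+
+``` console
+user@ap40 $ ls mouse_rna.fa.* | wc -l
+100
+```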
+
+Run Jobs on Split Input
+--------------------
+
+Now, you'll submit jobs on the split input files, where each job will use a different piece of the large original input file.
+
+### Modify the submit file
+
+First, you'll create a new submit file that passes the input filename as an argument and uses a list of applicable
+filenames.
+Follow the steps below:
+
+1. Copy the submit file from the [previous exercise](../part1-ex2-file-transfer) to a new file called `blast_split.sub` and modify the "queue" line of the submit file to the
+   following:
+
+        queue inputfile matching mouse_rna.fa.*
+
+
+2. Replace the `mouse.fa` instances in the submit file with `$(inputfile)`, and rename the output, log, and error files
+   to use the same `inputfile` variable:
+
+        output = $(inputfile).out
+        error = $(inputfile).err
+        log = $(inputfile).log
+
+3. Add an `arguments` line to the submit file so it will pass the name of the input file to the wrapper script:
+
+        arguments = $(inputfile)
+
+4. Add `$(inputfile)` to the end of your list of `transfer_input_files`:
+
+        transfer_input_files = ... , $(inputfile)
+
+5. Update the memory and disk requests, since the new input file is larger and will also produce larger output.
+   It may be best to overestimate to something like 1 GB for each.
+
+### Modify the wrapper file
+
+Replace instances of the input file name in the `blast_wrapper.sh` script so that it will insert the first argument in
+place of the input filename, like so:
+
+``` file
+./blastx -db pdbaa -query $1 -out $1.result
+```
+
+!!! note
+    Bash shell scripts will use the first argument in place of `$1`, the second argument as `$2`, etc.
+
+### Submit the jobs
+
+This job will take a bit longer than the job in the last exercise, since the input file is larger (by about 3-fold).
+Again, make sure that only the desired `output`, `error`, and `result` files come back at the end of the job.
+In our tests, the jobs ran for ~15 minutes.
+
+!!! warning "Jobs on jobs!"
+    Be careful not to submit the job again. Why? Our queue statement says `... matching mouse_rna.fa.*`, and look at
+    the current directory.
+    There are new files named `mouse_rna.fa.X.log` and other files.
+    If you submitted again, the `queue` statement would see these new files, and try to run blast on them!
+
+    If you want to remove all of the extra files, you can try:
+
+        :::console
+        user@ap40 $ rm *.err *.log *.out *.result
+
+Update the resource requests
+----------------------------
+
+After the job finishes successfully, examine the `log` file for memory and disk usage, and update the requests in the
+submit file.
+
+
diff --git a/docs/materials/data/part2-ex1-osdf-inputs.md b/docs/materials/data/part2-ex1-osdf-inputs.md
new file mode 100644
index 0000000..bc2e3e5
--- /dev/null
+++ b/docs/materials/data/part2-ex1-osdf-inputs.md
@@ -0,0 +1,97 @@
+---
+status: reviewed
+---
+
+Data Exercise 2.1: Using OSDF for Large Shared Data
+===================================================
+
+This exercise will use a [BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome) workflow to
+demonstrate the functionality of OSDF for transferring input files to jobs on OSG.
+
+Because our individual blast jobs from previous exercises would take a bit longer
+with a larger database (too long for a workable exercise), we'll imagine for this exercise that our
+`pdbaa_files.tar.gz` file is too large for `transfer_input_files` (larger than ~1 GB).
+For this exercise, we will use the same inputs, but instead of using `transfer_input_files` for the `pdbaa` database,
+we will place it in OSDF and have the jobs download it from there.
+
+OSDF is connected to a distributed set of caches spread across the U.S.
+They are connected with high bandwidth connections to each other, and to the data origin servers, where your data is
+originally placed.
+
+![OSDF Map](../files/osgus19-day4-part2-CacheLocations.png)
+
+Setup
+-----
+
+- Make sure you're logged in to `ap40.uw.osg-htc.org`
+- Copy the following files from the previous Blast exercises to a new directory in `/home/` called `osdf-shared`:
+    - `blast_wrapper.sh`
+    - `blastx`
+    - `mouse_rna.fa.1`
+    - `mouse_rna.fa.2`
+    - `mouse_rna.fa.3`
+    - Your most recent submit file (probably named `blast_split.sub`)
+
+Place the Database in OSDF
+--------------------------------
+
+### Copy your data to the OSDF space
+
+OSDF provides a directory for you to store data which can be accessed through the caching servers.
+First, you need to move your BLAST database (`pdbaa_files.tar.gz`) into this directory. For `ap40.uw.osg-htc.org`, the directory
+to use is `/ospool/PROTECTED/[USERNAME]/`
+
+As the `PROTECTED` directory name indicates, files placed in this directory will only be accessible
+by your own jobs.
+
+Modify the Submit File and Wrapper
+----------------------------------
+
+You will have to modify the wrapper and submit file to use OSDF:
+
+1. HTCondor knows how to do OSDF transfers, so you just have to provide the correct URL in
+   `transfer_input_files`. Note that there is no server name (hence the 3 slashes in `osdf:///`); the path is
+   instead based on the namespace (`/ospool/PROTECTED` in this case):
+
+        :::file
+        transfer_input_files = blastx, $(inputfile), osdf:///ospool/PROTECTED/[USERNAME]/pdbaa_files.tar.gz
+
+1. Confirm that your queue statement is correct for the current directory. It should be something like:
+
+        :::file
+        queue inputfile matching mouse_rna.fa.*
+
+Also confirm that `mouse_rna.fa.*` files exist in the current directory (you should have copied a few of them from the previous exercise
+directory).
+
+Submit the Job
+--------------
+
+Now submit and monitor the job! If your 100 jobs from the previous exercise haven't started running yet, this job will
+not start yet.
+However, after it has been running for ~2 minutes, you're safe to continue to the next exercise!
+
+Considerations
+--------------
+
+1. Why did we not place all files in OSDF (for example, `blastx` and `mouse_rna.fa.*`)?
+
+1. What do you think will happen if you make changes to `pdbaa_files.tar.gz`? Will the caches
+   be updated automatically, or is there a possibility that the old version of
+   `pdbaa_files.tar.gz` will be served up to jobs? What is the solution to this problem?
+   (Hint: OSDF only considers the filename when caching data.)
+
+Note: Keeping OSDF 'Clean'
+--------------------------------
+
+Just as for any data directory, it is VERY important to remove old files from OSDF when you no longer need them,
+especially so that you'll have plenty of space for such files in the future.
+For example, you would delete (`rm`) files from `/ospool/PROTECTED/[USERNAME]/` when you don't need them there
+anymore, but only after all jobs have finished.
+The next time you use OSDF after the school, remember to first check for old files that you can delete.
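+
+For example, once every job that reads the database has finished, a cleanup command might look like this
+(a sketch; adjust the path and filename to whatever you actually placed in OSDF):
+
+``` console
+user@ap40 $ rm /ospool/PROTECTED/[USERNAME]/pdbaa_files.tar.gz
+```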
+ +Next exercise +------------- + +Once completed, move onto the next exercise: [Using OSDF for outputs](../part2-ex2-osdf-outputs) + diff --git a/docs/materials/data/part2-ex2-osdf-outputs.md b/docs/materials/data/part2-ex2-osdf-outputs.md new file mode 100644 index 0000000..750412f --- /dev/null +++ b/docs/materials/data/part2-ex2-osdf-outputs.md @@ -0,0 +1,216 @@ +--- +status: reviewed +--- + +Data Exercise 2.2: Using OSDF for outputs +========================================================= + +In this exercise, we will run a multimedia program that converts and manipulates video files. +In particular, we want to convert large `.mov` files to smaller (10-100s of MB) `mp4` files. +Just like the Blast database in the [previous exercise](../part2-ex1-osdf-inputs), these video +files are potentially too large to send to jobs using HTCondor's default file transfer for +inputs/outputs, so we will use OSDF. + +Data +---- + +To get the exercise set up: + +1. Log into `ap40.uw.osg-htc.org` + +1. Create a directory for this exercise named `osdf-outputs` and change into it. + +1. Download the input data and store it under the OSDF directory (`cd` to that + directory first): + + :::console + user@ap40 $ cd /ospool/PROTECTED/[USERNAME]/ + user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ducks.mov + user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/teaching.mov + user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/test_open_terminal.mov + +1. We're going to need a list of these files later. Below is the final list of movie files. + `cd` back to your `osdf-outputs` directory and create a file named `movie_list.txt`, + with the following content: + + :::file + ducks.mov + teaching.mov + test_open_terminal.mov + +Software +-------- + +We'll be using a multi-purpose media tool called `ffmpeg` to convert video formats. +The basic command to convert a file looks like this: + +``` console +user@ap40 $ ./ffmpeg -i input.mov output.mp4 +``` + +In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting +file is smaller. + +``` console +user@ap40 $ ./ffmpeg -i input.mp4 -b:v 400k -s 640x360 output.mp4 +``` + +To get the `ffmpeg` binary do the following: + +1. We'll be downloading the `ffmpeg` pre-built static binary originally from this page: . + + :::console + user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ffmpeg-release-64bit-static.tar.xz + +1. Once the binary is downloaded, un-tar it, and then copy the main `ffmpeg` program into your current directory: + + :::console + user@ap40 $ tar -xf ffmpeg-release-64bit-static.tar.xz + user@ap40 $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./ + +Script +------ + +We want to write a script that runs on the worker node that uses `ffmpeg` to convert a `.mov` file to a smaller format. +Our script will need to run the proper executable. Create a file called `run_ffmpeg.sh`, that does the steps described above. +Use the name of the smallest `.mov` file in the `ffmpeg` command. +An example of that script is below: + + :::bash + #!/bin/bash + + ./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4 + +Ultimately we'll want to submit several jobs (one for each `.mov` file), but to start with, we'll run one job to make +sure that everything works. + +Submit File +----------- + +Create a submit file for this job, based on other submit files from the school. Things to consider: + +1. 
We'll be copying the video file into the job's working directory from OSDF, so make sure to request enough disk space for the
+   input `mov` file and the output `mp4` file.
+   If you aren't sure how much to request, ask a helper.
+
+1. Add the same requirements as the previous exercise:
+
+        requirements = (OSGVO_OS_STRING == "RHEL 8")
+
+1. We need to transfer the `ffmpeg` program that we downloaded above, and the movie from OSDF:
+
+        transfer_input_files = ffmpeg, osdf:///ospool/PROTECTED/[USERNAME]/test_open_terminal.mov
+
+1. Transfer outputs via OSDF. This requires a transfer remap:
+
+        transfer_output_files = test_open_terminal.mp4
+        transfer_output_remaps = "test_open_terminal.mp4 = osdf:///ospool/PROTECTED/[USERNAME]/test_open_terminal.mp4"
+
+
+Initial Job
+-----------
+
+With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected:
+
+1. Check the OSDF directory. Did the output `.mp4` file return?
+2. Check file sizes. How big is the returned `.mp4` file? How does that compare to the original `.mov` input?
+
+If your job successfully returned the converted `.mp4` file and did **not** transfer the `.mov` file to the submit
+server, and the `.mp4` file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded
+to OSDF.
+
+Multiple jobs
+-------------
+
+We wrote the name of the `.mov` file into our `run_ffmpeg.sh` executable script.
+To submit a set of jobs for all of our `.mov` files, what will we need to change in:
+
+1. The script?
+1. The submit file?
+
+Once you've thought about it, check your reasoning against the instructions below.
+
+### Add an argument to your script
+
+**Look at your `run_ffmpeg.sh` script. What values will change for every job?**
+
+The input file will change with every job, and don't forget that the output file will too! Let's make them both into
+arguments.
+
+To add arguments to a bash script, we use the notation `$1` for the first argument (our input file) and `$2` for the
+second argument (our output file name).
+The final script should look like this:
+
+``` file
+#!/bin/bash
+
+./ffmpeg -i $1 -b:v 400k -s 640x360 $2
+```
+
+Note that the original file name appeared in more than one place in our script (for both the input and the output), so make sure every occurrence is replaced by the appropriate argument.
+
+### Modify your submit file
+
+1. We now need to tell each job what arguments to use.
+   We will do this by adding an arguments line to our submit file.
+   Because we'll only have the input file name, the "output" file name will be the input file name with the `mp4`
+   extension.
+   That should look like this:
+
+        :::file
+        arguments = $(mov) $(mov).mp4
+
+1. Update the `transfer_input_files` to have `$(mov)`:
+
+        :::file
+        transfer_input_files = ffmpeg, osdf:///ospool/PROTECTED/[USERNAME]/$(mov)
+
+1. Similarly, update the output/remap with `$(mov).mp4`:
+
+        :::file
+        transfer_output_files = $(mov).mp4
+        transfer_output_remaps = "$(mov).mp4 = osdf:///ospool/PROTECTED/[USERNAME]/$(mov).mp4"
+
+1. To set these arguments, we will use the `queue ... from` syntax.
+   In our submit file, we can then change our queue statement to:
+
+        queue mov from movie_list.txt
+
+Once you've made these changes, try submitting all the jobs!
+
+Bonus
+-----
+
+If you wanted to set a different output file name, bitrate, and/or size for each original movie, how could you modify:
+
+1. `movie_list.txt`
+2. Your submit file
+3. `run_ffmpeg.sh`
+
+to do so?
+
+
+??? "Show hint"
+
+    Here are the changes you can make to the various files:
+
+    1.
`movie_list.txt` + + ducks.mov ducks.mp4 500k 1280x720 + teaching.mov teaching.mp4 400k 320x180 + test_open_terminal.mov terminal.mp4 600k 640x360 + + 1. Submit file + + arguments = $(mov) $(mp4) $(bitrate) $(size) + + queue mov,mp4,bitrate,size from movie_list.txt + + + 1. `run_ffmpeg.sh` + + #!/bin/bash + ./ffmpeg -i $1 -b:v $3 -s $4 $2 + + + diff --git a/docs/materials/facilitation/files/osgus23-facilitation-campuses.pdf b/docs/materials/facilitation/files/osgus23-facilitation-campuses.pdf new file mode 100644 index 0000000..c4f4f64 Binary files /dev/null and b/docs/materials/facilitation/files/osgus23-facilitation-campuses.pdf differ diff --git a/docs/materials/final/files/osgus23-day5-part6-forward-timc.pdf b/docs/materials/final/files/osgus23-day5-part6-forward-timc.pdf new file mode 100644 index 0000000..007db0c Binary files /dev/null and b/docs/materials/final/files/osgus23-day5-part6-forward-timc.pdf differ diff --git a/docs/materials/htcondor/files/.empty b/docs/materials/htcondor/files/.empty new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/docs/materials/htcondor/files/.empty @@ -0,0 +1 @@ + diff --git a/docs/materials/htcondor/files/osgus23-htc-htcondor-multiple-jobs.pdf b/docs/materials/htcondor/files/osgus23-htc-htcondor-multiple-jobs.pdf new file mode 100644 index 0000000..ba69798 Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-htc-htcondor-multiple-jobs.pdf differ diff --git a/docs/materials/htcondor/files/osgus23-htc-htcondor.pdf b/docs/materials/htcondor/files/osgus23-htc-htcondor.pdf new file mode 100644 index 0000000..40203ea Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-htc-htcondor.pdf differ diff --git a/docs/materials/htcondor/files/osgus23-htc-htcondor.pptx b/docs/materials/htcondor/files/osgus23-htc-htcondor.pptx new file mode 100644 index 0000000..863724f Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-htc-htcondor.pptx differ diff --git a/docs/materials/htcondor/files/osgus23-htc-worksheet.pdf b/docs/materials/htcondor/files/osgus23-htc-worksheet.pdf new file mode 100644 index 0000000..750730a Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-htc-worksheet.pdf differ diff --git a/docs/materials/htcondor/files/osgus23-intro-to-htc.pdf b/docs/materials/htcondor/files/osgus23-intro-to-htc.pdf new file mode 100644 index 0000000..7a09eb7 Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-intro-to-htc.pdf differ diff --git a/docs/materials/htcondor/files/osgus23-intro-to-htc.pptx b/docs/materials/htcondor/files/osgus23-intro-to-htc.pptx new file mode 100644 index 0000000..33db495 Binary files /dev/null and b/docs/materials/htcondor/files/osgus23-intro-to-htc.pptx differ diff --git a/docs/materials/htcondor/part1-ex1-login.md b/docs/materials/htcondor/part1-ex1-login.md new file mode 100644 index 0000000..a1b0d1f --- /dev/null +++ b/docs/materials/htcondor/part1-ex1-login.md @@ -0,0 +1,107 @@ +--- +status: testing +--- + + + +HTC Exercise 1.1: Log In and Look Around +=========================================== + +Background +---------- + +There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. 
For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources.
+
+During the OSG School, you will practice on two different HTC systems: the "[PATh Facility](https://path-cc.io/facility/)" and "[OSG's Open Science Pool (OSPool)](https://osg-htc.org/services/open_science_pool.html)". This will help prepare you for working on a variety of different HTC systems.
+
+* PATh Facility: The PATh Facility provides researchers with **dedicated HTC resources and the ability to run larger and longer jobs**. The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs.
+* OSG's Open Science Pool: The OSPool provides researchers with **opportunistic resources and the ability to run many smaller and shorter jobs simultaneously**. The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs.
+
+Exercise Goal
+---
+
+The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes.
+
+**If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises.**
+
+Logging In
+----------
+
+Today, you will use a High Throughput Computing system known as the "[PATh Facility](https://path-cc.io/facility/)". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool.
+
+
+You will log in to the access point of the PATh Facility, which is called `ap1.facility.path-cc.io`, using the username you previously created.
+
+To log in, use a [Secure Shell](http://en.wikipedia.org/wiki/Secure_Shell) (SSH) client.
+
+- From a Mac or Linux computer, start the Terminal app and run the `ssh` command below, replacing `<USERNAME>` with your username:
+
+``` hl_lines="1"
+$ ssh <USERNAME>@ap1.facility.path-cc.io
+```
+
+- On Windows, we recommend a free client called [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/),
+  but any SSH client should be fine.
+
+**If you need help finding or using an SSH client, ask the instructors for help right away**!
+
+Running Commands
+----------------
+
+In the exercises, we will show commands that you are supposed to type or copy into the command line, like this:
+
+``` console
+username@ap1 $ hostname
+ap1.facility.path-cc.io
+```
+
+!!! note
+    In the first line of the example above, the `username@ap1 $` part is meant to show the Linux command-line prompt.
+    You do not type this part! Further, your actual prompt probably is a bit different, and that is expected.
+    So in the example above, the command that you type at your own prompt is just the eight characters `hostname`.
+    The second line of the example, without the prompt, shows the output of the command; you do not type this part,
+    either.
+
+Here are a few other commands that you can try (the examples below do not show the output from each command):
+
+``` console
+username@ap1 $ whoami
+username@ap1 $ date
+username@ap1 $ uname -a
+```
+
+A suggestion for the day: try typing into the command line as many of the commands as you can.
+Copy-and-paste is fine, of course, but **you WILL learn more if you take the time to type each command yourself.** + +Organizing Your Workspace +------------------------- + +You will be doing many different exercises over the next few days, many of them on this access point. Each exercise may use many files, once finished. To avoid confusion, it may be useful to create a separate directory for each exercise. + +For instance, for the rest of this exercise, you may wish to create and use a directory named `intro-1.1-login`, or something like that. + +``` console +username@ap1 $ mkdir intro-1.1-login +username@ap1 $ cd intro-1.1-login +``` + +Showing the Version of HTCondor +------------------------------- + +HTCondor is installed on this server. But what version? You can ask HTCondor itself: + +``` console +username@ap1 $ condor_version +$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ +$CondorPlatform: x86_64_AlmaLinux8 $ +``` + +As you can see from the output, we are using HTCondor 10.7.0. + + +Reference Materials +------------------- + +Here are a few links to reference materials that might be interesting after the school (or perhaps during). + +- [HTCondor manuals](https://htcondor.readthedocs.io/en/latest/); it is probably best to read the manual corresponding to the version of HTCondor that you use. That link points to the latest version of the manual, but you can switch versions using the toggle in the lower left corner of that page. diff --git a/docs/materials/htcondor/part1-ex2-commands.md b/docs/materials/htcondor/part1-ex2-commands.md new file mode 100644 index 0000000..47dca37 --- /dev/null +++ b/docs/materials/htcondor/part1-ex2-commands.md @@ -0,0 +1,187 @@ +--- +status: testing +--- + + + +HTC Exercise 1.2: Experiment With HTCondor Commands +=================================================== + +Exercise Goal +------------- + +The goal of this exercise is to learn about two very important HTCondor commands, `condor_q` and `condor_status`. +They will be useful for monitoring your jobs and available execute point slots (respectively) throughout the week. + +This exercise should take only a few minutes. + +Viewing Slots +------------- + +As discussed in the lecture, the `condor_status` command is used to view the current state of slots in an HTCondor pool. + +At its most basic, the command is: + +``` console +username@ap1 $ condor_status +``` + +When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. 
**TIP: You can widen your terminal window, which may help you to see all details of the output better.** + +*Here is some example output (what you see will be longer):* + +``` console +slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 +slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 +slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 +slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 +``` + +This output consists of 8 columns: + +| Col | Example | Meaning | +|:-----------|:-----------------------------|:------------------------------------------------------------------------------------------------------------------------| +| Name | `slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n` | Full slot name (including the hostname) | +| OpSys | `LINUX` | Operating system | +| Arch | `X86_64` | Slot architecture (e.g., Intel 64 bit) | +| State | `Claimed` | State of the slot (`Unclaimed` is available, `Owner` is being used by the machine owner, `Claimed` is matched to a job) | +| Activity | `Busy` | Is there activity on the slot? | +| LoadAv | `0.930` | Load average, a measure of CPU activity on the slot | +| Mem | `1024` | Memory available to the slot, in MB | +| ActvtyTime | `0+02:42:08` | Amount of time spent in current activity (days + hours:minutes:seconds) | + +At the end of the slot listing, there is a summary. Here is an example: + +``` console + Machines Owner Claimed Unclaimed Matched Preempting Drain + + X86_64/LINUX 10831 0 10194 631 0 0 6 + X86_64/WINDOWS 2 2 0 0 0 0 0 + + Total 10833 2 10194 631 0 0 6 +``` + +There is one row of summary for each machine (i.e. "slot") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool. + +### Questions: + +- When you run `condor_status`, how many 64-bit Linux slots are available? (Hint: Unclaimed = available.) +- What percent of the total slots are currently claimed by a job? (Note: there is a rapid turnover of slots, which is what allows users with new submission to have jobs start quickly.) +- How have these numbers changed (if at all) when you run the `condor_status` command again? + +### Viewing Whole Machines, Only + +Also try out the `-compact` for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. + +``` console +username@ap1 $ condor_status -compact +``` + +**How has the column information changed?** + +Viewing Jobs +------------ + +The `condor_q` command lists jobs that are on this access point machine and that are running or waiting to run. The `_q` part of the name is meant to suggest the word “queue”, or list of job sets *waiting* to finish. + +### Viewing Your Own Jobs + +The default behavior of the command lists only your jobs: + +``` console +username@ap1 $ condor_q +``` + +The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set ("batch") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: + +``` console +-- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 09:59:31 +OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS +alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 +``` + +This output consists of 8 (or 9) columns: + +| Col | Example | Meaning | +|:------------|:----------------|:-------------------------------------------------------------------------------------------------------------------------------| +| OWNER | `alice` | The user ID of the user who submitted the job | +| BATCH\_NAME | `run_ffmpeg.sh` | The executable or "jobbatchname" specified within the submit file(s) | +| SUBMITTED | `7/12 09:58` | The date and time when the job was submitted | +| DONE | `_` | Number of jobs in this batch that have completed | +| RUN | `_` | Number of jobs in this batch that are currently running | +| IDLE | `1` | Number of jobs in this batch that are idle, waiting for a match | +| HOLD | `_` | Column will show up if there are jobs on "hold" because something about the submission/setup needs to be corrected by the user | +| TOTAL | `1` | Total number of jobs in this batch | +| JOB\_IDS | `18801.0` | Job ID or range of Job IDs in this batch | + +At the end of the job listing, there is a summary. Here is a sample: + +``` console +1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended +``` + +It shows total counts of jobs in the different possible states. + +**Questions:** + +- For the sample above, when was the job submitted? +- For the sample above, was the job running or not yet? How can you tell? + +### Viewing Everyone’s Jobs + +By default, the `condor_q` command shows **your** jobs only. To see everyone’s jobs that are queued on the machine, add the `-all` option: + +``` console +username@ap1 $ condor_q -all +``` + +- How many jobs are queued in total (i.e., running or waiting to run)? +- How many jobs from this submit machine are running right now? + +### Viewing Jobs without the Default "batch" Mode + +The `condor_q` output, by default, groups "batches" of jobs together (if they were submitted with the same submit file or "jobbatchname"). To see more information for EVERY job on a separate line of output, use the `-nobatch` option to `condor_q`: + +``` console +username@ap1 $ condor_q -all -nobatch +``` + +**How has the column information changed?** (Below is an example of the top of the output.) + +``` console +-- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 11:58:44 + ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD +18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal +18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal +18801.0 alice 7/12 09:58 0+00:00:00 I 0 0.0 run_ffmpeg.sh +18997.0 s16_martincum 7/12 10:59 0+00:00:32 I 0 733.0 runR.pl 1_0 run_perm.R 1 0 10 +19027.5 s16_martincum 7/12 11:06 0+00:09:20 I 0 2198.0 runR.pl 1_5 run_perm.R 1 5 1000 +``` + +The `-nobatch` output shows a line for every job and consists of 8 columns: + +| Col | Example | Meaning | +|:----------|:----------------|:-------------------------------------------------------------------------------| +| ID | `18801.0` | Job ID, which is the `cluster`, a dot character (`.`), and the `process` | +| OWNER | `alice` | The user ID of the user who submitted the job | +| SUBMITTED | `7/12 09:58` | The date and time when the job was submitted | +| RUN\_TIME | `0+00:00:00` | Total time spent running so far (days + hours:minutes:seconds) | +| ST | `I` | Status of job: `I` is Idle (waiting to run), `R` is Running, `H` is Held, etc. 
| +| PRI | `0` | Job priority (see next lecture) | +| SIZE | `0.0` | Current run-time memory usage, in MB | +| CMD | `run_ffmpeg.sh` | The executable command (with arguments) to be run | + +**In future exercises, you'll want to switch between `condor_q` and `condor_q -nobatch` to see different types of information about YOUR jobs.** + +Extra Information +----------------- + +Both `condor_status` and `condor_q` have many command-line options, some of which significantly change their output. +You will explore a few of the most useful options in future exercises, but if you want to experiment now, go ahead! +There are a few ways to learn more about the commands: + +- Use the (brief) built-in help for the commands, e.g.: `condor_q -h` +- Read the installed man(ual) pages for the commands, e.g.: `man condor_q` +- Find the command in [the online manual](https://htcondor.readthedocs.io/en/latest/); **note:** the text online is the same as the `man` text, only formatted for the web + + diff --git a/docs/materials/htcondor/part1-ex3-jobs.md b/docs/materials/htcondor/part1-ex3-jobs.md new file mode 100644 index 0000000..c2f138c --- /dev/null +++ b/docs/materials/htcondor/part1-ex3-jobs.md @@ -0,0 +1,255 @@ +--- +status: testing +--- + + + +HTC Exercise 1.3: Run Jobs! +============================== + +Exercise Goal +------------- + +The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! + +**This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs.** + +Running Your First Job +---------------------- + +Nearly all of the time, when you want to run an HTCondor job, you first write an HTCondor submit file for it. In this section, you will run the same `hostname` command as in Exercise 1.1, but where this command will run within a job on one of the 'execute' servers on the PATh Facility's HTCondor pool. + +First, create an example submit file called `hostname.sub` using your favorite text editor (e.g., `nano`, `vim`) and then transfer the following information to that file: + +``` file +executable = /bin/hostname + +output = hostname.out +error = hostname.err +log = hostname.log + +request_cpus = 1 +request_memory = 1GB +request_disk = 1GB + +queue +``` + +Save your submit file using the name `hostname.sub`. + +!!! note + You can name the HTCondor submit file using any filename. + It's a good practice to always include the `.sub` extension, but it is not required. + This is because the submit file is a simple text file that we are using to pass information to HTCondor. + +The lines of the submit file have the following meanings: + +| Submit Command | Explanation | +|:----------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `executable` | The name of the program to run (relative to the directory from which you submit). | +| `output` | The filename where HTCondor will write the standard output from your job. | +| `error` | The filename where HTCondor will write the standard error from your job. This particular job is not likely to have any, but it is best to include this line for every job. | +| `log` | The filename where HTCondor will write information about your job run. 
While not required, it is a **really** good idea to have a log file for every job. | +| `request_*` | Tells HTCondor how many `cpus` and how much `memory` and `disk` we want, which is not much, because the 'hostname' executable is very small. | +| `queue` | Tells HTCondor to run your job with the settings above. | + +Note that we are not using the `arguments` or `transfer_input_files` lines that were mentioned during lecture because the `hostname` program is all that needs to be transferred from the access point server, and we want to run it without any additional options. + +Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: + +``` console +username@ap1 $ condor_submit hostname.sub +Submitting job(s). +1 job(s) submitted to cluster NNNN. +``` + +The actual cluster number will be shown instead of `NNNN`. **If, instead of the text above, there are error messages, read them carefully and then try to correct your submit file or ask for help.** + +Notice that `condor_submit` returns back to the shell prompt right away. It does **not** wait for your job to run. Instead, as soon as it has finished submitting your job into the queue, the submit command finishes. + +### View your job in the queue + +Now, use `condor_q` and `condor_q -nobatch` to watch for your job in the queue! + +You may not even catch the job in the `R` running state, because the `hostname` command runs very quickly. When the job itself is finished, it will 'leave' the queue and no longer be listed in the `condor_q` output. + +After the job finishes, check for the `hostname` output in `hostname.out`, which is where job information printed to the terminal screen will be printed for the job. + +``` console +username@ap1 $ cat hostname.out +e171.chtc.wisc.edu +``` + +The `hostname.err` file should be empty, unless there were issues running the `hostname` executable after it was transferred to the slot. The `hostname.log` is more complex and will be the focus of a later exercise. + +## Running a Job With Arguments + +Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: + +``` console +username@ap1 $ sleep 60 +``` + +In an HTCondor submit file, the program (or 'executable') name goes in the `executable` statement and **all remaining arguments** go into an `arguments` statement. For example, if the full command is: + +``` console +username@ap1 $ sleep 60 +``` + +Then in the submit file, we would put the location of the "sleep" program (you can find it with `which sleep`) as the job `executable`, and `60` as the job `arguments`: + +``` file +executable = /bin/sleep +arguments = 60 +``` + +Let’s try a job submission with arguments. We will use the `sleep` command shown above, which does nothing (i.e., puts the job to sleep) for the specified number of seconds, then exits normally. It is convenient for simulating a job that takes a while to run. + +Create a new submit file and save the following text in it. + +``` file +executable = /bin/sleep +arguments = 60 + +output = sleep.out +error = sleep.err +log = sleep.log + +request_cpus = 1 +request_memory = 1GB +request_disk = 1GB + +queue +``` +You can save the file using any name, but as a reminder, we recommend it uses the `.sub` file extension. + +Except for changing a few filenames, this submit file is nearly identical to the last one, except for the addition of the `arguments` line. + +Submit this new job to HTCondor. 
Again, watch for it to run using `condor_q` and `condor_q -nobatch`; +check once every 15 seconds or so. +Once the job starts running, it will take about 1 minute to run (reminder: the `sleep` command is telling the job to do nothing for 60 seconds), +so you should be able to see it running for a bit. +When the job finishes, it will disappear from the queue, but there will be no output in the output or error files, because `sleep` does not produce any output. + +Running a Script Job From the Submit Directory +---------------------------------------------- + +So far, we have been running programs (executables) that come with the standard Linux system. +More frequently, you will want to run a program that exists within your directory +or perhaps a shell script of commands that you'd like to run within a job. In this example, you will write a shell script and a submit file that runs the shell script within a job: + +1. Put the following contents into a file named `test-script.sh`: + + :::bash + #!/bin/sh + # START + echo 'Date: ' `date` + echo 'Host: ' `hostname` + echo 'System: ' `uname -spo` + echo "Program: $0" + echo "Args: $*" + echo 'ls: ' `ls` + # END + +1. Add executable permissions to the file (so that it can be run as a program): + + :::console + username@ap1 $ chmod +x test-script.sh + +1. Test your script from the command line: + + :::console + username@ap1 $ ./test-script.sh hello 42 + Date: Mon Jul 17 10:02:20 CDT 2017 + Host: learn.chtc.wisc.edu + System: Linux x86_64 GNU/Linux + Program: ./test-script.sh + Args: hello 42 + ls: hostname.sub montage hostname.err hostname.log hostname.out test-script.sh + + This step is **really** important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. + Further, debugging problems like this one is surprisingly difficult. + So, if possible, test your `executable` and `arguments` as a command at the command-line first. + +1. Write the submit file (this should be getting easier by now): + + :::file + executable = test-script.sh + arguments = foo bar baz + + output = script.out + error = script.err + log = script.log + + request_cpus = 1 + request_memory = 1GB + request_disk = 1GB + + queue + + In this example, the `executable` that was named in the submit file did **not** start with a `/`, + so the location of the file is relative to the submit directory itself. + In other words, in this format the executable must be in the same directory as the submit file. + + !!! note + Blank lines between commands and spaces around the `=` do not matter to HTCondor. + For example, this submit file is equivalent to the one above: + + :::file + executable = test-script.sh + arguments = foo bar baz + output = script.out + error = script.err + log = script.log + + request_cpus=1 + request_memory=1GB + request_disk=1GB + + queue + + Use whitespace to make things clear to **you**, the user. + +1. Submit the job, wait for it to finish, and check the standard output file (and standard error file, which should be empty). + + What do you notice about the lines returned for "Program" and "ls"? Remember that only files pertaining + to **this** job will be in the job working directory on the execute point server. You're also seeing the effects + of HTCondor's need to standardize some filenames when running your job, though they are named as you expect + in the submission directory (per the submit file contents). + +## Extra Challenge + +!!! 
note + There are Extra Challenges throughout the school curriculum. You may be better off coming back to these after you've completed all other exercises for your current working session. + +Below is a Python script that does something similar to the shell script above. Run this Python script using HTCondor. + +```python +#!/usr/bin/env python3 + +"""Extra Challenge for OSG School +Written by Tim Cartwright +Submitted to CHTC by #YOUR_NAME# +""" + +import getpass +import os +import platform +import socket +import sys +import time + +arguments = None +if len(sys.argv) > 1: + arguments = '"' + ' '.join(sys.argv[1:]) + '"' + +print(__doc__, file=sys.stderr) +print('Time :', time.strftime('%Y-%m-%d (%a) %H:%M:%S %Z')) +print('Host :', getpass.getuser(), '@', socket.gethostname()) +uname = platform.uname() +print("System :", uname[0], uname[2], uname[4]) +print("Version :", platform.python_version()) +print("Program :", sys.executable) +print('Script :', os.path.abspath(__file__)) +print('Args :', arguments) +``` diff --git a/docs/materials/htcondor/part1-ex4-logs.md b/docs/materials/htcondor/part1-ex4-logs.md new file mode 100644 index 0000000..d5a1bc7 --- /dev/null +++ b/docs/materials/htcondor/part1-ex4-logs.md @@ -0,0 +1,142 @@ +--- +status: testing +--- + + + +HTC Exercise 1.4: Read and Interpret Log Files +================================================= + +Exercise Goal +----------------- + +The goal of this exercise is to learn how to understand the contents of a job's log file, which is essentially a "history" of the steps HTCondor took to run your job. +If you suspect something has gone wrong with your job, the log is the a great place to start looking for indications of whether things might have gone wrong (in addition to the .err file). + +This exercise is short, but you'll want to at least read over it before moving on. + +Reading a Log File +------------------ + +For this exercise, we can examine a log file for any previous job that you have run. The example output below is based on the `sleep 60` job. + +A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are **all** of the event headings from the `sleep` job log (detailed output in between headings has been omitted here): + +``` file +000 (5739.000.000) 2023-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> +040 (5739.000.000) 2023-07-10 10:45:10 Started transferring input files +040 (5739.000.000) 2023-07-10 10:45:10 Finished transferring input files +001 (5739.000.000) 2023-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> +006 (5739.000.000) 2023-07-10 10:45:20 Image size of job updated: 72 +040 (5739.000.000) 2023-07-10 10:45:20 Started transferring output files +040 (5739.000.000) 2023-07-10 10:45:20 Finished transferring output files +006 (5739.000.000) 2023-07-10 10:46:11 Image size of job updated: 4072 +005 (5739.000.000) 2023-07-10 10:46:11 Job terminated. +``` + +There is a lot of extra information in those lines, but you can see: + +- The job ID: cluster 5739, process 0 (written `000`) +- The date and local time of each event +- A brief description of the event: submission, execution, some information updates, and termination + +Some events provide no information in addition to the heading. For example: + +``` file +000 (5739.000.000) 2020-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> +... +``` + +!!! 
note + Each event ends with a line that contains only 3 dots: `...` + +However, some lines have additional information to help you quickly understand where and how your job is running. For example: + +``` file +001 (5739.000.000) 2020-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> + SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w + CondorScratchDir = "/pilot/osgvo-pilot-2q71K9/execute/dir_9316" + Cpus = 1 + Disk = 174321444 + GLIDEIN_ResourceName = "WISC-PATH-IDPL-EP" + GPUs = 0 + Memory = 8192 +... +``` +- The `SlotName` is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in `GLIDEIN_ResourceName` +- The `CondorScratchDir` is the name of the scratch directory that was created by HTCondor for your job to run inside +- The `Cpu`, `GPUs`, `Disk`, and `Memory` values provide the maximum amount of each resource your job has used while running + +Another example of is the periodic update: + +``` file +006 (5739.000.000) 2020-07-10 10:45:20 Image size of job updated: 72 + 1 - MemoryUsage of job (MB) + 72 - ResidentSetSize of job (KB) +... +``` + +These updates record the amount of memory that the job is using on the execute machine. This can be helpful information, so that in future runs of the job, you can tell HTCondor how much memory you will need. + +The job termination event includes a lot of very useful information: + +``` file +005 (5739.000.000) 2023-07-10 10:46:11 Job terminated. + (1) Normal termination (return value 0) + Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage + 0 - Run Bytes Sent By Job + 27848 - Run Bytes Received By Job + 0 - Total Bytes Sent By Job + 27848 - Total Bytes Received By Job + Partitionable Resources : Usage Request Allocated + Cpus : 1 1 + Disk (KB) : 40 30 4203309 + Memory (MB) : 1 1 1 +Job terminated of its own accord at 2023-07-10 10:46:11 with exit-code 0. +... +``` + +Probably the most interesting information is: + +- The `return value` or `exit code` (`0` here, means the executable completed and didn't indicate any internal errors; non-zero usually means failure) +- The total number of bytes transferred each way, which could be useful if your network is slow +- The `Partitionable Resources` table, especially disk and memory usage, which will inform larger submissions. + +There are many other kinds of events, but the ones above will occur in almost every job log. + + +Understanding When Job Log Events Are Written +--------------------------------------------- + +When are events written to the job log file? Let’s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive. + +1. Change the `sleep` job submit file, so that the job sleeps for 2 minutes (= 120 seconds) +1. Submit the updated sleep job +1. As soon as the `condor_submit` command finishes, hit the return key a few times, to create some blank lines +1. Right away, run a command to show the log file and **keep showing** updates as they occur: + + :::console + username@ap1 $ tail -f sleep.log + +1. Watch the output carefully. When do events appear in the log file? +1. After the termination event appears, press Control-C to end the `tail` command and return to the shell prompt. 
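
For step 1 above, the only change needed is the `arguments` value. A minimal sketch of the updated submit file, reusing the filenames from the earlier `sleep` job, might look like this:

``` file
executable = /bin/sleep
arguments = 120

output = sleep.out
error = sleep.err
log = sleep.log

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue
```
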
+ + +Understanding How HTCondor Writes Files +--------------------------------------- + +When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let’s find out! + +For this exercise, we can use the `hostname` job from earlier. + +1. Edit the `hostname` submit file so that it uses new and unique filenames for output, error, and log files. +Alternatively, delete any existing output, error, and log files from previous runs of the `hostname` job. +1. Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture) +1. Wait for all three jobs to finish +1. Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines? +1. Examine the log file… carefully: What happened there? Pay close attention to the times and job IDs of the events. + +For further clarification about how HTCondor handles these files, reach out to your mentor or one of the other school staff. diff --git a/docs/materials/htcondor/part1-ex5-request.md b/docs/materials/htcondor/part1-ex5-request.md new file mode 100644 index 0000000..7a57bb2 --- /dev/null +++ b/docs/materials/htcondor/part1-ex5-request.md @@ -0,0 +1,145 @@ +--- +status: testing +--- + + + +HTC Exercise 1.5: Declare Resource Needs +=========================================== + +The goal of this exercise is to demonstrate how to test and tune the `request_X` statements in a submit file for when you don't know what resources your job needs. + +There are three special resource request statements that you can use (optionally) in an HTCondor submit file: + +- `request_cpus` for the number of CPUs your job will use. A value of "1" is always a great starting point, but some software can use more than "1" (however, most softwares will use an argument to control this number). +- `request_memory` for the maximum amount of run-time memory your job may use. +- `request_disk` for the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job). + +HTCondor defaults to certain reasonable values for these request settings, so you do not need to use them to get *small* jobs to run. +However, it is in **YOUR** best interest to always estimate resource requests before submitting any job, and to definitely tune your requests before submitting multiple jobs. In many HTCondor pools: + +- If your job goes over the request values, it may be removed from the execute machine and held (status 'H' in the `condor_q` output, awaiting action on your part) without saving any partial job output files. So it is a disadvantage to not declare your resource needs or if you underestimate them. +- Conversely, if you overestimate them by too much, your jobs will match to fewer slots and take longer to match to a slot to begin running. Additionally, by hogging up resources that you don't need, other users may be deprived of the resources they require. In the long run, it works better for all users of the pool if you declare what you really need. + +But how do you know what to request? In particular, we are concerned with memory and disk here; requesting multiple CPUs and using them is covered a bit in later school materials, but true HTC splits work up into jobs that each use as few CPU cores as possible (one CPU core is always best to have the most jobs running). 
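
As a concrete starting point, the values used in the earlier exercises are reasonable placeholders while you gather better numbers; the rest of this exercise covers how to refine the memory and disk values:

``` file
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
```
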
+ +Determining Resource Needs Before Running Any Jobs +-------------------------------------------------- + +!!! note + If you are running short on time, you can skip to "Determining Resource Needs By Running Test Jobs", below, but try to come back and read over this part at some point. + +It can be very difficult to predict the memory needs of your running program without running tests. Typically, the memory size of a job changes over time, making the task even trickier. +If you have knowledge ahead of time about your job’s maximum memory needs, use that, or maybe a number that's just a bit higher, to ensure your job has enough memory to complete. If this is your first time running your job, you can request a fairly large amount of memory (as high as what's on your laptop or other server, if you know your program can run without crashing) for a first test job, OR you can run the program locally and "watch" it: + +### Examining a Running Program on a Local Computer + +When working on a shared access point, you should not run computationally-intensive work because it can use resources needed by HTCondor to manage the queue for all uses. +However, you may have access to other computers (your laptop, for example, or another server) where you can observe the memory usage of a program. The downside is that you'll have to watch a program run for essentially the entire time, to make sure you catch the maximum memory usage. + +#### For Memory: + +On Mac and Windows, for example, the "Activity Monitor" and "Task Manager" applications may be useful. On a Mac or Linux system, you can use the `ps` command or the `top` command in the Terminal to watch a running program and see (roughly) how much memory it is using. Full coverage of these tools is beyond the scope of this exercise, but here are two quick examples: + +Using `ps`: + +``` console +username@learn $ ps ux +USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND +alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 +alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash +alice 25864 0.0 0.0 65624 996 pts/0 R+ 13:52 0:00 ps ux +alice 30052 0.0 0.0 90720 2456 ? S Jun22 0:00 sshd: alice@pts/2 +alice 30053 0.0 0.0 66096 1624 pts/2 Ss+ Jun22 0:00 -bash +``` + +The Resident Set Size (`RSS`) column, highlighted above, gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value. + +Using `top`: + +``` console +username@ap1 $ top -u +top - 13:55:31 up 11 days, 20:59, 5 users, load average: 0.12, 0.12, 0.09 +Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie +Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 98.5%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st +Mem: 4001440k total, 3558028k used, 443412k free, 258568k buffers +Swap: 4194296k total, 148k used, 4194148k free, 2960760k cached + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND +24342 alice 15 0 90224 1864 1096 S 0.0 0.0 0:00.26 sshd +24343 alice 15 0 66096 1580 1232 S 0.0 0.0 0:00.07 bash +25927 alice 15 0 12760 1196 836 R 0.0 0.0 0:00.01 top +30052 alice 16 0 90720 2456 1112 S 0.0 0.1 0:00.69 sshd +30053 alice 18 0 66096 1624 1236 S 0.0 0.0 0:00.37 bash +``` + +The `top` command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter `q` to quit the interactive display. Again, the highlighted `RES` column shows an approximation of memory usage. 
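
If you would rather not watch `top` interactively, a small shell loop can sample the `RSS` value for you and report the peak when it finishes. This is only a sketch: the process ID (`12345` below) and the sampling window are hypothetical, so adjust them to match your own program's run time.

``` shell
# Sample the resident set size (KB) of process 12345 once per second
# for up to 10 minutes, then print the largest value seen.
for i in $(seq 1 600); do
    ps -o rss= -p 12345 || break   # stop sampling once the process exits
    sleep 1
done | sort -n | tail -n 1
```
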
+ +#### For Disk: +Determining disk needs may be a bit easier, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts **everything** in your job sandbox toward your job’s disk usage: + +- The executable itself +- All "input" files (anything else that gets transferred TO the job, even if you don't think of it as "input") +- All files created during the job (broadly defined as "output"), including the captured standard output and error files that you list in the submit file. +- All temporary files created in the sandbox, even if they get deleted by the executable before it's done. + +If you can run your program within a single directory on a local computer (not on the access point), you should be able to view files and their sizes with the `ls` and `du` commands. + +Determining Resource Needs By Running Test Jobs (BEST) +------------------------------------------------------ + +Despite the techniques mentioned above, by far the easiest approach to measuring your job’s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs. + +For example, here is a strange Python script that does not do anything useful, but consumes some real resources while running: + +``` python +#!/usr/bin/env python3 +import time +import os +size = 1000000 +numbers = [] +for i in range(size): numbers.append(str(i)) +with open('numbers.txt', 'w') as tempfile: + tempfile.write(' '.join(numbers)) +time.sleep(60) +``` + +Without trying to figure out what this code does or how many resources it uses, create a submit file for it, +and run it once with HTCondor, starting with somewhat high memory requests ("1GB" for memory and disk is a good starting point, unless you think the job will use far more). +When it is done, examine the log file. In particular, we care about these lines: + +``` file + Partitionable Resources : Usage Request Allocated + Cpus : 1 1 + Disk (KB) : 6739 1048576 8022934 + Memory (MB) : 3 1024 1024 +``` + +So, now we know that HTCondor saw that the job used 6,739 KB of disk (= about 6.5 MB) and 3 MB of memory! + +This is a great technique for determining the real resource needs of your job. If you think resource needs vary from run to run, submit a few sample jobs and look at all the results. You should round up your resource requests a little, just in case your job occasionally uses more resources. + +Setting Resource Requirements +----------------------------- + +Once you know your job’s resource requirements, it is easy to declare them in your submit file. For example, taking our results above as an example, we might slightly increase our requests above what was used, just to be safe: + +```hl_lines="1 3" +# rounded up from 3 MB +request_memory = 4MB +# rounded up from 6.5 MB +request_disk = 7MB +``` + +Pay close attention to units: + +- Without explicit units, `request_memory` is in MB (megabytes) +- Without explicit units, `request_disk` is in KB (kilobytes) +- Allowable units are `KB` (kilobytes), `MB` (megabytes), `GB` (gigabytes), and `TB` (terabytes) + +HTCondor translates these requirements into attributes that become part of the job's `requirements` expression. However, do not put your CPU, memory, and disk requirements directly into the `requirements` expression; use the `request_XXX` statements instead. 
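
Putting the pieces together, a sketch of a complete submit file for the Python test script above might look like the following. The script filename (`resource_test.py` here) is just an example; use whatever name you gave the file when you saved it.

``` file
executable = resource_test.py

output = resource_test.out
error = resource_test.err
log = resource_test.log

request_cpus = 1
# rounded up from 3 MB
request_memory = 4MB
# rounded up from 6.5 MB
request_disk = 7MB

queue
```
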
+ +**If you still have time in this working session, Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used.** + +After changing the requirements in your submit file, did your job run successfully? If not, why? +(Hint: HTCondor polls a job's resource use on a timer. How long are these jobs running for?) diff --git a/docs/materials/htcondor/part1-ex6-remove.md b/docs/materials/htcondor/part1-ex6-remove.md new file mode 100644 index 0000000..8431e65 --- /dev/null +++ b/docs/materials/htcondor/part1-ex6-remove.md @@ -0,0 +1,64 @@ +--- +status: testing +--- + + + +HTC Exercise 1.6: Remove Jobs From the Queue +=============================================== + +## Exercise Goal + +The goal of this exercise is to show you how to remove jobs from the queue. This is helpful if you make a mistake, do not want to wait for a job to complete, or otherwise need to fix things. For example, if some test jobs go on hold for using too much memory or disk, you may want to just remove them, edit the submit files, and then submit again. + +**Skip this exercise and come back to it if you are short on time, or until you need to remove jobs for other exercises** + +!!! note + Please remember to remove any jobs from the queue that you are no longer interested in. Otherwise, the queue will start to get very long with jobs that will waste resources (and decrease your priority), or that may never run (if they're on hold, or have other issues keeping them from matching). + +This exercise is very short, but if you are out of time, you can come back to it later. + +## Removing a Job or Cluster From the Queue + +To practice removing jobs from the queue, you need a job in the queue! + +1. Submit a job from an earlier exercise +1. Determine the job ID (`cluster.process`) from the `condor_submit` output or from `condor_q` +1. Remove the job: + + :::console + username@ap1 $ condor_rm + + Use the full job ID this time, e.g. `5759.0`. + +1. Did the job leave the queue immediately? If not, about how long did it take? + +So far, we have created job clusters that contain only one job process (the `.0` part of the job ID). That will change soon, so it is good to know how to remove a specific job ID. However, it is possible to remove all jobs that are part of a cluster at once. Simply omit the job process (the `.0` part of the job ID) in the `condor_rm` command: + +``` console +username@ap1 $ condor_rm +``` + +Finally, you can include many job clusters and full job IDs in a single `condor_rm` command. For example: + +``` console +username@ap1 $ condor_rm 5768 5769 5770.0 5771.2 +``` + +## Removing All of Your Jobs + +If you really want to remove all of your jobs at once, you can do that with: + +```console +username@ap1 $ condor_rm +``` + +If you want to test it: (optional, though you'll likely need this in the future) + +1. Quickly submit several jobs from past exercises +1. View the jobs in the queue with `condor_q` +1. Remove them all with the above command +1. Use `condor_q` to track progress + +In case you are wondering, you can remove only your own jobs. +HTCondor administrators can remove anyone’s jobs, so be nice to them. 
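
In summary, the forms of `condor_rm` used in this exercise look like the following; the job IDs and the username shown are hypothetical examples, so substitute your own:

``` console
username@ap1 $ condor_rm 5759.0     # remove one specific job
username@ap1 $ condor_rm 5759       # remove every job in cluster 5759
username@ap1 $ condor_rm username   # remove all of your own jobs
```
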
diff --git a/docs/materials/htcondor/part1-ex7-compile.md b/docs/materials/htcondor/part1-ex7-compile.md new file mode 100644 index 0000000..94a32c7 --- /dev/null +++ b/docs/materials/htcondor/part1-ex7-compile.md @@ -0,0 +1,95 @@ +--- +status: testing +--- + + + +HTC Bonus Exercise 1.7: Compile and Run Some C Code +====================================================== + +The goal of this exercise is to show that compiled code works just fine in HTCondor. It is mainly of interest to people who have their own C code to run (or C++, or really any compiled code, although Java would be handled a bit differently). + +Preparing a C Executable +------------------------ + +When preparing a C program for HTCondor, it is best to compile and link the executable statically, so that it does not depend on external libraries and their particular versions. Why is this important? When your compiled C program is sent to another machine for execution, that machine may not have the same libraries that you have on your submit machine (or wherever you compile the program). If the libraries are not available or are the wrong versions, your program may fail or, perhaps worse, silently produce the wrong results. + +Here is a simple C program to try using (thanks, Alain Roy): + +``` c +#include +#include +#include + +int main(int argc, char **argv) +{ + int sleep_time; + int input; + int failure; + + if (argc != 3) { + printf("Usage: simple \n"); + failure = 1; + } else { + sleep_time = atoi(argv[1]); + input = atoi(argv[2]); + + printf("Thinking really hard for %d seconds...\n", sleep_time); + sleep(sleep_time); + printf("We calculated: %d\n", input * 2); + failure = 0; + } + return failure; +} +``` + +Save that code to a file, for example, `simple.c`. + +Compile the program with static linking: + +``` console +username@learn $ gcc -static -o simple simple.c +``` + +As always, test that you can run your command from the command line first. First, without arguments to make sure it fails correctly: + +``` console +username@learn $ ./simple +``` + +and then with valid arguments: + +``` console +username@learn $ ./simple 5 21 +``` + +Running a Compiled C Program +---------------------------- + +Running the compiled program is no different than running any other program. Here is a submit file for the C program (call it simple.sub): + +``` file +executable = simple +arguments = "60 64" + +output = c-program.out +error = c-program.err +log = c-program.log + +should_transfer_files = YES +when_to_transfer_output = ON_EXIT + +request_cpus = 1 +request_memory = 1GB +request_disk = 1MB + +queue +``` + +Then submit the job as usual! + +In summary, it is easy to work with statically linked compiled code. +It **is** possible to handle dynamically linked compiled code, but it is trickier. +We will only mention this topic briefly during the lecture on Software. + + diff --git a/docs/materials/htcondor/part1-ex8-queue.md b/docs/materials/htcondor/part1-ex8-queue.md new file mode 100644 index 0000000..0cfc079 --- /dev/null +++ b/docs/materials/htcondor/part1-ex8-queue.md @@ -0,0 +1,234 @@ +--- +status: testing +--- + + + +Bonus HTC Exercise 1.8: Explore condor_q +====================================== + +The goal of this exercise is try out some of the most common options to the `condor_q` command, so that you can view jobs effectively. + +The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a `condor_q` expert! 
+ +Selecting Jobs +-------------- + +The `condor_q` program has many options for selecting which jobs are listed. You have already seen that the default mode is to show only your jobs in "batch" mode: + +``` console +username@learn $ condor_q +``` + +You've seen that you can view all jobs (all users) in the submit node's queue by using the `-all` argument: + +``` console +username@learn $ condor_q -all +``` + +And you've seen that you can view more details about queued jobs, with each separate job on a single line using the `-nobatch` option: + +``` console +username@learn $ condor_q -nobatch +username@learn $ condor_q -all -nobatch +``` + +Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? + +``` console +username@learn $ condor_q +``` + +To list just the jobs associated with a single cluster number: + +``` console +username@learn $ condor_q +``` + +For example, if you want to see the jobs in cluster 5678 (i.e., `5678.0`, `5678.1`, etc.), you use `condor_q 5678`. + +To list a specific job (i.e., cluster.process, as in 5678.0): + +``` console +username@learn $ condor_q +``` + +For example, to see job ID 5678.1, you use `condor_q 5678.1`. + +!!! note + You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for + **all** of the named clusters and/or job IDs are listed. + +Let’s get some practice using `condor_q` selections! + +1. Using a previous exercise, submit several `sleep` jobs. +1. List all jobs in the queue — are there others besides your own? +1. Practice using all forms of `condor_q` that you have learned: + - List just your jobs, with and without batching. + - List a specific cluster. + - List a specific job ID. + - Try listing several users at once. + - Try listing several clusters and job IDs at once. +1. When there are a variety of jobs in the queue, try combining a username and a different user's cluster or job ID in the same command — what happens? + +Viewing a Job ClassAd +--------------------- + +You may have wondered why it is useful to be able to list a single job ID using `condor_q`. By itself, it may not be that useful. But, in combination with another option, it is very useful! + +If you add the `-long` option to `condor_q` (or its short form, `-l`), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80–90 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: + +``` console +username@learn $ condor_q -long +``` + +The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. 
But here are some examples of common, interesting attributes taken directly from `condor_q` output (except with some line breaks added to the `Requirements` attribute): + +``` file +MyType = "Job" +Err = "sleep.err" +UserLog = "/home/cat/intro-2.1-queue/sleep.log" +Requirements = ( IsOSGSchoolSlot =?= true ) && + ( TARGET.Arch == "X86_64" ) && + ( TARGET.OpSys == "LINUX" ) && + ( TARGET.Disk >= RequestDisk ) && + ( TARGET.Memory >= RequestMemory ) && + ( TARGET.HasFileTransfer ) +ClusterId = 2420 +WhenToTransferOutput = "ON_EXIT" +Owner = "cat" +CondorVersion = "$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $" +Out = "sleep.out" +Cmd = "/bin/sleep" +Arguments = "120" +``` + +!!! note + Attributes are listed in no particular order and may change from time to time. + Do not assume anything about the order of attributes in `condor_q` output. + +**See what you can find in a job ClassAd from your own job.** + +1. Using a previous exercise, submit a `sleep` job that sleeps for at least 3 minutes (180 seconds). +1. Before the job executes, capture its ClassAd and save to a file: + + :::console + condor_q -l > classad-1.txt + +1. After the job starts execution but before it finishes, capture its ClassAd again and save to a file + + :::console + condor_q -l > classad-2.txt + +Now examine each saved ClassAd file. Here are a few things to look for: + +- Can you find attributes that came from your submit file? (E.g., Cmd, Arguments, Out, Err, UserLog, and so forth) +- Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements) +- How many of the following attributes can you guess the meaning of? + - DiskUsage + - ImageSize + - BytesSent + - JobStatus + +Why Is My Job Not Running? +-------------------------- + +Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help. + +To ask HTCondor why your job is not running, add the `-better-analyze` option to `condor_q` for the specific job. For example, for job ID 2423.0, the command is: + +``` console +username@learn $ condor_q -better-analyze 2423.0 +``` + +Of course, replace the job ID with your own. + +Let’s submit a job that will never run and see what happens. Here is the submit file to use: + +``` file +executable = /bin/hostname +output = norun.out +error = norun.err +log = norun.log +should_transfer_files = YES +when_to_transfer_output = ON_EXIT +request_disk = 10MB +request_memory = 8TB +queue +``` + +(Do you see what I did?) + +1. Save and submit this file. +1. Run `condor_q -better-analyze` on the job ID. + +There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with some lines omitted): + +``` file + +-- Schedd: learn.chtc.wisc.edu : <128.104.100.148:9618?... +... + +Job 98096.000 defines the following attributes: + + RequestDisk = 10240 + RequestMemory = 8388608 + +The Requirements expression for job 98096.000 reduces to these conditions: + + + Slots +Step Matched Condition +----- -------- --------- +[1] 11227 Target.OpSysMajorVer == 7 +[9] 13098 TARGET.Disk >= RequestDisk +[11] 0 TARGET.Memory >= RequestMemory + +No successful match recorded. +Last failed match: Fri Jul 12 15:36:30 2019 + +Reason for last match failure: no match found + +98096.000: Run analysis summary ignoring user priority. 
Of 710 machines, + 710 are rejected by your job's requirements + 0 reject your job because of their own requirements + 0 match and are already running your jobs + 0 match but are serving other users + 0 are able to run your job +... +``` + +At the end of the summary, `condor_q` provides a breakdown of how **machines** and their own requirements match against my own job's requirements. 710 total machines were considered above, and **all** of them were rejected based on **my job's requirements**. In other words, I am asking for something that is not available. But what? + +Further up in the output, there is an analysis of the job's requirements, along with how many slots within the pool match each of those requirements. The example above reports that 13098 slots match our small disk request request, but **none** of the slots matched the `TARGET.Memory >= RequestMemory` condition. The output also reports the value used for the `RequestMemory` attribute: my job asked for **8 terabytes** of memory (8,388,608 MB) -- of course no machines matched that part of the expression! That's a lot of memory on today's machines. + +The output from `condor_q -analyze` (and `condor_q -better-analyze`) may be helpful or it may not be, depending on your exact case. The example above was constructed so that it would be obvious what the problem was. But in many cases, this is a good place to start looking if you are having problems matching. + +Bonus: Automatic Formatting Output +---------------------------------- + +**Do this exercise only if you have time, though it's pretty awesome!** + +There is a way to select the specific job attributes you want `condor_q` to tell you about with the `-autoformat` or `-af` option. In this case, HTCondor decides for you how to format the data you ask for from job ClassAd(s). +(To tell HTCondor how to specially format this information, yourself, you could use the `-format` option, which we're not covering.) + +To use autoformatting, use the `-af` option followed by the attribute name, for each attribute that you want to output: + +``` console +username@learn $ condor_q -all -af Owner ClusterId Cmd +moate 2418 /share/test.sh +cat 2421 /bin/sleep +cat 2422 /bin/sleep +``` + +**Bonus Question**: If you wanted to print out the `Requirements` expression of a job, how would you do that with `-af`? Is the output what you expected? (HINT: for ClassAd attributes like "Requirements" that are long expressions, instead of plain values, you can use `-af:r` to view the expressions, instead of what it's current evaluation.) + +References +---------- + +As suggested above, if you want to learn more about `condor_q`, you can do some reading: + +- Read the `condor_q` man page or HTCondor Manual section (same text) to learn about more options +- Read about ClassAd attributes in Appendix A of the HTCondor Manual + + diff --git a/docs/materials/htcondor/part1-ex9-status.md b/docs/materials/htcondor/part1-ex9-status.md new file mode 100644 index 0000000..b5bd884 --- /dev/null +++ b/docs/materials/htcondor/part1-ex9-status.md @@ -0,0 +1,142 @@ +--- +status: in progress +--- + + + +Bonus HTC Exercise 1.9: Explore condor_status +=========================================== + +The goal of this exercise is try out some of the most common options to the `condor_status` command, so that you can view slots effectively. + +The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a `condor_status` expert! 
+ +Selecting Slots +--------------- + +The `condor_status` program has many options for selecting which slots are listed. You've already learned the basic `condor_status` and the `condor_status -compact` variation (which you may wish to retry now, before proceeding). + +Another convenient option is to list only those slots that are available now: + +``` console +username@learn $ condor_status -avail +``` + +Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all `condor_status` output, not just with the `-avail` option. + +Similar to `condor_q`, you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: + +``` console +username@learn $ condor_status +``` + +For example, if you want to see the slots on `e2337.chtc.wisc.edu` (in the CHTC pool): + +``` console +username@learn $ condor_status e2337.chtc.wisc.edu +``` + +To list a specific slot on a machine: + +``` console +username@learn $ condor_status @ +``` + +For example, to see the “first” slot on the machine above: + +``` console +username@learn $ condor_status slot1@e2337.chtc.wisc.edu +``` + +!!! note + You can name more than one hostname, slot, or combination thereof on the command line, in which case slots for + **all** of the named hostnames and/or slots are listed. + +Let’s get some practice using `condor_status` selections! + +1. List all slots in the pool — how many are there total? +1. Practice using all forms of `condor_status` that you have learned: + - List the available slots. + - List the slots on a specific machine (e.g., `e2337.chtc.wisc.edu`). + - List a specific slot from that machine. + - Try listing the slots from a few (but not all) machines at once. + - Try using a mix of hostnames and slot IDs at once. + +Viewing a Slot ClassAd +---------------------- + +Just as with `condor_q`, you can use `condor_status` to view the complete ClassAd for a given slot (often confusingly called the “machine” ad): + +``` console +username@learn $ condor_status -long @ +``` + +Because slot ClassAds may have 150–200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. + +Here are some examples of common, interesting attributes taken directly from `condor_status` output: + +``` file +OpSys = "LINUX" +DetectedCpus = 24 +OpSysAndVer = "SL6" +MyType = "Machine" +LoadAvg = 0.99 +TotalDisk = 798098404 +OSIssue = "Scientific Linux release 6.6 (Carbon)" +TotalMemory = 24016 +Machine = "e242.chtc.wisc.edu" +CondorVersion = "$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $" +Memory = 1024 +``` + +As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name “machine ad”) and about the slot in particular. + +Go ahead and examine a machine ClassAd now. I suggest looking at one of the slots on, say, `e2337.chtc.wisc.edu` because of its relatively simple configuration. + +Viewing Slots by ClassAd Expression +----------------------------------- + +Often, it is helpful to view slots that meet some particular criteria. For example, if you know that your job needs a lot of memory to run, you may want to see how many high-memory slots there are and whether they are busy. You can filter the list of slots like this using the `-constraint` option and a ClassAd expression. 
+ +For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: + +``` console +username@learn $ condor_status -constraint 'OpSysAndVer == "CentOS7" && Memory >= 16000' +``` + +!!! note + Be very careful with using quote characters appropriately in these commands. + In the example above, the single quotes (`'`) are for the shell, so that the entire expression is passed to + `condor_status` untouched, and the double quotes (`"`) surround a string value within the expression itself. + +Currently on CHTC, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). + +If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the `condor_status -constraint` command. + +!!! note + The `condor_q` command accepts the `-constraint` option as well! + As you might expect, the option allows you to limit the jobs that are listed based on a ClassAd expression. + +Bonus: Formatting Output +---------------------------- + +The `condor_status` command accepts the same `-autoformat` (`-af`) options that `condor_q` accepts, and the options have the same meanings in both commands. Of course, the attributes available in machine ads may differ from the ones that are available in job ads. Use the HTCondor Manual or look at individual slot ClassAds to get a better idea of what attributes are available. + +For example, I was curious about the host name and operating system of the slots with more than 32GB of memory: + +``` console +username@learn $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000' +``` + +If you like, spend a few minutes now or later experimenting with `condor_status` formatting. + +References +---------- + +As suggested above, if you want to learn more about `condor_q`, you can do some reading: + +- Read the `condor_status` man page or HTCondor Manual section (same text) to learn about more options +- Read about [ClassAd attributes](https://htcondor.readthedocs.io/en/latest/classad-attributes/index.html) in the appendix of the HTCondor Manual +- Read about [ClassAd expressions](https://htcondor.readthedocs.io/en/latest/misc-concepts/classad-mechanism.html#old-classads-in-the-htcondor-system) in section 4.1.4 of the HTCondor Manual + + diff --git a/docs/materials/htcondor/part2-ex1-files.md b/docs/materials/htcondor/part2-ex1-files.md new file mode 100644 index 0000000..e9d4724 --- /dev/null +++ b/docs/materials/htcondor/part2-ex1-files.md @@ -0,0 +1,186 @@ +--- +status: testing +--- + + + +HTC Exercise 2.1: Work With Input and Output Files +===================================================== + +Exercise Goal +------------- + +The goal of this exercise is make input files available to your job on the execute machine and to return output files back created in your job back to you on the access point. This small change significantly adds to the kinds of jobs that you can run. + +Viewing a Job Sandbox +--------------------- + +Before you learn to transfer files to and from your job, it is good to understand a bit more about the environment in which your job runs. 
+When the HTCondor `starter` process prepares to run your job, it creates a new directory for your job and all of its files. +We call this directory the *job sandbox*, because it is your job’s private space to play. +Let’s see what is in the job sandbox for a minimal job with no special input or output files. + +1. Save the script below in a file named `sandbox.sh`: + + :::bash + #!/bin/sh + echo 'Date: ' `date` + echo 'Host: ' `hostname` + echo 'Sandbox: ' `pwd` + ls -alF + # END + +1. Create a submit file for this script and submit it. +1. When the job finishes, look at the contents of the output file. + +In the output file, note the `Sandbox:` line: That is the full path to your job sandbox for the run. It was created just for your job, and it was removed as soon as your job finished. + +Next, look at the output that appears after the `Sandbox:` line; it is the output from the `ls` command in the script. It shows all of the files in your job sandbox, as they existed at the end of the execution of `sandbox.sh`. The number of files that you see can change depending on the HTC system you are using, but some of the files you should always see are: + +| | | +|-------------------|---------------------------------------------| +| `.chirp.config` | Configuration for an advanced feature | +| `sandbox.sh` | Your executable | +| `.job.ad` | The job ClassAd | +| `.machine.ad` | The machine ClassAd | +| `_condor_stderr` | Saved standard error from the job | +| `_condor_stdout` | Saved standard output from the job | +| `tmp/`, `var/tmp/`| Directories in which to put temporary files | + +So, HTCondor wrote copies of the job and machine ads (for use by the job, if desired), transferred your executable (`sandbox.sh`), ran it, and saved its standard output and standard error into files. Notice that your submit file, which was in the same directory on the access point machine as your executable, was **not** transferred, nor were any other files that happened to be in directory with the submit file. + +Now that we know something about the sandbox, we can transfer more files to and from it. + +Running a Job With Input Files +------------------------------ + +Next, you will run a job that requires an input file. Remember, the initial job sandbox will contain only the job executable, unless you tell HTCondor explicitly about every other file that needs to be transferred to the job. + +Here is a Python script that takes the name of an input file (containing one word per line) from the command line, counts the number of times each (lowercased) word occurs in the text, and prints out the final list of words and their counts. + +``` python +#!/usr/bin/env python3 + +import os +import sys + +if len(sys.argv) != 2: + print(f'Usage: {os.path.basename(sys.argv[0])} DATA') + sys.exit(1) +input_filename = sys.argv[1] + +words = {} + +with open(input_filename, 'r', encoding='iso-8859-1') as my_file: + for line in my_file: + word = line.strip().lower() + if word in words: + words[word] += 1 + else: + words[word] = 1 + +for word in sorted(words.keys()): + print(f'{words[word]:8d} {word}') +``` + +1. Create and save the Python script in a file named `freq.py`. +1. Download the input file for the script (263K lines, ~1.4 MB) and save it in your submit directory: + + :::console + username@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/intro-2.1-words.txt + +1. Create a submit file for the `freq.py` executable. +1. 
Add a line called `transfer_input_files = ` to tell HTCondor to transfer the input file to the job: + + :::file + transfer_input_files = intro-2.1-words.txt + + As with all submit file commands, it does not matter where this line goes, as long as it comes before the word `queue`. + +1. Since we want HTCondor to pass an argument to our Python executable, we need to remember to add an `arguments = ` line in our submit file so that HTCondor knows to pass an argument to the job. Set this `arguments = ` line equal to the argument to the Python script (i.e., the name the input file). +1. Submit the job to HTCondor, wait for it to finish, and check the output! + +If things do not work the first time, keep trying! At this point in the exercises, we are telling you less and less explicitly how to do steps that you have done before. If you get stuck, ask for help in the Slack channel. + +!!! note + If you want to transfer more than one input file, list all of them on a single `transfer_input_files` command, + separated by commas. + For example, if there are three input files: + + transfer_input_files = a.txt, b.txt, c.txt + + +Transferring Output Files +------------------------- + +So far, we have relied on programs that send their output to the standard output and error streams, which HTCondor captures, saves, and returns back to the submit directory. But what if your program writes one or more files for its output? How do you tell HTCondor to bring them back? + +Let’s start by exploring what happens to files that a job creates in the sandbox. +We will use a very simple method for creating a new file: we will copy an input file to another name. + +1. Find or create a small input file (it is fine to use any small file from a previous exercise). +1. Create a submit file that transfers the input file and copies it to another name (as if doing `/bin/cp input.txt output.txt` on the command line) + - Make the output filename different than any filenames that are in your submit directory + - What is the `executable` line? + - What is the `arguments` line? + - How do you tell HTCondor to transfer the input file? + - As always, use `output`, `error`, and `log` filenames that are different from previous exercises +1. Submit the job and wait for it to finish. + +What happened? Can you tell what HTCondor did with the output file that was created (did it end up back on the access point?), after it was created in the job sandbox? Look carefully at the list of files in your submit directory now. + +Transferring Specific Output Files +---------------------------------- + +As you saw in the last exercise, by default HTCondor transfers files that are created in the job sandbox back to the submit directory when the job finishes. In fact, HTCondor will also transfer back **changed** input files, too. But, this only works for files that are in the top-level sandbox directory, and **not** for ones contained in subdirectories. + +What if you want to bring back only **some** output files, or output files contained in subdirectories? + +Here is a shell script that creates several files, including a copy of an input file in a new subdirectory: + +``` shell +#!/bin/sh +if [ $# -ne 1 ]; then echo "Usage: $0 INPUT"; exit 1; fi +date > output-timestamp.txt +cal > output-calendar.txt +mkdir subdirectory +cp $1 subdirectory/backup-$1 +``` + +First, let’s confirm that HTCondor does not bring back the output file (which starts with the prefix `backup-`) in the subdirectory: + +1. 
Create a file called `output.sh` and save the above shell script in this file. +1. Write a submit file that transfers any input file and runs `output.sh` on it (remember to include an `arguments = ` line and pass the input filename as an argument). +1. Submit the job, wait for it to finish, and examine the contents of your submit directory. + +Suppose you decide that you want only the timestamp output file and all files in the subdirectory, but not the calendar output file. You can tell HTCondor to only transfer these specific files back to the submission directory using `transfer_output_files =`: + +``` file +transfer_output_files = output-timestamp.txt, subdirectory/ +``` + +When using `transfer_output_files =`, HTCondor will only transfer back the files you name - all other files will be ignored and deleted at the end of a job. + +!!! note + See the trailing slash (`/`) on the subdirectory? + That tells HTCondor to transfer back **the files contained in the subdirectory, but not the directory itself**; + the files will be written directly into the submit directory. + If you want HTCondor to transfer back an entire directory, leave off the trailing slash. + +1. Remove all output files from the previous run, including `output-timestamp.txt` and `output-calendar.txt`. +1. Copy the previous submit file that ran `output.sh` and add the `transfer_output_files` line from above. +1. Submit the job, wait for it to finish, and examine the contents of your submit directory. + +Did it work as you expected? + +Thinking About Progress So Far +------------------------------ + +At this point, you can do just about everything that you need in order to run jobs on a HTC pool. You can identify the executable, arguments, and input files, and you can get output back from the job. This is a big achievement! + +References +---------- + +There are many more details about HTCondor’s file transfer mechanism not covered here. +For more information, read ["Submitting Jobs Without a Shared Filesystem"](https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html) in the HTCondor Manual. + diff --git a/docs/materials/htcondor/part2-ex2-queue-n.md b/docs/materials/htcondor/part2-ex2-queue-n.md new file mode 100644 index 0000000..404862e --- /dev/null +++ b/docs/materials/htcondor/part2-ex2-queue-n.md @@ -0,0 +1,222 @@ +--- +status: testing +--- + + + +HTC Exercise 2.2: Use queue *N*, $(Cluster), and $(Process) +============================================================== + +Background +------------ + +Suppose you have a program that you want to run many times with different arguments each time. With what you know so far, you have a couple of choices: + +- Write one submit file; submit one job, change the argument in the submit file, submit another job, change the submit file, … +- Write many submit files that are nearly identical except for the program argument + +Neither of these options seems very satisfying. Fortunately, HTCondor's `queue` statement is here to help! + + +Exercise Goal +------------- + +The goal of the next several exercises is +to learn to submit many jobs from a single HTCondor `queue` statement, +and to control things like filenames and arguments on a per-job basis when doing so. + + +Running Many Jobs With One queue Statement +------------------------------------------ + +**Example** +Here is a C program that uses a stochastic (random) method to estimate the value of π. The single argument to the program is the number of samples to take. 
More samples should result in better estimates!

``` c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    struct timeval my_timeval;
    int iterations = 0;
    int inside_circle = 0;
    int i;
    double x, y, pi_estimate;

    gettimeofday(&my_timeval, NULL);
    srand48(my_timeval.tv_sec ^ my_timeval.tv_usec);

    if (argc == 2) {
        iterations = atoi(argv[1]);
    } else {
        printf("usage: circlepi ITERATIONS\n");
        exit(1);
    }

    for (i = 0; i < iterations; i++) {
        x = (drand48() - 0.5) * 2.0;
        y = (drand48() - 0.5) * 2.0;
        if (((x * x) + (y * y)) <= 1.0) {
            inside_circle++;
        }
    }
    pi_estimate = 4.0 * ((double) inside_circle / (double) iterations);
    printf("%d iterations, %d inside; pi = %f\n", iterations, inside_circle, pi_estimate);
    return 0;
}
```

1. In a new directory for this exercise, create and save the code to a file named `circlepi.c`
1. Compile the code (we will cover this in more detail during the Software lecture):

    :::console
    username@ap1 $ gcc -o circlepi circlepi.c

1. Test the program with just 1000 samples:

    :::console
    username@ap1 $ ./circlepi 1000

Now suppose that you want to run the program many times, to produce many estimates.
To do so, we can tell HTCondor how many jobs to "queue up" via the `queue` statement
we've been putting at the end of each of our submit files.
Let’s see how it works:

1. Write a normal submit file for this program
    - Pass 1 million (`1000000`) as the command line argument to `circlepi`
    - Make sure to include `log`, `output`, and `error` (with filenames like `circlepi.log`), and `request_*` lines
    - At the end of the file, write `queue 3` instead of just `queue` ("queue 3 jobs" vs. "queue a job").
1. Submit the file. Note the slightly different message from `condor_submit`:

    :::console
    3 job(s) submitted to cluster *NNNN*.

1. Before the jobs execute, look at the job queue to see the multiple jobs

Here is some sample `condor_q -nobatch` output:

``` console
 ID       OWNER           SUBMITTED     RUN_TIME ST PRI SIZE CMD
10228.0   cat             7/25 11:57   0+00:00:00 I  0    0.7 circlepi 1000000000
10228.1   cat             7/25 11:57   0+00:00:00 I  0    0.7 circlepi 1000000000
10228.2   cat             7/25 11:57   0+00:00:00 I  0    0.7 circlepi 1000000000
```

In this sample, all three jobs are part of **cluster** `10228`,
but the first job was assigned **process** `0`,
the second job was assigned process `1`,
and the third one was assigned process `2`.
(Programmers like to start counting from 0.)

Now we can understand what the first column in the output, the ***job ID***, represents.
It is a job’s *cluster number*, a dot (`.`), and the job’s *process number*.
So in the example above, the job ID of the second job is `10228.1`.

**Pop Quiz:** Do you remember how to ask HTCondor's queue to list the status of all of the jobs from one cluster? How about one specific job ID?

Using queue *N* With Output
---------------------------

When all three jobs in your single cluster are finished, examine the resulting files.

- What is in the output file?
- What is in the error file? (hopefully it is empty!)
- What is in the log file? Look carefully at the job IDs in each event.
- Is this what you expected? Is it what you wanted? If the output is not what you expected, what do you think happened?

Using $(Process) to Distinguish Jobs
------------------------------------

As you saw with the experiment above, each job ended up overwriting the same output and error filenames in the submission directory.
+After all, we didn't tell it to behave any differently when it ran three jobs. + +We need a way to separate output (and error) files *per job that is queued*, not just for the whole cluster of jobs. Fortunately, HTCondor has a way to separate the files easily. + +When processing a submit file, HTCondor will replace any instance of `$(Process)` with the process number of the job, for each job that is queued. +For example, you can use the `$(Process)` variable to define a separate output file name for each job: + +``` file +output = my-output-file-$(Process).out +queue 10 +``` + +Even though the `output` filename is defined only once, HTCondor will create separate output filenames for each job: + +| | | +|------------------|------------------------| +| First job | `my-output-file-0.out` | +| Second job | `my-output-file-1.out` | +| Third job | `my-output-file-2.out` | +| ... | ... | +| Last (tenth) job | `my-output-file-9.out` | + +Let’s see how this works for our program that estimates π. + +1. In your submit file, change the definitions of `output` and `error` to use `$(Process)` in the filename, similar to the example above. +1. Delete any standard output, standard error, and log files from previous runs. +1. Submit the updated file. + +When all three jobs are finished, examine the resulting files again. + +- How many files are there of each type? What are their names? +- Is this what you expected? Is it what you wanted from the π estimation process? + +Using $(Cluster) to Separate Files Across Runs +---------------------------------------------- + +With `$(Process)`, you can get separate output (and error) filenames for each job within a run. However, the next time you submit the same file, all of the output and error files are overwritten by new ones created by the new jobs. Maybe this is the behavior that you want. But sometimes, you may want to separate files by run, as well. + +In addition to `$(Process)`, there is also a `$(Cluster)` variable that you can use in your submit files. It works just like `$(Process)`, except it is replaced with the cluster number of the entire submission. Because the cluster number is the same for all jobs within a single submission, it does not separate files by job within a submission. But when used **with** `$(Process)`, it can be used to separate files by run. For example, consider this `output` statement: + +``` file +output = my-output-file-$(Cluster)-$(Process).out +``` + +For one particular run, it might result in output filenames like `my-output-file-2444-0.out`, `myoutput-file-2444-1.out`, `myoutput-file-2444-2.out`, etc. + +However, the next run would have different filenames, replacing `2444` with the new Cluster number of that run. + +Using $(Process) and $(Cluster) in Other Statements +--------------------------------------------------- + +The `$(Cluster)` and `$(Process)` variables can be used in any submit file statement, although they are useful in some kinds of submit file statements and not really for others. For example, consider using $(Cluster) or $(Process) in each of the below: + +- `log` +- `transfer_input_files` +- `transfer_output_files` +- `arguments` + +Unfortunately, HTCondor does not easily let you perform math on the `$(Process)` number when using it. So, for example, if you use `$(Process)` as a numeric argument to a command, it will always result in jobs getting the arguments 0, 1, 2, and so on. If you have control over your program and the way in which it uses command-line arguments, then you are fine. 
Otherwise, you might need a solution like those in the next exercises. + +(Optional) Defining JobBatchName for Tracking +--------------------------------------------- + +It is possible to define arbitrary attributes in your submit file, and that one purpose of such attributes is to track or report on different jobs separately. In this optional exercise, you will see how this technique can be used. + +Once again, we will use `sleep` jobs, so that your jobs remain in the queue long enough to experiment on. + +1. Create a submit file that runs `sleep 120`. +1. Instead of a single `queue` statement, write this: + + :::file + jobbatchname = 1 + queue 5 + +1. Submit the submit file to HTCondor. +1. Now, quickly edit the submit file to instead say: + + :::file + jobbatchname = 2 + +1. Submit the file again. + +Check on the submissions using a normal `condor_q` and `condor_q -nobatch`. Of course, your special attribute does not appear in the `condor_q -nobatch` output, but it is present in the `condor_q` output and in each job’s ClassAd. You can see the effect of the attribute by limiting your `condor_q` output to one type of job or another. First, run this command: + +``` console +username@ap1 $ condor_q -constraint 'JobBatchName == "1"' +``` + +Do you get the output that you expected? Using the example command above, how would you list your other five jobs? +(There will be more on how to use HTCondor constraints in later exercises.) diff --git a/docs/materials/htcondor/part2-ex3-queue-from.md b/docs/materials/htcondor/part2-ex3-queue-from.md new file mode 100644 index 0000000..700ddfb --- /dev/null +++ b/docs/materials/htcondor/part2-ex3-queue-from.md @@ -0,0 +1,135 @@ +--- +status: testing +--- + + + +HTC Exercise 2.3: Submit with “queue from” +============================================= + +Exercise Goals +--------------- + +In this exercise and the next one, you will **explore more ways to use a single +submit file to submit many jobs**. The goal of this exercise is to submit many +jobs from a single submit file by using the `queue ... from` syntax to read +variable values from a file. + + +Background +---------- + +In all cases of submitting many jobs from a single submit file, the key questions are: + +- What makes each job unique? In other words, there is one job per \_\_\_\_\_? +- So, how should you tell HTCondor to distinguish each job? + +For `queue *N*`, jobs are distinguished simply by the built-in "process" variable. But with the remaining `queue` forms, you help HTCondor distinguish jobs by other, more meaningful *custom* variables. + +Counting Words in Files +----------------------- + +Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. As mentioned in the lecture, HTCondor provides many ways to submit jobs for this task. You could create a separate submit file for each book, and submit all of the files manually, but you'd have a lot of file lines to modify each time (in particular, all five of the last lines before `queue` below): + +``` file +executable = freq.py +request_memory = 1GB +request_disk = 20MB +should_transfer_files = YES +when_to_transfer_output = ON_EXIT + +transfer_input_files = AAiW.txt +arguments = AAiW.txt +output = AAiW.out +error = AAiW.err +log = AAiW.log +queue +``` + +This would be overly verbose and tedious. Let's do better. 
+ +Queue Jobs From a List of Values +-------------------------------- + +Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the most common *N* words of a document. However, we want to experiment with different values of *N*. + +For this analysis, we will have a new version of the word-frequency counting +script. First, we need a new version of the word counting program so that it +accepts an extra number as a command line argument and outputs only that many +of the most common words. Here is the new code (it's still not important that +you understand this code): + +``` python +#!/usr/bin/env python3 + +import os +import sys +import operator + +if len(sys.argv) != 3: + print(f'Usage: {os.path.basename(sys.argv[0])} DATA NUM_WORDS') + sys.exit(1) +input_filename = sys.argv[1] +num_words = int(sys.argv[2]) + +words = {} + +with open(input_filename, 'r') as my_file: + for line in my_file: + line_words = line.split() + for word in line_words: + if word in words: + words[word] += 1 + else: + words[word] = 1 + +sorted_words = sorted(words.items(), key=operator.itemgetter(1)) +for word in sorted_words[-num_words:]: + print(f'{word[0]} {word[1]:8d}') +``` + +To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename: + +1. Save the script as `wordcount-top-n.py`. +1. Download and unpack some books from Project Gutenberg: + + :::console + user@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/books.zip + user@ap1 $ unzip books.zip + +1. Create a new submit file (or base it off a previous one!) named `wordcount-top.sub`, including memory and disk requests of 20 MB. +1. All of the jobs will use the same `executable` and `log` statements. +1. Update other statements to work with two variables, `book` and `n`: + + :::file + output = $(book)_top_$(n).out + error = $(book)_top_$(n).err + transfer_input_files = $(book) + arguments = "$(book) $(n)" + queue book, n from books_n.txt + + Note especially the changes to the `queue` statement; it now tells HTCondor to read a separate text file of ***pairs*** of values, which will be assigned to `book` and `n` respectively. + +1. Create the separate text file of job variable values and save it as `books_n.txt`: + + :::file + AAiW.txt, 10 + AAiW.txt, 25 + AAiW.txt, 50 + PandP.txt, 10 + PandP.txt, 25 + PandP.txt, 50 + TAoSH.txt, 10 + TAoSH.txt, 25 + TAoSH.txt, 50 + + Note that we used 3 different values for *n* for each book. + +1. Submit the file +1. Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created? + +Extra Challenge 1 +----------------- + +You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., `AAiW.txt`), the output filenames contain with multiple extensions (e.g., `AAiW.txt.err`). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. 
*Change your submit file and variable file for this exercise so that the output filenames do not include the `.txt` extension.* + diff --git a/docs/materials/htcondor/part2-ex4-queue-matching.md b/docs/materials/htcondor/part2-ex4-queue-matching.md new file mode 100644 index 0000000..ca38a46 --- /dev/null +++ b/docs/materials/htcondor/part2-ex4-queue-matching.md @@ -0,0 +1,124 @@ +--- +status: testing +--- + + + +Bonus HTC Exercise 2.4: Submit With “queue matching” +================================================= + +Exercise Goal +--------------- + +The goal of this exercise is to submit many jobs from a single submit file by using the `queue ... matching` syntax to submit jobs with variable values derived from files in the current directory which match a specified pattern. + +Counting Words in Files +----------------------- + +Returning to our book word-counting example, let's pretend that instead of +three books, we have an entire library. While we could list all of the text +files in a `books.txt` file and use `queue book from books.txt`, it could be a +tedious process, especially for tens of thousands of files. Luckily HTCondor +provides a mechanism for submitting jobs based on pattern-matched files. + +Queue Jobs By Matching Filenames +-------------------------------- + +This is an example of a common scenario: We want to run one job per file, where the filenames match a certain consistent pattern. The `queue ... matching` statement is made for this scenario. + +Let’s see this in action. First, here is a new version of the script (note, we removed the 'top n words' restriction): + +``` python +#!/usr/bin/env python3 + +import os +import sys +import operator + +if len(sys.argv) != 2: + print(f'Usage: {os.path.basename(sys.argv[0])} DATA') + sys.exit(1) +input_filename = sys.argv[1] + +words = {} + +with open(input_filename, 'r') as my_file: + for line in my_file: + line_words = line.split() + for word in line_words: + if word in words: + words[word] += 1 + else: + words[word] = 1 + +sorted_words = sorted(words.items(), key=operator.itemgetter(1)) +for word in sorted_words: + print(f'{word[0]} {word[1]:8d}') +``` + +To use the script: + +1. Create and save this script as `wordcount.py`. +1. Verify the script by running it on one book manually. +1. Create a new submit file to submit one job (pick a book file and model your submit file off of the one above) +1. Modify the following submit file statements to work for all books: + + :::text + transfer_input_files = $(book) + arguments = $(book) + output = $(book).out + error = $(book).err + queue book matching *.txt + + !!!note + As always, the order of statements in a submit file does not matter, except that the `queue` statement should be last. Also note that any submit file variable name (here, `book`, but true for `process` and all others) may be used in any mixture of upper- and lowercase letters. + +1. Submit the jobs. + +HTCondor uses the `queue ... matching` statement to look for files in the submit directory that match the given pattern, then queues one job per match. For each job, the given variable (e.g., `book` here) is assigned the name of the matching file, so that it can be used in `output`, `error`, and other statements. + +The result is the same as if we had written out a much longer submit file: + +``` file +... 
+ +transfer_input_files = AAiW.txt +arguments = "AAiW.txt" +output = AAiW.txt.out +error = AAiW.txt.err +queue + +transfer_input_files = PandP.txt +arguments = "PandP.txt" +output = PandP.txt.out +error = PandP.txt.err +queue + +transfer_input_files = TAoSH.txt +arguments = "TAoSH.txt" +output = TAoSH.txt.out +error = TAoSH.txt.err +queue + +... +``` + +How many jobs were created? Is this what you expected? If you ran this in the +same directory as Exercise 2.3, you may have noticed that a job was submitted +for the `books_n.txt` file that holds the variable values in the `queue from` +statement. Beware the dangers of matching more files than intended! One +solution may be to put all of the books into an `books` directory and `queue +matching books/*.txt`. Can you think of other solutions? If you have time, try one! + +Extra Challenge 1 +----------------- + +In the example above, you used a single log file for all three jobs. HTCondor handles this situation with no problem; each job writes its events into the log file without getting in the way of other events and other jobs. But as you may have seen, it may be difficult for a person to understand the events for any particular job in the combined log file. + +Create a new submit file that works just like the one above, except that each job writes its own log file. + +Extra Challenge 2 +----------------- + +Between this exercise and the previous one, you have explored two of the three primary `queue` statements. How would you use the `queue in ... list` statement to accomplish the same thing(s) as one or both of the exercises? + diff --git a/docs/materials/index.md b/docs/materials/index.md new file mode 100644 index 0000000..db9953f --- /dev/null +++ b/docs/materials/index.md @@ -0,0 +1,193 @@ +# OSG School Materials + +## School Overview and Intro + +View the slides: +[PDF](welcome/files/osgus23-day1-part1-welcome-timc.pdf) + +## Intro to HTC and HTCondor Job Execution + +### Intro to HTC Slides + +Intro to HTC: [PDF](htcondor/files/osgus23-intro-to-htc.pdf) + +Worksheet: [PDF](htcondor/files/osgus23-htc-worksheet.pdf) + +### Intro to HTCondor Slides + +View the slides: [PDF](htcondor/files/osgus23-htc-htcondor.pdf) + + +### Intro Exercises 1: Running and Viewing Simple Jobs (Strongly Recommended) + +- [Exercise 1.1: Log in to the local submit machine and look around](htcondor/part1-ex1-login) +- [Exercise 1.2: Experiment with HTCondor commands](htcondor/part1-ex2-commands.md) +- [Exercise 1.3: Run jobs!](htcondor/part1-ex3-jobs.md) +- [Exercise 1.4: Read and interpret log files](htcondor/part1-ex4-logs.md) +- [Exercise 1.5: Determining Resource Needs](htcondor/part1-ex5-request.md) +- [Exercise 1.6: Remove jobs from the queue](htcondor/part1-ex6-remove.md) + +### Bonus Exercises: Job Attributes and Handling + +- [Bonus Exercise 1.7: Compile and run some C code](htcondor/part1-ex7-compile.md) +- [Bonus Exercise 1.8: Explore `condor_q`](htcondor/part1-ex8-queue.md) +- [Bonus Exercise 1.9: Explore `condor_status`](htcondor/part1-ex9-status.md) + +## Intro to HTCondor Multiple Job Execution + +View the Slides ([PDF](htcondor/files/osgus23-htc-htcondor-multiple-jobs.pdf)) + + +### Intro Exercises 2: Running Many HTC Jobs (Strongly Recommended) + +- [Exercise 2.1: Work with input and output files](htcondor/part2-ex1-files.md) +- [Exercise 2.2: Use `queue N`, `$(Cluster)`, and `$(Process)`](htcondor/part2-ex2-queue-n.md) +- [Exercise 2.3: Use `queue from` with custom variables](htcondor/part2-ex3-queue-from.md) +- [Bonus Exercise 2.4: Use 
`queue matching` with a custom variable](htcondor/part2-ex4-queue-matching.md) + + +## OSG + +View the slides: +[PDF](osg/files/osgus23-day2-part1-osg-timc.pdf) + +### OSG Exercises: Comparing PATh and OSG (Strongly Recommended) + +- [Exercise 1.1: Log in to the OSPool Access Point](osg/part1-ex1-login-scp.md) +- [Exercise 1.2: Running jobs in the OSPool](osg/part1-ex2-submit-osg.md) +- [Exercise 1.3: Hardware differences between PATh and OSG](osg/part1-ex3-hardware-diffs.md) +- [Exercise 1.4: Software differences in OSPool](osg/part1-ex4-software-diffs.md) + +## Troubleshooting + +Slides: ([PDF](troubleshooting/files/OSGUS2023_troubleshooting.pdf), +[PowerPoint](troubleshooting/files/OSGUS2023_troubleshooting.pptx)) + +### Troubleshooting Exercises: + +- [Exercise 1.1: Troubleshooting Jobs](troubleshooting/part1-ex1-troubleshooting.md) +- [Exercise 1.2: Job Retry](troubleshooting/part1-ex2-job-retry.md) + +## Software + +Slides: [PDF](software/files/osgus23-software.pdf) + +### Software Exercises 1: Exploring Containers + +- [Exercise 1.1: Run and Explore Apptainer Containers](software/part1-ex1-run-apptainer.md) +- [Exercise 1.2: Use Apptainer Containers in OSPool Jobs](software/part1-ex2-apptainer-jobs.md) +- [Exercise 1.3: Use Docker Containers in OSPool Jobs](software/part1-ex3-docker-jobs.md) +- [Exercise 1.4: Build, Test, and Deploy an Apptainer Container](software/part1-ex4-apptainer-build.md) +- [Exercise 1.5: Choose Software Options](software/part1-ex5-pick-an-option.md) + +### Software Exercises 2: Preparing Scripts +- [Exercise 2.1: Build an HTC-Friendly Executable](software/part2-ex1-build-executable.md) + +### Software Exercises 3: Container Examples (Optional) + +- [Exercise 3.1: Create an Apptainer Definition Files](software/part3-ex1-apptainer-recipes.md) +- [Exercise 3.2: Build Your Own Docker Container](software/part3-ex2-docker-build.md) + +### Software Exercises 4: Exploring Compiled Software (Optional) + +- [Exercise 4.1: Download and Use Compiled Software](software/part4-ex1-download.md) +- [Exercise 4.2: Use a Wrapper Script To Run Software](software/part4-ex2-wrapper.md) +- [Exercise 4.3: Using Arguments With Wrapper Scripts](software/part4-ex3-arguments.md) + +### Software Exercises 5: Compiled Software Examples (Optional) + +- [Exercise 5.1: Compiling a Research Software](software/part5-ex1-prepackaged.md) +- [Exercise 5.2: Compiling Python and Running Jobs](software/part5-ex2-python.md) +- [Exercise 5.3: Using Conda Environments](software/part5-ex3-conda.md) +- [Exercise 5.4: Compiling and Running a Simple Code](software/part5-ex4-compiling.md) + + +## Data + +View the slides +([PDF](data/files/osgus23-data.pdf), +[PowerPoint](data/files/osgus23-data.pptx)) + +### Data Exercises 1: HTCondor File Transfer (Strongly Recommended) + +- [Exercise 1.1: Understanding a job's data needs](data/part1-ex1-data-needs.md) +- [Exercise 1.2: transfer\_input\_files, transfer\_output\_files, and remaps](data/part1-ex2-file-transfer.md) +- [Exercise 1.3: Splitting input](data/part1-ex3-blast-split.md) + +### Data Exercises 2: Using OSDF (Strongly Recommended) + +- [Exercise 2.1: OSDF for inputs](data/part2-ex1-osdf-inputs.md) +- [Exercise 2.2: OSDF for outputs](data/part2-ex2-osdf-outputs.md) + + +## Scaling Up + +View the slides +([PDF](scaling/files/osgus23-scaling-out.pdf)) + +### Scaling Up Exercises + +- [Exercise 1.1: Organizing HTC workloads](scaling/part1-ex1-organization.md) +- [Exercise 1.2: Investigating Job Attributes](scaling/part1-ex2-job-attributes.md) +- [Exercise 
1.3: Getting Job Information from Log Files](scaling/part1-ex3-log-files.md) + + +## Workflows with DAGMan + +View the slides +([PDF](workflows/files/osgus23-dagman.pdf), +[PowerPoint](workflows/files/osgus23-dagman.pptx)) + +### DAGMan Exercises 1 + +- [Exercise 1.1: Coordinating set of jobs: A simple DAG](workflows/part1-ex1-simple-dag.md) +- [Exercise 1.2: A brief detour through the Mandelbrot set](workflows/part1-ex2-mandelbrot.md) +- [Exercise 1.3: A more complex DAG](workflows/part1-ex3-complex-dag.md) +- [Exercise 1.4: Handling jobs that fail with DAGMan](workflows/part1-ex4-failed-dag.md) +- [Exercise 1.5: Workflow Challenges](workflows/part1-ex5-challenges.md) + +## Extra Topics + + + +### Self-checkpointing for long-running jobs + +View the slides +([PDF](checkpoint/files/OSGUS2023_checkpointing.pdf),[PPT](checkpoint/files/OSGUS2023_checkpointing.pptx)) + +- [Exercise 1.1: Trying out self-checkpointing](checkpoint/part1-ex1-checkpointing.md) + + +### Special Environments + +View the slides +([PDF](special/files/osgus23-special.pdf), +[PowerPoint](special/files/osgus23-special.pptx)) + +### Special Environments Exercises 1 + +- [Exercise 1.1: GPUs](special/part1-ex1-gpus.md) + +### Introduction to Research Computing Facilitation + +View the slides: [PDF](facilitation/files/osgus23-facilitation-campuses.pdf) + +## Final Talks + +* Philosophy: (slides coming soon) +* Final thoughts: [PDF](final/files/osgus23-day5-part6-forward-timc.pdf) diff --git a/docs/materials/osg/files/osgus23-day2-part1-osg-timc.pdf b/docs/materials/osg/files/osgus23-day2-part1-osg-timc.pdf new file mode 100644 index 0000000..d6cc450 Binary files /dev/null and b/docs/materials/osg/files/osgus23-day2-part1-osg-timc.pdf differ diff --git a/docs/materials/osg/part1-ex1-login-scp.md b/docs/materials/osg/part1-ex1-login-scp.md new file mode 100644 index 0000000..7f4bdff --- /dev/null +++ b/docs/materials/osg/part1-ex1-login-scp.md @@ -0,0 +1,149 @@ +--- +status: testing +--- + +# OSG Exercise 1.1: Log In to the OSPool Access Point + +The main goal of this exercise is to log in to an Open Science Pool Access Point +so that you can start submitting jobs into the OSPool. +But before doing that, you will first prepare a file on Monday‘s Access Point to copy to the OSPool Access Point. +Then you will learn how to efficiently copy files between the Access Points. + +If you have trouble getting `ssh` access to the OSPool Access Point, ask the instructors right away! +Gaining access is critical for all remaining exercises. + +## Part 1: On the PATh Access Point + +The first few sections below are to be completed on `ap1.facility.path-cc.io`, the PATh Access Point. +This is still the same Access Point you have been using since yesterday. + +## Preparing files for transfer + +When transferring files between computers, it’s best to limit the number of files as well as their size. +Smaller files transfer more quickly and, if your network connection fails, +restarting the transfer is less painful than it would be if you were transferring large files. + +Archiving tools (WinZip, 7zip, Archive Utility, etc.) can compress the size of your files +and place them into a single, smaller archive file. +The Unix `tar` command is a one-stop shop for creating, extracting, and viewing the contents of `tar` archives +(called *tarballs*). 
Its usage is as follows:

- To **create** a tarball named `<tarball filename>` containing `<files>`, use the following command:

    :::console
    $ tar -czvf <tarball filename> <files>

    Where `<tarball filename>` should end in `.tar.gz` and `<files>` can be a list of any number of files
    and/or folders, separated by spaces.

- To **extract** the files from a tarball into the current directory:

    :::console
    $ tar -xzvf <tarball filename>

- To **list** the files within a tarball:

    :::console
    $ tar -tzvf <tarball filename>

### Comparing compressed sizes

You can adjust the level of compression of `tar` by prepending your command with `GZIP=--<level>`, where
`<level>` can be either `fast` for the least compression, or `best` for the most compression (the default
compression is between `best` and `fast`).

While still logged in to `ap1.facility.path-cc.io`:

1. Create and change into a new folder for this exercise, for example `osg-ex11`
1. Use `wget` to download the following files from our web server:
    1. Text file: 
    1. Archive: 
    1. Image: 
1. Use `tar` on each file and use `ls -l` to compare the sizes of the original file and the compressed version.

Which files were compressed the least? Why?

## Part 2: On the Open Science Pool Access Point

For many of the remaining exercises, you will be using an OSPool Access Point,
`ap40.uw.osg-htc.org`,
which submits jobs into the OSPool.

To log in to the OSPool Access Point,
use the same username (and SSH key, if you did that) as on `ap1`.
If you have any issues logging in to `ap40.uw.osg-htc.org`,
please ask for help right away!

So please `ssh` in to the server and take a look around:

1. Log in using `ssh USERNAME@ap40.uw.osg-htc.org` (substitute your own username)
1. Try some Linux and HTCondor commands; for example:
    * Linux commands: `hostname`, `pwd`, `ls`, and so on
    * What is the operating system? `uname` and (in this case) `cat /etc/redhat-release`
    * HTCondor commands: `condor_version`, `condor_q`, `condor_status -total`

## Transferring files

In the next exercise, you will submit the same kind of job as in the previous exercise.
Wouldn’t it be nice to copy the files instead of starting from scratch?
And in general, being able to copy files between servers is helpful, so let’s explore a way to do that.

### Using secure copy

[Secure copy](https://en.wikipedia.org/wiki/Secure_copy) (`scp`) is a command based on SSH
that lets you securely copy files between two different servers.
It takes similar arguments to the Unix `cp` command but also takes additional information about servers.
Its general form is like this:

```console
scp <sources> ... [username@]<server>:<destination path>
```

`<destination path>` may be omitted if you want to copy your sources to your remote home directory
and `[username@]` may be omitted if your usernames are the same across both servers.
For example, if you are logged in to `ap40.uw.osg-htc.org`
and wanted to copy the file `foo` from your current directory
to your home directory on `ap1.facility.path-cc.io`,
and if your usernames are the same on both servers,
the command would look like this:

```console
$ scp foo ap1.facility.path-cc.io:
```

Additionally, you could *pull* files from `ap1.facility.path-cc.io` to `ap40.uw.osg-htc.org`.
The following command copies `bar` from your home directory on `ap1.facility.path-cc.io`
to your current directory on `ap40.uw.osg-htc.org`;
and in this case, the username for `ap1` is specified:

``` console
$ scp USERNAME@ap1.facility.path-cc.io:bar .
```

Also, you can copy folders between servers using the `-r` option.
+If you kept all your files from the HTCondor exercise 1.3 in a folder named `htc-1.3` on `ap1.facility.path-cc.io`, +you could use the following command to copy them to your home directory on `ap40.uw.osg-htc.org`: + +``` console +$ scp -r USERNAME@ap1.facility.path-cc.io:htc-1.3 . +``` + +### Secure copy to your laptop + +During your research, you may need to transfer output files +from your submit server to inspect them on your personal computer, +which can also be done with `scp`! +To use `scp` on your laptop, follow the instructions relevant to your computer‘s operating system: + +#### Mac and Linux users + +`scp` should be included by default and available via the terminal on both Mac and Linux operating systems. + +#### Windows users + +WinSCP is an `scp` client for Windows operating systems. Install WinSCP from + +# Next exercise + +Once completed, move onto the next exercise: [Running jobs in the OSG](part1-ex2-submit-osg.md) diff --git a/docs/materials/osg/part1-ex2-submit-osg.md b/docs/materials/osg/part1-ex2-submit-osg.md new file mode 100644 index 0000000..a54a8f5 --- /dev/null +++ b/docs/materials/osg/part1-ex2-submit-osg.md @@ -0,0 +1,93 @@ +--- +status: testing +--- + +# OSG Exercise 1.2: Running Jobs in OSPool + +The goal of this exercise is to map the physical locations of some Execution Points in the OSPool. +We will provide the executable and associated data, +so your job will be to write a submit file that queues multiple jobs. +Once complete, you will manually collate the results. + +## Where in the world are my jobs? + +To find the physical location of the computers your jobs our running on, you will use a method called *geolocation*. +Geolocation uses a registry to match a computer’s network address to an approximate latitude and longitude. + +### Geolocating several Execution Points + +Now, let’s try to remember some basic HTCondor ideas from the HTC exercises: + +1. Log in to `ap40.uw.osg-htc.org` if you have not yet. +1. Create and change into a new folder for this exercise, for example `osg-ex12` +1. Download the geolocation code: + + :::console + $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool21/location-wrapper.sh \ + http://proxy.chtc.wisc.edu/SQUID/osgschool21/wn-geoip.tar.gz + + You will be using `location-wrapper.sh` as your executable and `wn-geoip.tar.gz` as an input file. + +1. Create a submit file that queues **fifty** jobs that run `location-wrapper.sh`, + transfers `wn-geoip.tar.gz` as an input file, + and uses the `$(Process)` macro to write different `output` and `error` files. + Also, add the following requirement to the submit file (it’s not important to know what it does): + + requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) + + Try to do this step without looking at materials from the earlier exercises. + But if you are stuck, see [HTC Exercise 2.2](../htcondor/part2-ex2-queue-n.md). + +1. Submit your jobs and wait for the results + +### Collating your results + +Now that you have your results, it’s time to summarize them. +Rather than inspecting each output file individually, +you can use the `cat` command to print the results from all of your output files at once. +If all of your output files have the format `location-#.out` (e.g., `location-10.out`), +your command will look something like this: + +``` console +$ cat location-*.out +``` + +The `*` is a wildcard so the above cat command runs on all files that start with `location-` and end in `.out`. 
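Before collating, a quick sanity check can confirm that every job actually produced output. One possible way to do that (a sketch, using the `location-*.out` naming described above) is to count the matching files and compare the count against the number of jobs you queued:

``` console
$ ls location-*.out | wc -l
```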
+Additionally, you can use `cat` in combination with the `sort` and `uniq` commands using "pipes" (`|`) +to print only the unique results: + +``` console +$ cat location-*.out | sort | uniq +``` + +## Mapping your results + +To visualize the locations of the Execution Points that your jobs ran on, +you will be using . +Copy and paste the collated results into the text box that pops up +when clicking on the 'Bulk Entry' button on the right-hand side. +Where did your jobs run? + +## Next exercise + +Once completed, move onto the next exercise: [Hardware Differences in the OSG](part1-ex3-hardware-diffs.md) + +## Extra Challenge: Cleaning up your submit directory + +If you run `ls` in the directory from which you submitted your job, you may see that you now have thousands of files! +Proper data management starts to become a requirement as you start to develop true HTC workflows; +it may be helpful to separate your submit files, code, and input data from your output data. + +1. Try editing your submit file so that all your output and error files are saved to separate directories within your + submit directory. + + !!! note "Tip" + Experiment with fewer job submissions until you’re confident you have it right, + then go back to submitting 500 jobs. + Remember: Test small and scale up! + +1. Submit your file and track the status of your jobs. + +Did your jobs complete successfully with output and error files saved in separate directories? +If not, can you find any useful information in the job logs or hold messages? +If you get stuck, review the [slides from Tuesday](../index.md). diff --git a/docs/materials/osg/part1-ex3-hardware-diffs.md b/docs/materials/osg/part1-ex3-hardware-diffs.md new file mode 100644 index 0000000..3dd670f --- /dev/null +++ b/docs/materials/osg/part1-ex3-hardware-diffs.md @@ -0,0 +1,136 @@ +--- +status: testing +--- + +# OSG Exercise 1.3: Hardware Differences Between PATh and OSG + +The goal of this exercise is to compare hardware differences between the Monday cluster +(the PATh Facility) and the Open Science Pool. +Specifically, we will look at how easy it is to get access to resources +in terms of the amount of memory that is requested. +This will not be a very careful study, +but should give you some idea of one way in which the pools are different. + +In the first two parts of the exercise, +you will submit batches of jobs that differ only in how much memory each one requests. +This is called this a *parameter sweep*, in that we are testing many possible values of a parameter. +We will request memory from 8–64 GB, doubling the memory each time. +One set of jobs will be submitted to the PATh Facility, and the other, identical set of jobs will be submitted to the OSPool. +You will check the queue periodically to see how many jobs have completed and how many are still waiting to run. + +## Checking PATh memory availability + +In this first part, you will create the submit file that will be used for both the PATh and OSPool jobs, +then submit the PATh set. + +### Yet another queue syntax + +Earlier, you learned about the `queue` statement +and some of the different ways it can be invoked to submit multiple jobs. +Similar to the `queue from` statement to submit jobs based on lines from a specific file, +you can use `queue in` to submit jobs based on a list that is written directly in your submit file: + +``` +queue <# of jobs> in ( + + + +... 
+) +``` + +For example, to submit 6 total jobs that sleep for `5`, `5`, `10`, `10`, `15`, and `15` seconds, +you could write the following submit file: + +``` +executable = /bin/sleep + +request_cpus = 1 +request_memory = 1MB +request_disk = 1MB + +queue 2 arguments in ( +5 +10 +15 +) +``` + +Try submitting this yourself and verify that all six jobs are in the queue, +using the `condor_q -nobatch` command. + +### Create the submit file + +To create our parameter sweep, +we will create a **new** submit file with the queue…in syntax +and change the value of our parameter (`request_memory`) for each batch of jobs. + +1. Log in or switch back to `ap1.facility.path-cc.io` (yes, back to PATh!) +1. Create and change into a new subdirectory called `osg-ex14` +1. Create a submit file named `sleep.sub` that executes the command `/bin/sleep 300`. + + !!! note + If you do not remember all of the submit statements to write this file, or just to go faster, + find a similar submit file from a previous exercise. + Copy the file and rename it here, and make sure the argument to `sleep` is `300`. + +1. Use the queue…in syntax to submit 10 jobs *each* for the following memory requests: 8, 16, 32, and 64 GB. + There will be 40 jobs total: 10 jobs requesting 8 GB, 10 requesting 16 GB, etc. +1. Submit your jobs + +### Monitoring the local jobs + +Every few minutes, run `condor_q` and see how your sleep jobs are doing. +To display the number of jobs remaining for each `request_memory` parameter specified, +run the following command: + +``` console +$ condor_q -af RequestMemory | sort -n | uniq -c +``` + +The numbers in the left column are the number of jobs left of that type +and the number on the right is the amount of memory you requested, in MB. +Consider making a little table like the one below to track progress. + +| Memory | Remaining \#1 | Remaining \#2 | Remaining \#3 | +|:-------|:--------------|:--------------|:--------------| +| 8 GB | 10 | 6 | | +| 16 GB | 10 | 7 | | +| 32 GB | 10 | 8 | | +| 64 GB | 10 | 9 | | + +In the meantime, between checking on your local jobs, start the next section – +but take a break every few minutes to switch back to `ap1` and record progress on your PATh jobs. + +## Checking OSPool memory availability + +Now you will do essentially the same thing on the OSPool. + +1. Log in or switch to `ap40.uw.osg-htc.org` + +1. Copy the `osg-ex14` directory from the [section above](#checking-chtc-memory-availability) + from `ap1.facility.path-cc.io` to `ap40.uw.osg-htc.org` + + If you get stuck during the copying process, refer to [OSG exercise 1.1](part1-ex1-login-scp.md). + +1. Submit the jobs to the OSPool + +### Monitoring the remote jobs + +As you did in the first part, use `condor_q` to track how your sleep jobs are doing. +It is fine to move on to the next exercise, but keep tracking the status of both sets of these jobs. +After you are done with the [next exercise](part1-ex4-software-diffs.md), +come back to this exercise and analyze the results. + +## Analyzing the results + +Have all of your jobs from this exercise completed on both PATh and the OSPool? +How many jobs have completed thus far on PATh? +How many have completed thus far on the OSPool? + +Due to the dynamic nature of the OSPool, +the demand for higher memory jobs there may have resulted in a temporary increase in high-memory slots there. +That being said, high-memory are a high-demand, low-availability resource in the OSPool +so your 64 GB jobs may have taken longer to run or complete. 
+On the other hand, PATh has a fair number of 64 GB (and greater) slots +so all your jobs have a high chance of running. diff --git a/docs/materials/osg/part1-ex4-software-diffs.md b/docs/materials/osg/part1-ex4-software-diffs.md new file mode 100644 index 0000000..08ce333 --- /dev/null +++ b/docs/materials/osg/part1-ex4-software-diffs.md @@ -0,0 +1,113 @@ +--- +status: testing +--- + +# OSG Exercise 1.4: Software Differences in OSPool + +The goal of this exercise is to see some differences in the availability of software in the OSPool. +At your local cluster, you may be used to having certain versions of software. +But in the OSPool, +it is likely that the software you need will not be available at all. + +## Comparing operating systems + +To really see differences between Execution Points in the PATh Facility versus the OSPool, +you will want to compare the “machine” ClassAds between the two pools. +Rather than inspecting the very long ClassAd for each Execution Point, +you will look at a specific attribute called `OpSysAndVer`, +which tells us the operating system version of the Execution Point. +An easy way to show this attribute for all Execution Points is by using `condor_status` +in conjunction with the `-autoformat` (or `-af`, for short) option. +The `-autoformat` option is like the `-format` option you learned about earlier, +and outputs the attributes you choose for each slot; +but as you may have guessed, it does some automatic formatting for you. + +So, let’s examine the operating system and (major) version of slots on the PATh Facility and the OSPool. + +1. Log in or switch to `ap1.facility.path-cc.io` and run the following command: + + :::console + $ condor_status -autoformat OpSysAndVer + +1. Log in or switch to `ap40.uw.osg-htc.org` (parallel windows are handy!) + and run the same command + +You will see many values for the operating system and major version. +Some are abbreviated — for example, +`RedHat` stands for “Red Hat Enterprise Linux” and +`SL` stands for “Scientific Linux” (a Red Hat variant). + +The only problem is that with hundreds or thousands of slots, +it's difficult to get a feel for the composition of each pool from this output. +You can use the `sort` and `uniq` commands, in sequence, on the `condor_status` output +to get counts of each unique operating system and version string. +Your command line should look something like this: + +``` console +$ condor_status -autoformat OpSysAndVer | sort | uniq -c +``` + +How would you describe the difference between the PATh Facility and OSPool? + +## Submitting probe jobs + +Now you have some idea of the diversity of operating systems on the OSPool. +This is a step in the right direction to knowing what software is available in general. +But what you really want to know is whether your specific software tool (and version) is available. + +### Software probe code + +The following shell script probes for software and returns the version if it is installed: + +```bash +#!/bin/sh + +get_version(){ + program=$1 + $program --version > /dev/null 2>&1 + double_dash_rc=$? + $program -version > /dev/null 2>&1 + single_dash_rc=$? + which $program > /dev/null 2>&1 + which_rc=$? 
+ if [ $double_dash_rc -eq 0 ]; then + $program --version 2>&1 + elif [ $single_dash_rc -eq 0 ]; then + $program -version 2>&1 + elif [ $which_rc -eq 0 ]; then + echo "$program installed but could not find version information" + else + echo "$program not installed" + fi +} + +get_version 'R' +get_version 'cmake' +get_version 'python' +get_version 'python3' +``` + +If there's a specific command line program that your research requires, feel free to add it to the script! +For example, if you wanted to test for the existence and version of `nslookup`, you would add the following to the end +of the script: + +``` file +get_version 'nslookup' +``` + +### Probing several servers + +For this part of the exercise, try creating a submit file without referring to previous exercises! + +1. Log in or switch to `ap40.uw.osg-htc.org` +1. Create and change into a new folder for this exercise, e.g. `osg-ex14` +1. Save the above script as a file named `sw_probe.sh` +1. Make sure the script can be run: `chmod a+x sw_probe.sh` +1. Try running the script in place to make sure it works: `./sw_probe.sh` +1. Create a submit file that runs `sw_probe.sh` 100 times + and uses macros to write different `output`, `error`, and `log` files +1. Submit your job and wait for the results + +Will you be able to do your research on the OSG with what's available? +Do not worry if it does not seem like you can: +Later today, you will learn how to make your jobs portable enough so that they can run anywhere! diff --git a/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz b/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz new file mode 100644 index 0000000..3a5df3a Binary files /dev/null and b/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz differ diff --git a/docs/materials/scaling/files/osgus23-scaling-out.pdf b/docs/materials/scaling/files/osgus23-scaling-out.pdf new file mode 100644 index 0000000..0c55cc6 Binary files /dev/null and b/docs/materials/scaling/files/osgus23-scaling-out.pdf differ diff --git a/docs/materials/scaling/files/osgus23-scaling-out.pptx b/docs/materials/scaling/files/osgus23-scaling-out.pptx new file mode 100644 index 0000000..f3e2dec Binary files /dev/null and b/docs/materials/scaling/files/osgus23-scaling-out.pptx differ diff --git a/docs/materials/scaling/part1-ex1-organization.md b/docs/materials/scaling/part1-ex1-organization.md new file mode 100644 index 0000000..6acd606 --- /dev/null +++ b/docs/materials/scaling/part1-ex1-organization.md @@ -0,0 +1,135 @@ +# Organizing HTC Workloads + +Imagine you have a collection of books, +and you want to analyze how word usage varies from book to book or author to author. + +This exercise is similar to HTCondor exercise 2.4, +in that it is about counting word frequencies in multiple files. +But the focus here is on organizing the files more effectively on the Access Point, +with an eye to scaling up to a larger HTC workload in the future. + +## Log into an OSPool Access Point + +Make sure you are logged into `ap40.uw.osg-htc.org`. + +## Get Files + +To get the files for this exercise: + +1. Type `wget https://github.com/osg-htc/user-school-2023/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz` to download the tarball. +1. As you learned earlier, expand this tarball file; it will create a `organizing-files` directory. +1. Change to that directory, or create a separate one for this exercise and copy the files in. 
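If it is useful, here is one way the steps above might look on the command line (a sketch; the tarball URL and the `organizing-files` directory name are the ones given above):

    :::console
    $ wget https://github.com/osg-htc/user-school-2023/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz
    $ tar -xzvf osgus23-day4-ex11-organizing-files.tar.gz
    $ cd organizing-files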
+ +## Our Workload + +We can analyze one book by running the `wordcount.py` script, with the +name of the book we want to analyze: + + :::console + $ ./wordcount.py Alice_in_Wonderland.txt + +Try running the command to see what the output is for the script. +Once you have done that delete the output file created (`rm counts.Alice_in_Wonderland.txt`). + +We want to run this script on all the books we have copies of. + +* What is the input set for this HTC workload? +* What is the output set? + +## Make an Organization Plan + +Based on what you know about the script, inputs, and outputs, +how would you organize this HTC workload in directories (folders) on the Access Point? + +There will also be system and HTCondor files produced when we submit a job — +how would you organize the log, standard output, and standard error files? + +Try making those changes before moving on to the next section of the tutorial. + +## Organize Files + +There are many different ways to organize files; +a simple method that works for most workloads is having a directory for your input files +and a directory for your output files. + +1. Set up this structure on the command line by running: + + :::console + $ mkdir input + $ mv *.txt input/ + $ mkdir output + +2. View the current directory and its subdirectories by using the `ls` command with the *recursive* (`-R`) flag: + + :::console + $ ls -R + README.md books.submit input output wordcount.py + + ./input: + Alice_in_Wonderland.txt Huckleberry_Finn.txt + Dracula.txt Pride_and_Prejudice.txt + + ./output: + +3. Next, create directories for the HTCondor log, standard output, and standard output files (in one directory): + + :::console + $ mkdir logs + $ mkdir errout + +## Submit One Job + +Now we want to submit a test job that uses this organizing scheme, +using just one item in our input set — +in this example, we will use the `Alice_in_Wonderland.txt` file from our `input` directory. + +1. Fill in the incomplete lines of the submit file, as shown below: + + :::console + executable = wordcount.py + arguments = Alice_in_Wonderland.txt + + transfer_input_files = input/Alice_in_Wonderland.txt + transfer_output_files = counts.Alice_in_Wonderland.txt + transfer_output_remaps = "counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt" + + To tell HTCondor the location of the input file, we need to include the input directory. + Also, this submit file uses the `transfer_output_remaps` feature that you learned about; + it will move the output file to the `output` directory by renaming or remapping it. + +1. Next, edit the submit file lines that tell the log, output, and error files where to go: + + :::console + output = logs/job.$(ClusterID).$(ProcID).out + error = errout/job.$(ClusterID).$(ProcID).err + log = errout/job.$(ClusterID).$(ProcID).log + +1. Submit your job and monitor its progress. + +## Submit Multiple Jobs + +Now, you are ready to submit the whole workload. + +1. Create a file with the list of input files (the input set); + here, this is the list of the book files to analyze. + Do this by using the shell `ls` command and redirecting its output to a file: + + :::console + $ ls input > booklist.txt + $ cat booklist.txt + +1. 
Modify the submit file to reference the file of inputs and replace the fixed value (`Alice_in_Wonderland.txt`) with a variable (`$(book)`): + + :::console + executable = wordcount.py + arguments = $(book) + + transfer_input_files = input/$(book) + transfer_output_files = counts.$(book) + transfer_output_remaps = "counts.$(book)=output/counts.$(book)" + + queue book from booklist.txt + +1. Submit the jobs + +1. When complete, look at the complete set of input and (now) output files to see how they are organized. diff --git a/docs/materials/scaling/part1-ex2-job-attributes.md b/docs/materials/scaling/part1-ex2-job-attributes.md new file mode 100644 index 0000000..6bd39d9 --- /dev/null +++ b/docs/materials/scaling/part1-ex2-job-attributes.md @@ -0,0 +1,112 @@ +# Exercise 1.2: Investigating Job Attributes + +The objective of this exercise is to your awareness of job "class ad attributes", +especially ones that may help you look for issues with your jobs in the OSPool. + +Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. +OSPool jobs contain extra attributes that are specific to that pool. +Thus, an OSPool job class ad may have well over 150 attributes. + +Some OSPool job attributes are especially helpful when you are scaling up jobs +and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention. + +## Preparing exercise files + +Because this exercise focuses on OSPool job attributes, please use your OSPool account on `ap40.uw.osg-htc.org`. + +1. Create a shell script for testing called `simple.sh`: + + :::file + #!/bin/bash + + SLEEPTIME=$1 + + hostname + pwd + whoami + + for i in {1..5} + do + echo "performing iteration $i" + sleep $SLEEPTIME + done + +1. Create an HTCondor submit file that queues three jobs: + + :::file + universe = vanilla + log = logs/$(Cluster)_$(Process).log + error = logs/$(Cluster)_$(Process).err + output = $(Cluster)_$(Process).out + + executable = simple.sh + + should_transfer_files = YES + when_to_transfer_output = ON_EXIT + + request_cpus = 1 + request_memory = 1GB + request_disk = 1GB + + # set arguments, queue a normal job + arguments = 600 + queue 1 + + # queue a job that will go on hold + transfer_input_files = test.txt + queue 1 + + # queue a job that will never start + request_memory = 40TB + queue 1 + +## Exploring OSPool job class ad attributes + +For this exercise, you will submit the three jobs defined in the submit file above, +then examine their job class ad attributes. 
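One possible way to get these jobs into the queue (a sketch; `job-attrs.sub` is just a placeholder name for the submit file above, which the exercise does not name, and the `logs` directory referenced by its `log` and `error` lines needs to exist before you submit):

    :::console
    $ mkdir -p logs
    $ condor_submit job-attrs.sub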
Here are some attributes that may be interesting:

* `CpusProvisioned` is the number of CPUs given to your job for the current or most recent run
* `ResidentSetSize_RAW` is the maximum amount of memory that HTCondor has noticed your job using (in KB)
* `DiskUsage_RAW` is the maximum amount of disk that HTCondor has noticed your job using (in KB)
* `NumJobStarts` is the number of times HTCondor has started your job; `1` is typical for a running job, and higher counts may indicate issues running the job
* `LastRemoteHost` identifies the name for the slot where your job is running or most recently ran
* `MachineAttrGLIDEIN_ResourceName*N*` is a set of numbered attributes that identify the most recent sites where your job ran; *N* is `0` for the most recent (or current) run, `1` for the previous run, and so on up to `9`
* `ExitCode` exists only if your job exited (completed) at least once; a value of `0` typically means success
* `HoldReasonCode` exists only if your job went on hold; if so, it is a number corresponding to the main hold reason
  (see [here](https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html?highlight=HoldReasonCode#job-classad-attributes) for details)
* `NumHoldsByReason` is a list of all of the main reasons your job has gone on hold so far with counts of each hold type

Let’s explore these attributes on real jobs.

1. Submit the jobs (above) and note the cluster ID

1. When one job from the cluster is running, view all of its job class ad attributes:

    :::console
    $ condor_q -l JOB_ID

    where `JOB_ID` is your job's ID, and `-l` stands for `-long`

    This command lists all of the job’s class ad attributes.
    Details of some of the attributes are in the
    [HTCondor Manual](https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html).
    Others are defined (and not well documented) only for the OSPool.
    Can you find any of the attributes listed above?

1. Next, use `condor_q CLUSTER_ID -af ATTRIBUTE` to examine one attribute at a time for several jobs:

    :::console
    $ condor_q CLUSTER_ID -af NumJobStarts

    where `CLUSTER_ID` is the HTCondor cluster ID noted above, and `-af` stands for `-autoformat`.

    What does the output tell you?

1. Finally, display several attributes at once for the jobs:

    :::console
    $ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode

    Why do some values appear as `undefined`?

diff --git a/docs/materials/scaling/part1-ex3-log-files.md b/docs/materials/scaling/part1-ex3-log-files.md
new file mode 100644
index 0000000..7977963
--- /dev/null
+++ b/docs/materials/scaling/part1-ex3-log-files.md
@@ -0,0 +1,140 @@
---
status: testing
---

Getting Job Information from Log Files
======================================

HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful *to you*. Here, we have a few examples of how to use some powerful Unix commands (`grep`, `sort`, `uniq`) to pull information out of these job log files. It is now time for you to try these on your own jobs!

Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) into a new directory for this exercise. Use these log files in place of `my-job.log` in the examples below.

The `grep` command displays lines from a file matching a given pattern, where the pattern is the first argument provided to `grep`.
For example `grep 'alice' address_book.txt` would print out all lines containing the characters `alice` in the file named `address_book.txt`. While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise. + +Job terminations +---------------- + +Lines for job termination events in the job log always start with `005` and contain the timestamp of when the job(s) ended. Use the following `grep` command to get a list of when jobs ended in your log files: + +``` console +$ grep '^005' my-job.log +``` + +*Optional challenge*: What is the importance of `^` in the pattern (`^005`) provided above? + +Recall that executables typically exit with code `0` when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word `termination`. Use `grep` to get a list of jobs' exit codes: + +``` console +$ grep termination my-job.log +``` + +By "piping" the output of the previous command through the `sort` and then `uniq` commands, we can get a count of each exit code: + +``` console +$ grep termination my-job.log | sort | uniq -c +``` + +Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code `0`, while two exited `1`: + +``` console +[username@ap40]$ grep '^005' my-job.log +005 (236881.000.000) 2022-07-27 15:07:38 Job terminated. +005 (236883.000.000) 2022-07-27 15:07:42 Job terminated. +005 (236882.000.000) 2022-07-27 15:08:01 Job terminated. +005 (236880.000.000) 2022-07-27 15:08:07 Job terminated. +005 (236891.000.000) 2022-07-27 15:13:31 Job terminated. +005 (236893.000.000) 2022-07-27 15:13:32 Job terminated. +005 (236892.000.000) 2022-07-27 15:13:58 Job terminated. +005 (236890.000.000) 2022-07-27 15:13:59 Job terminated. + +[username@ap40]$ grep 'termination' my-job.log +(1) Normal termination (return value 0) +(1) Normal termination (return value 1) +(1) Normal termination (return value 0) +(1) Normal termination (return value 0) +(1) Normal termination (return value 1) +(1) Normal termination (return value 0) +(1) Normal termination (return value 0) +(1) Normal termination (return value 0) + +[username@ap40]$ grep 'termination' my-job.log | sort | uniq -c +6 (1) Normal termination (return value 0) +2 (1) Normal termination (return value 1) +``` + +Job resource usage +------------------ + +Jobs' resource usages (and requests and allocations) are logged in the following format: + +``` file +Partitionable Resources : Usage Request Allocated + Cpus : 1 1 + Disk (KB) : 10382 1048576 1468671 + Memory (MB) : 692 1024 1024 +``` + +Run the following `grep` command to pull out the memory information from your job logs: + +``` console +$ grep 'Memory (MB) *:' my-job.log +``` + +Look back at the format in the example above. Columns after the `:` will first show memory usage, then memory requested, and then the memory allocated to your job. 
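
If you only want one of those columns, you can add `awk` to the pipeline. For example, the following command (a sketch; adjust the field number if your log lines are formatted differently) keeps just the usage column, which is the fourth whitespace-separated field on those lines, and sorts it numerically so that the largest memory usage across all jobs in the log appears last:

``` console
$ grep 'Memory (MB) *:' my-job.log | awk '{print $4}' | sort -n | tail -n 1
```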
+ +Similarly, use the following command to get the disk information from your job logs: + +``` console +$ grep 'Disk (KB) *:' my-job.log +``` + +Here's some example output from running the memory `grep` command on the same eight-job log file: + +``` console +[username@ap40]$ grep 'Memory (MB) *:' my-job.log +Memory (MB) : 692 1024 1024 +Memory (MB) : 714 1024 1024 +Memory (MB) : 703 1024 1024 +Memory (MB) : 699 1024 1024 +Memory (MB) : 705 1024 1024 +Memory (MB) : 704 1024 1024 +Memory (MB) : 711 1024 1024 +Memory (MB) : 697 1024 1024 +``` + +In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory. + + +Other job information +--------------------- + +See if you can come up with `grep` commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison: + +``` console +[username@ap40]$ grep '' my-job.log +760393 - Total Bytes Sent By Job +760395 - Total Bytes Sent By Job +760397 - Total Bytes Sent By Job +760395 - Total Bytes Sent By Job +760393 - Total Bytes Sent By Job +760395 - Total Bytes Sent By Job +760397 - Total Bytes Sent By Job +760395 - Total Bytes Sent By Job + +[username@ap40]$ grep '' my-job.log +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +19240 - Total Bytes Received By Job +``` + +Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using `grep` to pull out specific lines and using `sort` and `uniq` to reduce the output can help you make sense of the information contained in the logs. 
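
For example, if your previous exercises wrote one log file per job, a single pipeline can summarize the exit codes across all of them at once. This sketch assumes the logs end in `.log` and sit in the current directory; the `-h` option stops `grep` from prefixing each match with the filename, so that `uniq -c` can count identical lines:

``` console
$ grep -h 'termination' *.log | sort | uniq -c
```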
diff --git a/docs/materials/software/files/osgus23-software.pdf b/docs/materials/software/files/osgus23-software.pdf new file mode 100644 index 0000000..77a1ca4 Binary files /dev/null and b/docs/materials/software/files/osgus23-software.pdf differ diff --git a/docs/materials/software/files/part1-ex1-blast-dl-folder-linux.png b/docs/materials/software/files/part1-ex1-blast-dl-folder-linux.png new file mode 100644 index 0000000..88c7dde Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-dl-folder-linux.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-dl-folder.png b/docs/materials/software/files/part1-ex1-blast-dl-folder.png new file mode 100644 index 0000000..608c4bb Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-dl-folder.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-dl-list-linux.png b/docs/materials/software/files/part1-ex1-blast-dl-list-linux.png new file mode 100644 index 0000000..2088f6e Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-dl-list-linux.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-dl-list.png b/docs/materials/software/files/part1-ex1-blast-dl-list.png new file mode 100644 index 0000000..2afc09f Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-dl-list.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-dl-page.png b/docs/materials/software/files/part1-ex1-blast-dl-page.png new file mode 100644 index 0000000..a9c8655 Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-dl-page.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-front-page.png b/docs/materials/software/files/part1-ex1-blast-front-page.png new file mode 100644 index 0000000..ab1bd75 Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-front-page.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-landing-page.png b/docs/materials/software/files/part1-ex1-blast-landing-page.png new file mode 100644 index 0000000..63f8266 Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-landing-page.png differ diff --git a/docs/materials/software/files/part1-ex1-blast-src-page.png b/docs/materials/software/files/part1-ex1-blast-src-page.png new file mode 100644 index 0000000..eda9810 Binary files /dev/null and b/docs/materials/software/files/part1-ex1-blast-src-page.png differ diff --git a/docs/materials/software/part1-ex1-run-apptainer.md b/docs/materials/software/part1-ex1-run-apptainer.md new file mode 100644 index 0000000..13424e1 --- /dev/null +++ b/docs/materials/software/part1-ex1-run-apptainer.md @@ -0,0 +1,88 @@ +--- +status: testing +--- + + + +Software Exercise 1.1: Run and Explore Containers +============================================================ + +**Objective**: Run a container interactively + +**Why learn this?**: Being able to run a container directly allows you to confirm +what is installed and whether any additional scripts or code will work in the context +of the container. + +Setup +-------- + +Make sure you are logged into `ap40.uw.osg-htc.org`. For this exercise +we will be using Apptainer containers maintained by OSG staff or existing +containers on Docker Hub. 
+ +We will set two environment variables that will help lighten the load on the +Access Point as we work with containers: + + :::console + $ mkdir ~/apptainer_cache + $ export APPTAINER_CACHEDIR=$HOME/apptainer_cache + $ export TMPDIR=$HOME/apptainer_cache + +Exploring Apptainer Containers +------------------- + +First, let's try to run a container from the [OSG-Supported List](https://portal.osg-htc.org/documentation/htc_workloads/using_software/available-containers-list/). + +1. Find the full path for the `ubuntu 20.04` container image. + +1. To run it, use this command: + + :::console + $ apptainer shell /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-ubuntu-20.04:latest + + It may take a few minutes to start - don't worry if this happens. + +1. Once the container starts, the prompt will change to either `Singularity>` or + `Apptainer>`. Run `ls` and `pwd`. Where are you? Do you see your files? + +1. The `apptainer shell` command will automatically connect your home directory to +the running container so you can use your files. + +1. How do we know we're in a different Linux environment? Try printing out the Linux +version, or checking the version of common tools like `gcc` or Python: + + :::console + $ cat /etc/os-release + $ gcc --version + $ python3 --version + +1. Exit out of the container by typing `exit`. + +1. Type the same commands back on the normal Access Point. Should they give the same +results as when typed in the container, or different? + + :::console + $ cat /etc/os-release + $ gcc --version + $ python3 --version + +Exploring Docker Containers +------------------ + +The process for interactively running a Docker container will be very +similar to an apptainer container. The main difference is a `docker://` prefix +before the container's identifying name. + +1. We are going to be using a [Python image from Docker Hub](https://hub.docker.com/_/python). +Click on the "Tags" tab to see all the different versions of this container that exists. + +1. Let's use version `3.10`. To run it interactively, use this command: + + :::console + $ apptainer shell docker://python:3.10 + +1. Once the container starts and the prompt changes, try running similar commands +as above. What version of Linux is used in this container? Does the version of Python +match what you expect, based on the name of the container? + +1. Once done, type `exit` to leave the container. diff --git a/docs/materials/software/part1-ex2-apptainer-jobs.md b/docs/materials/software/part1-ex2-apptainer-jobs.md new file mode 100644 index 0000000..59423ff --- /dev/null +++ b/docs/materials/software/part1-ex2-apptainer-jobs.md @@ -0,0 +1,74 @@ +--- +status: testing +--- + + + +Software Exercise 1.2: Use Apptainer Containers in OSPool Jobs +============================================================ + +**Objective**: Submit a job that uses an existing apptainer container; compare default +job environment with a specific container job environment. + +**Why learn this?**: By comparing a non-container and container job, you'll better +understand what a container can do on the OSPool. This may also be how you end up +submitting your jobs if you can find an existing apptainer container with your software. + + +Default Environment +------------------- + +First, let's run a job without a container to see what the typical job environment is. + +1. 
Create a bash script with the following lines: + + :::bash + #!/bin/bash + + hostname + cat /etc/os-release + gcc --version + python3 --version + + This will print out the version of Linux on the computer, the version + of `gcc`, a common software compiler, and the version of Python 3. + +1. Make the script executable: + + :::console + $ chmod +x script.sh + +1. Run the script on the Access Point. + + :::console + $ ./script.sh + + What results did you get? + +1. Copy a submit file from a previous OSPool job and edit it so that the +script you just wrote is the executable. + +1. Submit the job and read the standard output file when it completes. What version +of Linux was used for the job? What is the version of `gcc` or Python? + +Container Environment +--------------------- + +Now, let's try running that same script inside a container. + +1. For this job, we will use the OSG-provided Ubuntu "Focal" image, as we did in the previous exercise. The `container_image` submit file option will tell HTCondor to use this container for the job: + + :::file + universe = container + container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-ubuntu-20.04:latest + +1. Submit the job and read the standard output file when it completes. +What version of Linux was used for the job? What is the version of `gcc`? or Python? + +Experimenting With Other Containers +------------- + +1. Look at the list of OSG-Supported containers: [OSG Supported Containers](https://portal.osg-htc.org/documentation/htc_workloads/using_software/available-containers-list/) + +1. Try submitting a job that uses one of these containers. Change the executable +script to explore different aspects of that container. diff --git a/docs/materials/software/part1-ex3-docker-jobs.md b/docs/materials/software/part1-ex3-docker-jobs.md new file mode 100644 index 0000000..c929a33 --- /dev/null +++ b/docs/materials/software/part1-ex3-docker-jobs.md @@ -0,0 +1,62 @@ +--- +status: testing +--- + + + +Software Exercise 1.3: Use Docker Containers in OSPool Jobs +==================================== + +**Objective**: Create a local copy of a Docker container, use it to submit a job. + +**Why learn this?**: Same as the previous exercise; this may also be how you end up +submitting your jobs if you can find an existing Docker container with your software. + +Create Local Copy of Docker Container +------------------- + +While it is technically possible to use a Docker container directly in a job, +there are some good reasons for converting it to a local Apptainer container first. +We'll do this with the same `python:3.10` Docker container we used in the +[first exercise](../part1-ex1-run-apptainer). + +To convert the Docker container to a local Apptainer container, run: + + :::console + $ apptainer build local-py310.sif docker://python:3.10 + +The first argument after `build` is the name of the new Apptainer container file, the +second argument is what we're building from (in this case, Docker). + +Submit File and Executable +------------------- + +1. Make a copy of your submit file from the [previous container exercise](../part1-ex2-apptainer-jobs) or build from an existing submit file. + +1. Add the following lines to the submit file or modify existing lines to match the lines below: + + :::file + universe = container + container_image = local-py310.sif + +1. Use the same executable as the [previous exercise](../part1-ex2-apptainer-jobs). + +1. Once these steps are done, submit the job. 
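
For reference, a complete submit file for this job might look something like the sketch below. The specific names are assumptions carried over from the previous exercises (the `script.sh` test script and the `local-py310.sif` image built above), so adjust them to match the files you actually created:

    :::file
    universe = container
    container_image = local-py310.sif

    # same test script as the previous exercise
    executable = script.sh

    log = container_test.log
    output = container_test.out
    error = container_test.err

    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT

    request_cpus = 1
    request_memory = 1GB
    request_disk = 1GB

    queue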
+ +Finding Docker Containers +------------- + +There are a lot of Docker containers on Docker Hub, but they are not all +created equal. Anyone can create an account on Docker Hub and share container images there, so it’s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure: + +- The container image is updated regularly. +- The container image is associated with a well established company, community, or other group that is well-known. +- There is a Dockerfile or other listing of what has been installed to the container image. +- The container image page has documentation on how to use the container image. [^1] + +1. Can you find a container on [Docker Hub](https://hub.docker.com/) that would be +useful for running Jupyter notebooks that use tensorflow? + +1. Does your chosen image meet at least 2 of the criteria above? + +[^1]: This list and previous text taken from [Introduction to Docker](https://carpentries-incubator.github.io/docker-introduction/) \ No newline at end of file diff --git a/docs/materials/software/part1-ex4-apptainer-build.md b/docs/materials/software/part1-ex4-apptainer-build.md new file mode 100644 index 0000000..8e233a7 --- /dev/null +++ b/docs/materials/software/part1-ex4-apptainer-build.md @@ -0,0 +1,135 @@ +--- +status: testing +--- + + + +Software Exercise 1.4: Build, Test, and Deploy an Apptainer Container +==================================== + +**Objective**: to practice building and using a custom +apptainer container + +**Why learn this?**: You may need to go through this process if you +want to use a container for your jobs and can't find one that has +what you need. + +Motivating Script +----------------- + +1. Create a script called `hello-cow.py`: + + :::file + #!/usr/bin/env python3 + + import cowsay + cowsay.cow('Hello OSG User School') + +1. Give it executable permissions: + + :::console + $ chmod +x hello-cow.py + +1. Try running the script: + + :::console + $ ./hello-cow.py + + It will likely fail, because the cowsay library isn't installed. This is a + scenario where we will want to build our own container that includes a base + Python installation and the `cowsay` Python library. + +Preparing a Definition File +--------------------------- + +We can describe our desired Apptainer image in a special format called a +**definition file**. This has special keywords that will direct Apptainer +when it builds the container image. + +1. Create a file called `py-cowsay.def` with these contents: + + :::file + Bootstrap: docker + From: opensciencegrid/osgvo-ubuntu-20.04:latest + + %post + apt-get update -y + apt-get install -y \ + python3-pip \ + python3-numpy + python3 -m pip install cowsay + +Note that we are starting with the same `ubuntu` base we used in previous +exercises. The `%post` statement includes our installation commands, including +updating the `pip` and `numpy` packages, and then using `pip` to install `cowsay`. + +To learn more about definition files, see [Exercise 3.1](../part3-ex1-apptainer-recipes) + +Build the Container +------------------- + +Once the definition file is complete, we can build the container. + +1. 
Run the following command to build the container:

    :::console
    $ apptainer build py-cowsay.sif py-cowsay.def

As with the Docker image in the [previous exercise](../part1-ex3-docker-jobs),
the first argument is the name to give to the newly created image file and the
second argument is how to build the container image - in this case, from the definition file.

Testing the Image Locally
-------------------

1. Do you remember how to interactively test an image? Look back
at [Exercise 1.1](../part1-ex1-run-apptainer) and guess what command would
allow us to test our new container.

1. Try running:

    :::console
    $ apptainer shell py-cowsay.sif

1. Then try running the `hello-cow.py` script:

    :::console
    Apptainer> ./hello-cow.py

1. If it produces output, our container works! We can now exit (by typing `exit`)
and submit a job.

Submit a Job
--------------

1. Make a copy of a submit file from a previous exercise in this section. Can you
guess what options need to be used or modified?

1. Make sure you have the following (in addition to `log`, `error`, `output` and
CPU and memory requests):

    :::file
    universe = container
    container_image = py-cowsay.sif

    executable = hello-cow.py

1. Submit the job and verify the output when it completes.

    :::file
      _____________________
    | Hello OSG User School |
      =====================
                          \
                           \
                             ^__^
                             (oo)\_______
                             (__)\       )\/\
                                 ||----w |
                                 ||     ||

diff --git a/docs/materials/software/part1-ex5-pick-an-option.md b/docs/materials/software/part1-ex5-pick-an-option.md
new file mode 100644
index 0000000..1e6cfba
--- /dev/null
+++ b/docs/materials/software/part1-ex5-pick-an-option.md
@@ -0,0 +1,77 @@
---
status: testing
---

Software Exercise 1.5 - Choose Software Options
============================================================

**Objective**: Decide how you want to make your software portable

**Why learn this?**: This is the next step to getting your own
research jobs running on the OSPool!

Know Your Software
------------------

Pick at least one software package you want to use on the OSPool as a test subject. Then:

1. Find the download and/or installation page and read through the instructions
and options there.

1. Is the software available as a binary download, or will you need to run some
kind of command to install it or compile it from source? If there are
multiple download/installation options, which is which?

1. What prerequisites does this software need in order to be installed and run?

    > Example 1: an R package will require a base R installation
    >
    > Example 2: some codes require that a library called the "GNU Scientific
    Library" (GSL) already be installed on your computer

Choose a Strategy
------------------

1. Are there any existing containers that contain this software already?
    * Explore [OSG-Supported Containers](https://portal.osg-htc.org/documentation/htc_workloads/using_software/available-containers-list/)
    * Explore [DockerHub](https://hub.docker.com/), for example:
        * [miniconda](https://hub.docker.com/u/continuumio)
        * [rocker](https://hub.docker.com/u/rocker)
        * [jupyter](https://hub.docker.com/u/jupyter)
        * [nvidia](https://hub.docker.com/u/nvidia)
        * (and many more!)

    If yes, try using this container first, as shown in [Exercise 1.2](../part1-ex2-apptainer-jobs) and [Exercise 1.3](../part1-ex3-docker-jobs)

1. Is there a simple download or easy compilation process? If so, can you
   download the software and use it via a wrapper script?
See the exercises from + Part 4 ([Download Software Files](../part4-ex1-download), + [Use a Wrapper Script](../part4-ex2-wrapper), + [Wrapper Script Arguments](../part4-ex3-arguments)). To learn more about using + this approach for specific softwares, see the examples in [Part 5](../../index.html#-software-exercises-5-compiled-software-examples). + +1. Are you using conda? See the specific example in [Exercise 5.3](../part5-ex3-conda) + +1. If neither of the above options works (which may be true for more software!), you + may want to build your own container. + 1. If you want to just use this container on the OSPool, build an + Apptainer container as described in [Exercise 1.4](../part1-ex4-apptainer-build) and + with more information in [Exercise 3.1](../part3-ex1-apptainer-recipes) + 1. If you want to use the container on your own computer or share with + others who would use it on a laptop or desktop, look at the Docker container + example in [Exercise 3.2](../part3-ex2-docker-build). + +Don't do ALL of the software exercises in parts 3 - 5! Instead, choose the section(s) +that makes sense based on how you want to manage your software. Talk to the School +instructors to help make this decision if you are unsure. + +Create an Executable +--------------------- + +Regardless of which approach you use, check out +the [Build an HTC-Friendly Executable](../part2-ex1-build-executable) exercise +for some tips on how to make your script more robust and easy to use with multiple jobs. \ No newline at end of file diff --git a/docs/materials/software/part2-ex1-build-executable.md b/docs/materials/software/part2-ex1-build-executable.md new file mode 100644 index 0000000..9ebc29b --- /dev/null +++ b/docs/materials/software/part2-ex1-build-executable.md @@ -0,0 +1,150 @@ +--- +status: in progress +--- + + + +Software Exercise 2.1 Build an HTC-Friendly Executable +============================================================ + +**Objective**: Modify an existing script to include arguments and headers. + +**Why learn this?**: A little bit of preparation can make it easier to reuse the +same script over and over to run many jobs. + +Setup +------- + +1. Download and unzip a set of Protein Data Bank (PDB) files: + + :::console + $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/alkanes.tar.gz + $ tar -xzf alkanes.tar.gz + +1. For these exercises, we are going to run a command that counts the number of +atoms in the PDB file . Run it now as an example: + + :::console + $ grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb + +Add a Header +---------- + +1. To create a basic script, you can put the command above into a file +called `get_atoms.sh`. To make it clear what language we expect to use to +run the script, we will add the following header on the first line: ``#!/bin/bash` + + :::file + #!/bin/bash + + grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb + + + The "header" of `#!/bin/bash` will tell the computer that this is a bash shell script + and can be run in the same way that you would run individual commands on the command line. + We use `/bin/bash` instead of just `bash` because that is the full path to the `bash` + software file. (Run `which bash` to check!) + + !!! note "Other languages" + We can use the same principle for any scripting language. For example, the header for a Python script + could be either `#!/usr/bin/python3` or `#!/usr/bin/env python3`. Similar logic works + for perl, R, julia and other scripting languages. + +1. Can you now run the script? 
+ + :::console + $ ./get_atoms.sh + +1. This gives "permission denied." Let's add executable permissions to the script +and try again: + + :::console + $ chmod +x get_atoms.sh + $ ./get_atoms.sh + +Incorporate Arguments +---------- + +Can you imagine trying to run this script on all of our pdb files? It would be tedious +to edit it for each one, even for only six inputs. Instead, we should add arguments +to the script to make it easy to reuse the script. +Any information in a script or executable that is going to change or vary across +jobs or analyses should likely be turned into an argument that is specified on the command line. + +1. In our example above, which pieces of the script are likely to change or vary? + +1. The name of the input file (`cubane.pdb`) and output file (`atoms_cubane.pdb`) should +be turned into arguments. Can you envision what our script should look like if we ran it +with input arguments? + +1. Let's say we want to be able to run the following command: + + :::console + $ ./get_atoms.sh cubane.pdb atoms_cubane.pdb + + In order to get arguments from the command line into the script, you have + to use special variables in the script. In bash, these are `$1` (for the first + argument), `$2` (for the second argument) and so on. Try to figure out where + these should go in our `get_atoms.sh` script. + + !!! note "Other Languages" + Each language is going to have its own syntax for reading command line + arguments into the script. In Python, `sys.argv` is a basic method, and more + advanced libraries like `argparse` can be used. In R, the `commandArgs()` function + can do this. Google "command line arguments in ______" to find the right + syntax for your language of choice! + +1. A first pass at adding arguments might look like this: + + :::file + #!/bin/bash + + grep ATOM $1 | wc -l > $2 + + Try running it as described above. Does it work? + +1. While we now have arguments, we have lost some of the readability of our script. The +numbers `$1` and `$2` are not very meaningful in themselves! Let's rewrite the script to +assign the arguments to meaningful variable names: + + :::file + #!/bin/bash + + PDB_INPUT=$1 + PDB_ATOM_OUTPUT=$2 + + grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} + + !!! note "Why curly brackets?" + You'll notice above that we started using curly brackets around our variables. + While you technically don't need them (`$PDB_INPUT` would also be fine), using + them makes the name of the variable (compared to other text) completely clear. + This is especially useful when combining variables with underscores. + +1. There is one final place where we could optimize this script. If we want our output +files to always have the same naming convention, based on the input file name, then +we shouldn't have a separate argument for that -- it's asking for typos. Instead, we +should use variables inside the script to construct the output file name, based on the +input file. That will look like this: + + :::file + #!/bin/bash + + PDB_INPUT=$1 + PDB_ATOM_OUTPUT=atoms_${PDB_INPUT} + + grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} + + You may want to construct other variables, like paths and filenames in this way. But + it depends on how you want to use the script! If we want the flexibility of specifying + a custom output file name, then we should undo this last change so it can be + treated as a separate argument. + +Your Work +---------- + +1. Are you using a scripting language where you could add a header to your main script? +If so, what should it be? 

1. What items in your main code or commands are changing? Do you need to add arguments
to your code?
\ No newline at end of file
diff --git a/docs/materials/software/part3-ex1-apptainer-recipes.md b/docs/materials/software/part3-ex1-apptainer-recipes.md
new file mode 100644
index 0000000..2ed1600
--- /dev/null
+++ b/docs/materials/software/part3-ex1-apptainer-recipes.md
@@ -0,0 +1,98 @@
---
status: testing
---

Software Exercise 3.1: Create an Apptainer Definition File
============================================================

**Objective**: Describe each major section of an Apptainer definition file.

**Why learn this?**: When building your own containers, it is helpful to understand
the basic options and syntax of the "build" or definition file.

| Section |
|---------|
| [Bootstrap/From](#where-to-start) |
| [%files](#files-needed-for-building-or-running) |
| [%post](#commands-to-install) |
| [%environment](#environment) |

Where to start
---------------

    :::file
    Bootstrap: docker
    From: opensciencegrid/osgvo-ubuntu-20.04:latest

A custom container is always built on an existing container. It is common
to use a container on Docker Hub. These lines tell Apptainer to pull the
pre-existing image from Docker Hub, and to use it as the base for the
container that will be built using this definition file.

When choosing a base container, try to find one that has most of what you need - for
example, if you want to install R packages, try to find a container that already
has R installed.

Files needed for building or running
---------------

    :::file
    %files
        source_code.tar.gz /opt
        install.R

If you need specific files for the installation (like source code) or
for the job to execute (like small data files or scripts), they can be
copied into the container under the `%files` section. The first item on a line
is what to copy (from your computer) and the optional second item is where it
should be copied in the container.

Normally the files being copied are in your local working directory where
you run the build command.

Commands to install
---------------

    :::file
    %post
        apt-get update -y
        apt-get install -y \
            build-essential \
            cmake \
            g++ \
            r-base-dev
        install2.r tidyverse

This is where most of the installation happens. You can use any shell command here
that will work in the base container to install software. These commands might include:

- Linux installation tools like `apt` or `yum`
- Scripting-specific installers like `pip`, `conda` or `install.packages()`
- Shell commands like `tar`, `configure`, `make`

Different distributions of Linux often have distinct sets of tools for installing software. The installers for various common Linux distributions are listed below:

- Ubuntu: `apt` or `apt-get`
- Debian: `deb`
- CentOS: `yum`

A web search for “install X on Y Linux” is usually a good start for common software installation tasks. [^1]

When installing to a custom location, do *not* install to a `home` directory. This is
likely to get overwritten when the container is run. Instead, `/opt` is the best
directory for custom installations.

Environment
---------------

    :::file
    %environment
        export PATH=/opt/mycode/bin:$PATH
        export JAVA_HOME=/opt/java-1.8

To set environment variables (especially useful for software in a custom location),
use the `%environment` section of the definition file.
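
Putting these pieces together, a complete definition file might look something like the sketch below. It simply combines the example snippets from the sections above (with `export` added in `%environment`); the base image, file names, package list, and paths are illustrative, so swap in whatever your own software actually needs.

    :::file
    Bootstrap: docker
    From: opensciencegrid/osgvo-ubuntu-20.04:latest

    %files
        source_code.tar.gz /opt
        install.R

    %post
        # install build tools and R, then use install2.r to add R packages
        apt-get update -y
        apt-get install -y build-essential cmake g++ r-base-dev
        install2.r tidyverse

    %environment
        # make the custom installation visible to jobs that run in the container
        export PATH=/opt/mycode/bin:$PATH
        export JAVA_HOME=/opt/java-1.8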
+ +[^1]: This text and previous list taken from [Introduction to Docker](https://carpentries-incubator.github.io/docker-introduction/) \ No newline at end of file diff --git a/docs/materials/software/part3-ex2-docker-build.md b/docs/materials/software/part3-ex2-docker-build.md new file mode 100644 index 0000000..57ffa61 --- /dev/null +++ b/docs/materials/software/part3-ex2-docker-build.md @@ -0,0 +1,125 @@ +--- +status: testing +--- + + + +Software Exercise 3.2: Build Your Own Docker Container (Optional) +==================================== + +**Objective**: Build a custom Docker container with `numpy` and use it in a job + +**Why learn this?**: Docker containers can be run on both your laptop and OSPool. DockerHub +also provides a convenient platform for sharing containers. If you want to use a custom +container, run across platforms, and/or share a container amongst a group, building in +Docker first is a good approach. + +Python Script +------------------- + +1. For this example, create a script called `rand_array.py` on the Access Point. + + :::file + import numpy as np + + #numpy array with random values + a = np.random.rand(4,2,3) + + print(a) + +To run this script, we will need a copy of Python with the `numpy` library. +This exercise will walk you through the steps to build your own Docker container +based on Python, with the `numpy` Python library added on. + +Getting Set Up +-------------- + +Before building your own Docker container, you need to go through the following +set up steps: + +1. Install Docker Dekstop on your computer. + * [Docker Desktop page](https://www.docker.com/products/docker-desktop) + +2. You may need to create a Docker Hub user name to download Docker Desktop; if not +created at that step, create a user name for [Docker Hub](https://hub.docker.com/) now. + +3. (Optional): Once Docker is up and running on your computer, you are welcome to take +some time to explore the basics of downloading and running a container, as shown in +the initial sections of this Docker lesson: + * [Introduction to Docker](https://carpentries-incubator.github.io/docker-introduction/) + However, this isn't strictly necessary for building your own container. + +Building a Container +-------------------- + +In order to make our container reproducible, we will be using Docker's capability +to build a container image from a specification file. + +1. First, create an empty build directory on **your computer**, not the Access Points. + +1. In the build directory, create a file called `Dockerfile` (no file extension!) with +the following contents: + + :::file + # Start with this image as a "base". + # It's as if all the commands that created that image were inserted here. + # Always use a specific tag like "4.10.3", never "latest"! + # The version referenced by "latest" can change, so the build will be + # more stable when building from a specific version tag. + FROM continuumio/miniconda3:4.10.3 + + # Use RUN to execute commands inside the image as it is being built up. + RUN conda install --yes numpy + + # RUN multiple commands together. + # Try to always "clean up" after yourself to reduce the final size of your image. + RUN apt-get update \ + && apt-get --yes install --no-install-recommends graphviz \ + && apt-get --yes clean \ + && rm -rf /var/lib/apt/lists/* + + This is our specification file and provides Docker with the information it needs + to build our new container. 
There are other options besides `FROM` and `RUN`; see + the [Docker documentation](https://docs.docker.com/engine/reference/builder/) for more information. + +1. Note that our container is starting from an existing container +`continuumio/miniconda3:4.10.3`. This container is produced by the `continuumio` +organization; the number `4.10.3` indicates the container version. When we create our +new container, we will want to use a similar naming scheme of: + + USERNAME/CONTAINER:VERSIONTAG + + In what follows, you will want to replace `USERNAME` with your DockerHub user name. + The `CONTAINER` name and `VERSIONTAG` are your choice; in what follows, we will + use `py3-numpy` as the container name and `2021-08` as the version tag. + +1. To build and name the new container, open a command line window on your computer +where you can run Docker commands. Use the `cd` command to change your working directory +to the build directory with the `Dockerfile` inside. + + :::console + $ docker build -t USERNAME/py3-numpy:2021-08 . + + Note the `.` at the end of the command! This indicates that we're using the current + directory as our build environment, including the `Dockerfile` inside. + +Upload Container and Submit Job +------------------------------- + +Right now the container image only exists on your computer. To use it in CHTC or +elsewhere, it needs to be added to a public registry like Docker Hub. + +1. To put your container image in Docker Hub, use the `docker push` command on the +command line: + + :::console + $ docker push USERNAME/py3-numpy:2021-08 + + If the push doesn't work, you may need to run `docker login` first, enter your + Docker Hub username and password and then try the push again. + +1. Once your container image is in DockerHub, you can use it in jobs as described +in [Exercise 1.3](../part1-ex3-docker-jobs). + +> Thanks to [Josh Karpel](https://github.com/JoshKarpel/osg-school-example-dockerfile) for +providing the original sample `Dockerfile`! diff --git a/docs/materials/software/part4-ex1-download.md b/docs/materials/software/part4-ex1-download.md new file mode 100644 index 0000000..622ffa7 --- /dev/null +++ b/docs/materials/software/part4-ex1-download.md @@ -0,0 +1,146 @@ +--- +status: testing +--- + + + +Software Exercise 4.1: Using a Pre-compiled Binary +=================================================== + +**Objective**: Identify software that can be downloaded; download it and use it to run a job. + +**Why learn this?**: Some software doesn't require much "installation" - you can just +download it and run. Recognizing when this is possible can save you time. + +Our Software Example +-------------------- + +The software we will be using for this example is a common tool for +aligning genome and protein sequences against a +reference database, the BLAST program. + +1. Search the internet for the BLAST software. Searches might include +"blast executable or "download blast software". Hopefully these +searches will lead you to a BLAST website page that looks like this: + + ![BLAST landing page](../files/part1-ex1-blast-landing-page.png) + +1. Click on the title that says ["Download +BLAST"](../files/part1-ex1-blast-front-page.png) and then look for the +link that has the [latest installation and source +code](../files/part1-ex1-blast-dl-page.png). 
+ + This will either open a page in a web browser that looks like this: + + ![Download page](../files/part1-ex1-blast-dl-list.png) + + Or you will be asked to open the link in your file browser (choose the + Connect as Guest option): + + ![Download Folder](../files/part1-ex1-blast-dl-folder.png) + + In either case, you should end up on a + page with a list of each version of BLAST that is available for + different operating systems. + +1. We could download the source and compile it ourselves, but instead, +we're going to use one of the pre-built binaries. **Before proceeding, +look at the list of downloads and try to determine which one you want.** + +1. Based on our operating system, we want to use the Linux binary, +which is labelled with the `x64-linux` suffix. + + ![BLAST download page](../files/part1-ex1-blast-dl-list-linux.png) + + ![BLAST download folder](../files/part1-ex1-blast-dl-folder-linux.png) + + All the other links are either for source code or other operating +systems. + +1. On the Access Point, create a directory for +this exercise. Then download the appropriate `tar.gz` file and un-tar/decompress it +it. If you want to do this all from the command line, the sequence will +look like this (using `wget` as the download command.) + + :::console + user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-linux.tar.gz + user@login $ tar -xzf ncbi-blast-2.14.0+-x64-linux.tar.gz + +1. We're going to be using the `blastx` binary in our job. Where is it +in the directory you just decompressed? + +Copy the Input Files +-------------------- + +To run BLAST, we need an input file and reference database. For this +example, we'll use the "pdbaa" database, which contains sequences for +the protein structure from the Protein Data Bank. For our input file, +we'll use an abbreviated fasta file with mouse genome information. + +1. Download these files to your current directory: + + :::console + username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/pdbaa.tar.gz + username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse.fa + +1. Untar the `pdbaa` database: + + :::console + username@login $ tar -xzf pdbaa.tar.gz + + +Submitting the Job +------------------ + +We now have our program (the pre-compiled `blastx` binary) and our input +files, so all that remains is to create the submit file. The form of a +typical `blastx` command looks something like this: + +```file +blastx -db -query -out +``` + +1. Copy a submit file from one of the Day 1 exercises or previous +software exercises to use for this exercise. + +1. Think about which lines you will need to change or add to your submit +file in order to submit the job successfully. In particular: + - What is the executable? + - How can you indicate the entire command line sequence above? + - Which files need to be transferred in addition to the +executable? + - Does this job require a certain type of operating system? + - Do you have any idea how much memory or disk to request? + +1. Try to answer these questions and modify your submit file +appropriately. + +1. Once you have done all you can, check your submit file against the +lines below, which contain the exact components to run this particular +job. + + * The executable is `blastx`, which is located in the `bin` +directory of our downloaded BLAST directory. We need to use the +`arguments` line in the submit file to express the rest of the command. 
+ + :::file + executable = ncbi-blast-2.13.0+/bin/blastx + arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt + + * The BLAST program requires our input file and database, so they +must be transferred with `transfer_input_files`. + + :::file + transfer_input_files = pdbaa, mouse.fa + + * Let's assume that we've run this program before, and we know that +1GB of disk and 1GB of memory will be MORE than enough (the 'log' file +will tell us how accurate we are, after the job runs): + + :::file + request_memory = 1GB + request_disk = 1GB + +1. Submit the blast job using `condor_submit`. Once the job starts, it +should run in just a few minutes and produce a file called +`results.txt`. diff --git a/docs/materials/software/part4-ex2-wrapper.md b/docs/materials/software/part4-ex2-wrapper.md new file mode 100644 index 0000000..3ed5386 --- /dev/null +++ b/docs/materials/software/part4-ex2-wrapper.md @@ -0,0 +1,86 @@ +--- +status: testing +--- + + + +Software Exercise 4.2: Writing a Wrapper Script +============================================================ + +**Objective**: Run downloaded software files via an intermediate, "wrapper" script. + +**Why learn this?**: This change is a good test of your general HTCondor knowledge and +how to translate between executable and submit file. Using wrapper scripts is also a +common practice for managing what happens in a job. + +Background +---------- + +Wrapper scripts are a useful tool for running software that can't be compiled into one piece, needs to be installed with every job, or just for running extra steps. A wrapper script can either install the software from the source code, or use an already existing software (as in this exercise). Not only does this portability technique work with almost any kind of software that can be locally installed, it also allows for a great deal of control and flexibility for what happens within your job. Once you can write a script to handle your software (and often your data as well), you can submit a large variety of workflows to a distributed computing system like the Open Science Grid. + +For this exercise, we will write a wrapper script as an alternate way to run the same job as the previous exercise. + +Wrapper Script, part 1 +---------------------- + +Our wrapper script will be a bash script that runs several commands. + +1. In the same directory as the last exercise, make a file called `run_blast.sh`. + +1. The first line we'll place in the script is the basic command for running blast. Based on our previous submit file, what command needs to go into the script? + +1. Once you have an idea, check against the example below: + + :::bash + #!/bin/bash + + ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt + + +Submit File Changes +------------------- + +We now need to make some changes to our submit file. + +1. Make a copy of your previous submit file and open it to edit. + +1. Since we are now using a wrapper script, that will be our job's executable. Replace the original `blastx` exeuctable with the name of our wrapper script and comment out the arguments line. + + :::file + executable = run_blast.sh + #arguments = + +1. Note that since the `blastx` program is no longer listed as the executable, it will be need to be included in `transfer_input_files`. Instead of transferring just that program, we will transfer the original downloaded `tar.gz` file. 
To achieve efficiency, we'll also transfer the pdbaa database as the original `tar.gz` file instead of as the unzipped folder: + + :::console + transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.12.0+-x64-linux.tar.gz + +1. If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual "Usage" values in the log. + +1. Before submitting, make sure to make the below additional changes to the wrapper script! + +Wrapper Script, part 2 +---------------------- + +Now that our database and BLAST software are being transferred to the job as `tar.gz` files, our script needs to accommodate. + +1. Opening your `run_blast.sh` script, add two commands at the start to un-tar the BLAST and pdbaa `tar.gz` files. See the [previous exercise](../part1-ex1-download) if you're not sure what these commands looks like. + +1. In order to distinguish this job from our previous job, change the output file name to something besides `results.txt`. + +1. The completed script `run_blast.sh` should look like this: + + :::bash + #!/bin/bash + + tar -xzf ncbi-blast-2.12.0+-x64-linux.tar.gz + tar -xzf pdbaa.tar.gz + + ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt + +1. While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: + + :::console + username@login $ chmod u+x run_blast.sh + +Your job is now ready to submit. Submit it using `condor_submit` and monitor using `condor_q`. diff --git a/docs/materials/software/part4-ex3-arguments.md b/docs/materials/software/part4-ex3-arguments.md new file mode 100644 index 0000000..d412849 --- /dev/null +++ b/docs/materials/software/part4-ex3-arguments.md @@ -0,0 +1,104 @@ +--- +status: testing +--- + + + +Software Exercise 4.3: Passing Arguments Through the Wrapper Script +=================================================== + +**Objective**: Add arguments to a wrapper script to make it more flexible and modular + +**Why learn this?**: Using script arguments will allow you to use the same script for +multiple jobs, by providing different inputs or parameters. These +arguments are normally passed on the command line, but in our world of job +submission, the arguments will be listed in the submit file, in the arguments line. + +Identifying Potential Arguments +------------------------------- + +1. In the same directory as the last exercise, make sure you're in the directory with your +BLAST job submission. + +1. What values might we want to input to the script via arguments? +Hint: anything that we might want to change if we were to run the script +many times. + +In this example, some values we might want to change are the name of the +comparison database, the input file, and the output file. + +Modifying Files +--------------- + +1. We are going to add three arguments to the wrapper script, controlling +the database, input and output file. + +1. Make a copy of your last submit file and open it for editing. Add an +arguments line, or uncomment the one that exists, and add the three input +values mentioned above. + +1. The arguments line in your submit file should look like this: + + :::file + arguments = pdbaa mouse.fa results3.txt + + (We're using `results3.txt`) to distinguish between the previous two runs.) + +1. For bash (the language of our current wrapper +script), the variables `$1`, `$2` and `$3` represent the first, second, +and third arguments, respectively. 
Thus, in the main command of the script, +replace the various names with these variables: + + :::bash + ncbi-blast-2.12.0+/bin/blastx -db $1/$1 -query $2 -out $3 + + > If your wrapper script is in a different language, you should use + that language's syntax for reading in variables from the command line. + +1. Once these changes are made, submit your jobs with `condor_submit`. +Use `condor_q -nobatch` to see what the job command looks like to +HTCondor. + +It is now easy to change the inputs for the job; we can write them into +the arguments line of the submit file and they will be propagated to the +command in the wrapper script. We can even turn the submit file arguments +into their *own* variables when submitting multiple jobs at once. + +Readability with Variables +--------------- + +One of the downsides of this approach, is that our command has become +harder to read. The original script contains all the information at a glance: + + :::bash + ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt + +But our new version is more cryptic -- what is `$1`?: + + :::bash + ncbi-blast-2.10.1+/bin/blastx -db $1 -query $2 -out $3 + +One way to overcome this is to create our own variable names inside the wrapper +script and assign the argument values to them. Here is an example for our +BLAST script: + + :::bash + #!/bin/bash + + DATABASE=$1 + INFILE=$2 + OUTFILE=$3 + + tar -xzf ncbi-blast-2.10.1+-x64-linux.tar.gz + tar -xzf pdbaa.tar.gz + + ncbi-blast-2.10.1+/bin/blastx -db $DATABASE/$DATABASE -query $INFILE -out $OUTFILE + +Here, we are assigning the input arguments (`$1`, `$2` and `$3`) to new variable names, and +then using **those** names (`$DATABASE`, `$INFILE`, and `$OUTFILE`) in the command, +which is easier to read. + +1. Edit your script to match the above syntax. + +1. Submit your jobs with `condor_submit`. When the job finishes, look at the job's +standard output file to see how the variables printed. diff --git a/docs/materials/software/part5-ex1-prepackaged.md b/docs/materials/software/part5-ex1-prepackaged.md new file mode 100644 index 0000000..7732292 --- /dev/null +++ b/docs/materials/software/part5-ex1-prepackaged.md @@ -0,0 +1,142 @@ +--- +status: testing +--- + + + +Software Exercise 5.1: Pre-package a Research Code +========================================== + +**Objective**: Install software (HHMER) to a folder and run it in a job using a wrapper script. + +**Why learn this?**: If not using a container, this is a template for how to create +a portable software installation using your own files, especially if the software +is not available already compiled for Linux. + +Our Software Example +-------------------- + +For this exercise, we will be using the bioinformatics package HMMER. HMMER is a good example of software that is not compiled to a single executable; it has multiple executables as well as a helper library. + +1. Create a directory for this exercise on the Access Point. + +1. Do an internet search to find the HMMER software downloads page and the +installation instructions page. On the installation page, there are short instructions for how to install HMMER. There are two options shown for installation -- which should we use? + +1. For the purposes of this example, we are going to use the instructions under the heading "...to obtain and compile from source." Download the HMMER source as shown in +these instructions (command should start with `wget`) + +1. 
Go back to the installation +documentation page and look at the steps for compiling from source. This process +should be similar to what was described in the lecture! + +Installation +------------ + +Normally, it is better to install software on a dedicated "build" server, but +for this example, we are going to compile directly on the Access Point + +1. Before we follow the installation instructions, we should create a directory to hold our installation. You can create this in the current directory. + + :::console + username@host $ mkdir hmmer-build + +1. Now run the commands to unpack the source code: + + :::console + username@host $ tar -zxf hmmer.tar.gz + username@host $ cd hmmer-3.3.2 + +1. Now we can follow the second set of installation instructions. For the prefix, we'll use the variable `$PWD` to capture the name of our current working directory and then a relative path to the `hmmer-build` directory we created in step 1: + + :::console + username@host $ ./configure --prefix=$PWD/../hmmer-build + username@host $ make + username@host $ make install + +1. **Go back to the previous working directory**: + + :::console + username@host $ cd .. + + and confirm that our installation procedure created `bin`, `lib`, and `share` directories in the `hmmer-build` folder: + + :::console + username@host $ ls hmmer-build + bin share + +1. Now we want to package up our installation, so we can use it in other jobs. We can do this by compressing any necessary directories into a single gzipped tarball. + + :::console + username@host $ tar -czf hmmer-build.tar.gz hmmer-build + +Note that we now have two tarballs in our directory -- the *source* tarball (`hmmer.tar.gz`), which we will no longer need and our newly built installation (`hmmer-build.tar.gz`) which is what we will actually be using to run jobs. + +Wrapper Script +-------------- + +Now that we've created our portable installation, we need to write a script that opens and uses the installation, similar to the process we used in a [previous exercise](../part4-ex2-wrapper). These steps should be performed back on the submit server (`ap1.facility.path-cc.io`). + +1. Create a script called `run_hmmer.sh`. + +1. The script will first need to untar our installation, so the script should start out like this: + + :::bash + #!/bin/bash + + tar -xzf hmmer-build.tar.gz + +1. We're going to use the same `$PWD` trick from the installation in order to tell the computer how to find HMMER. We will do this by setting the `PATH` environment variable, to include the directory where HMMER is installed: + + :::bash + export PATH=$PWD/hmmer-build/bin:$PATH + +1. Finally, the wrapper script needs to not only setup HMMER, but actually run the program. Add the following lines to your `run_hmmer.sh` wrapper script. + + :::bash + hmmbuild globins4.hmm globins4.sto + hmmsearch -o search-results.txt globins4.hmm globins45.fa + +1. Make sure the wrapper script has executable permissions: + + :::console + username@ap1 $ chmod u+x run_HMMER.sh + + +Run a HMMER job +------------------- + +We're almost ready! We need two more pieces to run a HMMER job. + +1. We're going to use some of the tutorial files provided with the HMMER download to +run the job. 
You already have these files back in the directory where you unpacked the source code: + + :::console + username@ap1 $ ls hmmer-3.3.2/tutorial + 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm + dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto + + If you don't see these files, you may want to redownload the `hmmer.tar.gz` file and untar it here. + +1. Our last step is to create a submit file for our HMMER job. Think about which lines this submit file will need. Make a copy of a previous submit file (you could use the blast submit file from a [previous exercise](../part4-ex2-wrapper) as a base) and modify it as you think necessary. + +1. The two most important lines to modify for this job are listed below; check them against your own submit file: + + :::file + executable = run_hmmer.sh + transfer_input_files = hmmer-build.tar.gz, hmmer-3.3.2/tutorial/ + + A wrapper script will always be a job's `executable`. + When using a wrapper script, you must also always remember to transfer the software/source code using + `transfer_input_files`. + + !!! note + The `/` in the `transfer_input_files` line indicates that we are transferring the *contents* of that directory (which in this case, is what we want), rather than the directory itself. + +1. Submit the job with `condor_submit`. + +1. Once the job completes, it should produce a `search-results.txt` file. + + !!! note + For a very similar compiling example, see this guide on how to + compile `samtools`: [Example Software Compilation](https://support.opensciencegrid.org/support/solutions/articles/12000074984-example-software-compilation) diff --git a/docs/materials/software/part5-ex2-python.md b/docs/materials/software/part5-ex2-python.md new file mode 100644 index 0000000..acb5417 --- /dev/null +++ b/docs/materials/software/part5-ex2-python.md @@ -0,0 +1,149 @@ +--- +status: testing +--- + + + +Software Exercise 5.2: Using Python, Pre-Built +=============================================== + +In this exercise, you will install Python, package your installation, and then use it to run jobs. It should take about 20 minutes. + +Background +---------- + +**Objective**: Install software (Python) to a folder and run it in a job using a wrapper script. + +**Why learn this?**: This is very similar to the [previous exercise](part5-ex1-prepackaged.md). + + +Interactive Job for Pre-Building +-------------------------------- + +The first step in our job process is building a Python installation that we can package up. + +1. Create a directory for this exercise on the Access Point and `cd` into it. +1. Download the Python source code from . + + :::console + username@ap1 $ wget https://www.python.org/ftp/python/3.10.5/Python-3.10.5.tgz + +1. First, we have to determine how to install Python to a specific location in our working directory. + 1. Untar the Python source tarball (`tar -xzf Python-3.10.5.tgz`) and look at the `README.rst` file in the `Python-3.10.5` directory (`cd Python-3.10.5`). You'll want to look for the "Build Instructions" header. What will the main installation steps be? What command is required for the final installation? Once you've tried to answer these questions, move to the next step. + 1. There are some basic installation instructions near the top of the `README`. 
Based on that short introduction, we can see the main steps of installation will be: + + ./configure + make + make test + sudo make install + + This three-stage process (configure, make, make install) is a common way to install many software packages. The default installation location for Python requires `sudo` (administrative privileges) to install. However, we'd like to install to a specific location in the working directory so that we can compress that installation directory into a tarball. + + 1. You can often use an option called `-prefix` with the `configure` script to change the default installation directory. Let's see if the Python `configure` script has this option by using the "help" option (as suggested in the `README.rst` file): + + :::console + username@host $ ./configure --help + + Sure enough, there's a list of all the different options that can be passed to the `configure` script, which includes `--prefix`. (To see the `--prefix` option, you may need to scroll towards the top of the output.) Therefore, we can use the `$PWD` command in order to set the path correctly to a custom installation directory. + +1. Now let's actually install Python! + 1. **From the original working directory**, create a directory to hold the installation. + + :::console + username@host $ cd ../ + username@host $ mkdir python310 + + 1. Move into the `Python-3.10.5` directory and run the installation commands. These may take a few minutes each. + + :::console + username@host $ cd Python-3.10.5 + username@host $ ./configure --prefix=$PWD/../python310 + username@host $ make + username@host $ make install + + !!! note + The installation instructions in the `README.rst` file have a `make test` step + between the `make` and `make install` steps. As this step isn't strictly necessary (and takes a long time), it's been omitted above. + + 1. If I move back to the main job working directory, and look in the `python` subdirectory, I should see a Python installation. + + :::console + username@host $ cd .. + username@host $ ls python310/ + bin include lib share + + 1. I have successfully created a self-contained Python installation. Now it just needs to be tarred up! + + :::console + username@host $ tar -czf prebuilt_python.tar.gz python310/ + +1. We might want to know how we installed Python for later reference. Enter the following commands to save our history to a file: + + :::console + username@host $ history > python_install.txt + +Python Script +------------- + +1. Create a script with the following lines called `fib.py`. + + :::python + import sys + import os + + if len(sys.argv) != 2: + print('Usage: %s MAXIMUM' % (os.path.basename(sys.argv[0]))) + sys.exit(1) + maximum = int(sys.argv[1]) + n1 = n2 = 1 + while n2 <= maximum: + n1, n2 = n2, n1 + n2 + print('The greatest Fibonacci number up to %d is %d' % (maximum, n1)) + +1. What command line arguments does this script take? Try running it on the submit server. + +Wrapper Script +-------------- + +We now have our Python installation and our Python script - we just need to write a wrapper script to run them. + +1. What steps do you think the wrapper script needs to perform? Create a file called `run_fib.sh` and write them out in plain English before moving to the next step. +1. Our script will need to + 1. untar our `prebuilt_python.tar.gz` file + 1. access the `python` command from our installation to run our `fib.py` script +1. Try turning your plain English steps into commands that the computer can run. +1. 
Your final `run_fib.sh` script should look something like this: + + :::bash + #!/bin/bash + + tar -xzf prebuilt_python.tar.gz + python310/bin/python3 fib.py 90 + + or + + :::bash + #!/bin/bash + + tar -xzf prebuilt_python.tar.gz + export PATH=$(pwd)/python310/bin:$PATH + python3 fib.py 90 + +1. Make sure your `run_fib.sh` script is executable. + +Submit File +----------- + +1. Make a copy of a previous submit file in your local directory (the submit file from +the [Use a Wrapper Script exercise](../part4-ex2-wrapper) might be a good candidate). What changes need to be made to run this Python job? + +1. Modify your submit file, then make sure you've included the key lines below: + + :::file + executable = run_fib.sh + transfer_input_files = fib.py, prebuilt_python.tar.gz + +1. Submit the job using `condor_submit`. + +1. Check the `.out` file to see if the job completed. + diff --git a/docs/materials/software/part5-ex3-conda.md b/docs/materials/software/part5-ex3-conda.md new file mode 100644 index 0000000..d75eee6 --- /dev/null +++ b/docs/materials/software/part5-ex3-conda.md @@ -0,0 +1,118 @@ +--- +status: testing +--- + + + +Software Exercise 5.3: Using Conda Environments +==================================== + +**Objective**: Create a portable conda environment and use it in a job. + +**Why learn this?**: If you normally use `conda` to manage your Python environments, +this method of software portability offers great similarity to your usual practices. + +Introduction +------------ + +Many Python users manage their Python installation and environments with either the +`Anaconda` or `miniconda` distributions. These distribution tools are great +for creating portable Python installations and can be used on HTC systems with +some help from a tool called `conda pack`. + +Sample Script +------------------- + +1. For this example, create a script called `rand_array.py` on the Access Point. + + :::file + import numpy as np + + #numpy array with random values + a = np.random.rand(4,2,3) + + print(a) + +To run this script, we will need a copy of Python with the `numpy` library. + +Create and Pack a Conda Environment +------------------ + +(For a generic version of these instructions, see the [CHTC User Guide](http://chtc.cs.wisc.edu/conda-installation)) + +1. Our first step is to create a miniconda installation on the submit server. + 1. You should be logged into whichever server you made the `rand_array.py` script on. + 2. Download the latest Linux [miniconda installer](https://docs.conda.io/en/latest/miniconda.html) + + :::console + user@login $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh + + 3. Run the installer to install miniconda; you'll need to accept the license terms and + you can use the default installation location: + + :::console + [user@login]$ sh Miniconda3-latest-Linux-x86_64.sh + + 4. At the end, you can choose whether or + not to "initialize Miniconda3 by running conda init?" The default is no; you would + then run the `eval` command listed by the installer to "activate" Miniconda. If you + choose "no" you'll want to save this command so that you can reactivate the + Miniconda installation when needed in the future. + +2. Next we'll create our conda "environment" with `numpy` (we've called the environment "py3-numpy"): + + :::console + (base)[user@login]$ conda create -n py3-numpy + (base)[user@login]$ conda activate py3-numpy + (py3-numpy)[user@login]$ conda install -c conda-forge numpy + +3. 
Once everything is installed, deactivate the environment to go back to the +Miniconda "base" environment. + + :::console + (py3-numpy)[user@login]$ conda deactivate + +4. We'll now install a tool that will pack up the just created conda environment +so we can run it elsewhere. Make sure that your job's Miniconda environment is created, but deactivated, so +that you're in the "base" Miniconda environment, then run: + + :::console + (base)[user@login]$ conda install -c conda-forge conda-pack + + Enter `y` when it asks you to install. + +5. Finally, we will run the `conda pack` command, which will automatically create a +tar.gz file with our environment: + + :::console + (base)[user@login]$ conda pack -n py3-numpy + +Submit a Job +------------- + +1. The executable for this job will need to be a wrapper script. What steps do you +think need to be included? Write down a rough draft, then compare with the following script. + +3. Create a wrapper script like the following: + + :::file + #!/bin/bash + + set -e + + export PATH + mkdir py3-numpy + tar -xzf py3-numpy.tar.gz -C py3-numpy + . py3-numpy/bin/activate + + python3 rand_array.py + +4. What needs to be included in your submit file for the job to run successfully? Try +yourself and then check the suggestions in the next point. + +5. In your submit file, make sure to have the following: + - Your executable should be the the bash script you created in the previous step. + - Remember to transfer your Python script and the environment `tar.gz` file via + `transfer_input_files`. + +6. Submit the job and see what happens! diff --git a/docs/materials/software/part5-ex4-compiling.md b/docs/materials/software/part5-ex4-compiling.md new file mode 100644 index 0000000..9d3bce0 --- /dev/null +++ b/docs/materials/software/part5-ex4-compiling.md @@ -0,0 +1,101 @@ +--- +status: testing +--- + + + +Software Exercise 5.4: Compile Statically Linked Code +========================================================== + +**Objective**: Compile code using static linking, explain why this can be useful. + +**Why learn this?**: When code is compiled, it is usually linked to other pieces +of code on the computer. This can cause it to not work when moved to other computers. +Static linking means that all the needed references are included in the compiled code, +meaning that it can run almost anywhere. + +Our Software Example +-------------------- + +For this compiling example, we will use a script written in C. C code depends on libraries and therefore will benefit from being statically linked. + +Our C code prints 7 rows of Pascal's triangle. + +1. Log into the Access Point. Create a directory for this exercise and `cd` into it. +1. Copy and paste the following code into a file named `pascal.c`. + + :::c++ + #include "stdio.h" + + long factorial(int); + + int main() + { + int i, n, c; + n=7; + for (i = 0; i < n; i++){ + for (c = 0; c <= (n - i - 2); c++) + printf(" "); + for (c = 0 ; c <= i; c++) + printf("%ld ",factorial(i)/(factorial(c)*factorial(i-c))); + printf("\n"); + } + return 0; + } + + long factorial(int n) + { + int c; + long result = 1; + for (c = 1; c <= n; c++) + result = result*c; + return result; + } + +Compiling +--------- + +In order to use this code in a job, we will first need to statically compile the code. + +1. Most linux servers (including our Access Point) have the `gcc` (GNU compiler collection) installed, so we already have a compiler on the Access Point. 
Furthermore, this is a simple piece of C code, so the compilation will not be computationally intensive. Thus, we should be able to compile directly on the Access Point.
+
+1. Compile the code, using the command:
+
+    :::console
+    username@login $ gcc -static pascal.c -o pascal
+
+    Note that we have added the `-static` option to make sure that the compiled binary includes the necessary libraries. This will allow the code to run on any Linux machine, no matter where those libraries are located.
+
+1. Verify that the compiled binary was statically linked:
+
+    :::console
+    username@login $ file pascal
+
+The Linux `file` command provides information about the *type* or *kind* of file that is given as an argument. In this case, you should get output like this:
+
+```hl_lines="2 3"
+username@host $ file pascal
+pascal: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked,
+for GNU/Linux 2.6.18, not stripped
+```
+
+The output clearly states that this executable (software) is statically linked. The same command run on a non-statically linked executable file would include the text `dynamically linked (uses shared libs)` instead. So with this simple verification step, which could even be run on files that you did not compile yourself, you have some further reassurance that it is safe to use on other Linux machines. (Bonus exercise: try the `file` command on lots of other files.)
+
+Submit the Job
+--------------
+
+Now that our code is compiled, we can use it to submit a job.
+
+1. Think about what submit file lines we need to use to run this job:
+    - Are there input files?
+    - Are there command line arguments?
+    - Where is its output written?
+
+1. Based on what you thought about in step 1, find a submit file from earlier that you can modify to run our compiled `pascal` code.
+
+1. Copy it to the directory with the `pascal` binary and make those changes.
+
+1. Submit the job using `condor_submit`.
+
+1. Once the job has run and left the queue, you should be able to see the results (seven rows of Pascal's triangle) in the `.out` file created by the job.
+
diff --git a/docs/materials/special/files/osgus23-special.pdf b/docs/materials/special/files/osgus23-special.pdf
new file mode 100644
index 0000000..9b24572
Binary files /dev/null and b/docs/materials/special/files/osgus23-special.pdf differ
diff --git a/docs/materials/special/files/osgus23-special.pptx b/docs/materials/special/files/osgus23-special.pptx
new file mode 100644
index 0000000..aae8b1f
Binary files /dev/null and b/docs/materials/special/files/osgus23-special.pptx differ
diff --git a/docs/materials/special/part1-ex1-gpus.md b/docs/materials/special/part1-ex1-gpus.md
new file mode 100644
index 0000000..2cd2385
--- /dev/null
+++ b/docs/materials/special/part1-ex1-gpus.md
@@ -0,0 +1,108 @@
+---
+status: testing
+---
+
+Exercise 1.1: GPUs
+==================
+
+Exploring Availability
+----------------------
+
+For this exercise, we will use the `ap40.uw.osg-htc.org` access point. Log in:
+
+``` hl_lines="1"
+$ ssh USERNAME@ap40.uw.osg-htc.org
+```
+
+Let's first explore what GPUs are available in the OSPool. Remember
+that the pool is dynamic - resources are being added and removed all
+the time - but we can at least find out what GPUs are in the pool
+right now. Run:
+
+    :::console
+    user@ap40 $ condor_status -const 'GPUs > 0'
+
+Once you have that list, pick one of the resources and look at the
+ClassAd using the `-l` flag. 
For example:
+
+    :::console
+    user@ap40 $ condor_status -l [MACHINE]
+
+Using the `-autoformat` flag, explore the different attributes
+of the GPUs. Some interesting attributes might be `GPUs_DeviceName`,
+`GPUs_Capability`, `GLIDEIN_Site` and `GLIDEIN_ResourceName`.
+
+Compare the `Mips` number of a GPU slot with that of a regular slot. Does
+the `Mips` number indicate that GPUs can be much faster than CPUs?
+Why/why not?
+
+
+A sample GPU job
+----------------
+
+Create a file named `mytf.py` and chmod it to be executable. The
+content is a sample TensorFlow script:
+
+```
+#!/usr/bin/python3
+
+# http://learningtensorflow.com/lesson10/
+
+import sys
+import numpy as np
+import tensorflow as tf
+from datetime import datetime
+
+tf.debugging.set_log_device_placement(True)
+
+# Create some tensors
+a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
+b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
+c = tf.matmul(a, b)
+
+print(c)
+```
+
+Then, create a submit file to run the code on a GPU, using a
+TensorFlow container image. The new bits of the submit file
+are provided below, but you will have to fill in the rest
+from what you have learnt earlier in the User School.
+
+```
+universe = container
+container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1
+
+executable = mytf.py
+
+request_gpus = 1
+```
+
+Note that TensorFlow also requires the AVX2 CPU extensions. Remember
+that AVX2 is available in the `x86_64-v3` and `x86_64-v4`
+microarchitectures. Add a `requirements` line stating that
+`Microarch` has to be one of those two (the *or* operator in
+ClassAd expressions is `||`).
+
+Submit the job and watch the queue. Did the job start
+running as quickly as when we ran CPU jobs? Why/why not?
+
+Examine the out/err files. Do they indicate somewhere that
+the job was mapped to a GPU? (Hint: search for
+`Created TensorFlow device`.)
+
+Keep a copy of the out/err. Modify the submit file to _not_ run
+on a GPU, and then try the job again. Did the job work? Does
+the err from the CPU job look anything like the GPU err?
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/materials/troubleshooting/files/.empty b/docs/materials/troubleshooting/files/.empty
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/docs/materials/troubleshooting/files/.empty
@@ -0,0 +1 @@
+
diff --git a/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pdf b/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pdf
new file mode 100644
index 0000000..0c5f02b
Binary files /dev/null and b/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pdf differ
diff --git a/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pptx b/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pptx
new file mode 100644
index 0000000..4e3682c
Binary files /dev/null and b/docs/materials/troubleshooting/files/OSGUS2023_troubleshooting.pptx differ
diff --git a/docs/materials/troubleshooting/part1-ex1-troubleshooting.md b/docs/materials/troubleshooting/part1-ex1-troubleshooting.md
new file mode 100644
index 0000000..9c0ca84
--- /dev/null
+++ b/docs/materials/troubleshooting/part1-ex1-troubleshooting.md
@@ -0,0 +1,111 @@
+---
+status: in progress
+---
+
+# OSG Exercise 2.1: Troubleshooting Jobs
+
+The goal of this exercise is to practice troubleshooting some common problems
+that you may encounter when submitting jobs using HTCondor.
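+
+Before you start, it may help to keep a few general-purpose HTCondor commands at hand for diagnosing jobs; `JOB_ID` below is just a placeholder for an actual cluster or job ID:
+
+``` console
+user@server $ condor_q -hold                     # list your held jobs and why they were held
+user@server $ condor_q -better-analyze JOB_ID    # explain why a job is idle or not matching
+user@server $ condor_history JOB_ID              # inspect a job that has already left the queue
+```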
+ +This exercise should work on either of the access points- OSPool or Path Facility + +**Note:** This exercise is a little harder than some others. +To complete it, you will have to find and fix several issues. +Be patient, keep trying, but if you really get stuck, +you can ask for help or look at the very bottom of this page for a link to answers. +But try not to look at the answers! + +## Acquiring the Materials + +We have prepared some Python code, data, and submit files for this exercise: + +1. Log into an Access Point +1. Download a tarball of the materials: + + :::console + user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting.tar.gz + +3. Extract the tarball using the commands that you learned earlier +4. Change into the newly extracted directory and explore its contents — + resist the temptation to fix things right away! + +## Solving a Project Euler Problem + +The contents of the tarball contain a series of submit files, Python scripts, and an input file +that are designed to solve [Project Euler problem 98](https://projecteuler.net/problem=98): + +> By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = +> 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square +> number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes +> are not permitted, neither may a different letter have the same digital value as another letter. +> +> Using p098_words.txt, a 16K text file containing nearly two-thousand common English words, find all the square +> anagram word pairs (a palindromic word is NOT considered to be an anagram of itself). +> +> What is the largest square number formed by any member of such a pair? +> +> **NOTE:** All anagrams formed must be contained in the given text file. + +Unfortunately, there are many issues with the submit files that you will have to work through +before you can obtain the solution to the problem! +The code in the Python scripts themselves is, in theory, free of bugs. + +## Finding anagrams + +The first step in our workflow takes an input file with a list of words (`p098_words.txt`) +and extracts all of the anagrams using the `find_anagrams.py` script. +Naturally, we want to run this as an HTCondor job, so: + +1. Submit the accompanying `find-anagrams.sub` file from the tarball. +1. Resolve any issues that you encounter until the job returns pairs of anagrams as its output. + +Once you have satisfactory output, move onto the next section. + +!!! note "Please be polite" + Access points are shared resources, so you should clean up after yourself. + If you discover any jobs in the Hold state, and after you are done troubleshooting them, + remove them with the following command: + + :::console + user@server $ condor_rm -const 'JobStatus =?= 5' + + | Where replacing `` with... | Will remove... | + |--------------------------------------------------------|---------------------------------------------| + | Your username (e.g. `blin`) | All of your held jobs | + | A cluster ID (e.g. `74078`) | All held jobs matching the given cluster ID | + | A job ID (e.g. `97932.30`) | That specific held job | + +## Finding the largest square + +The next step in the workflow uses the `max_square.py` script to find the largest square number, +if any, for a given anagram word pair. 
+Let's submit jobs that run `max_square.py` for all of the anagram word pairs (i.e., one job per word pair), +that you found in the previous section: + +1. Submit the accompanying `squares.sub` file from the tarball +1. Resolve any issues that you encounter until you receive output for each job. + Note that some jobs may have empty output since not all anagram word pairs are *square* anagram word pairs. + +Next, you can find the largest square among your output by directly using the command line. +For example, if all of your job output has been placed in the `squares` directory +and are named `square-1.out`, `square-2.out`, etc., +then you could run the following command to find the largest square: + +``` console +user@server $ cat squares/square-*.out | sort -n | tail -n 1 +``` + +You can check if you have the right answer with any of the OSG staff +or by submitting the answer to Project Euler (requires an account). + +## Answer Key + +There is also a working solution on our web server that can be retrieved with + +``` console +user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting-key.tar.gz +``` + +It contains comments labeled `SOLUTION` that you can consult in case you get stuck. +Like any answer key, it is mainly useful as a verification tool, +so try to only use it as a last resort or for detailed explanations to improve your understanding. diff --git a/docs/materials/troubleshooting/part1-ex2-job-retry.md b/docs/materials/troubleshooting/part1-ex2-job-retry.md new file mode 100644 index 0000000..7a26969 --- /dev/null +++ b/docs/materials/troubleshooting/part1-ex2-job-retry.md @@ -0,0 +1,111 @@ +--- +status: testing +--- + + + +Exercise 1.2: Retries +============================ + +The goal of this exercise is to demonstrate running a job that intermittently fails and thus could benefit from having HTCondor automatically retry it. + +This first part of the exercise should take only a few minutes, and is designed to setup the next exercises. + +Bad Job +------- + +Let’s assume that a colleague has shared with you a program, and it fails once in a while. In the real world, we would probably just fix the program, but what if you cannot change the software? Unfortunately, this situation happens more often than we would like. + +Below is a Python script that fails once in a while. +We will not fix it, but instead use it to simulate a program that can fail and that we **cannot** fix. + +``` python +#!/usr/bin/env python3 + +# murphy.py simulates a real program with real problems +import random +import sys +import time + +# For one out of every three attempts, simulate a runtime error +if random.randint(0, 2) == 0: + # Intentionally don't print any output + sys.exit(15) +else: + time.sleep(3) + print("All work done correctly") + +# By convention, zero exit code means success +sys.exit(0) +``` + +Let’s see what happens when a program like this one is run in HTCondor. + +1. In a new directory for this exercise, save the script above as `murphy.py`. +1. Write a submit file for the script; `queue 20` instances of the job and be sure to ask for 20 MB of memory and disk. +1. Submit the file, note the ClusterId, and wait for the jobs to finish. + +What output do you expect? What output did you get? If you are curious about the exit code from the job, it is saved in completed jobs in `condor_history` in the `ExitCode` attribute. 
The following command will show the `ExitCode` for a given cluster of jobs: + +``` console +user@server $ condor_history -af:h ProcId ExitCode +``` + +(Be sure to replace `` with your actual cluster ID. The command may take a minute or so to complete.) + +How many of the jobs succeeded? How many failed? + +Retrying Failed Jobs +-------------------- + +Now let’s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption. + +From the lecture materials, implement the `max_retries` feature to retry any job with a non-zero exit code up to 5 times, then resubmit the jobs. Did your change work? + +After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the `NumJobStarts` attribute of a completed job with the `condor_history` command, in the same way you looked at the `ExitCode` attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their `ExitCode`; and what about the `ExitCode` from earlier execution attempts? + +A (Too) Long Running Job +------------------------ + +Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file: + +``` python +#!/usr/bin/env python3 + +# murphy.py simulate a real program with real problems +import random +import sys +import time + +# For one out of every three attempts, simulate an "infinite" loop +if random.randint(0, 2) == 0: + # Intentionally don't print any output + time.sleep(3600) + sys.exit(15) +else: + time.sleep(3) + print("All work done correctly") + +# By convention, zero exit code means success +sys.exit(0) +``` + +Let’s see what happens when a program like this one is run in HTCondor. + +1. Save the script to a new file named `murphy2.py`. +1. Copy your previous submit file to a new name and change the `executable` to `murphy2.py`. +1. If you like, submit the new file — but after a while be sure to remove the whole cluster to clear out the “hung” jobs. +1. Now try to change the submit file to automatically remove any jobs that **run** for more than one minute. You can make this change with just a single line in your submit file + + :::file + periodic_remove = (JobStatus == 2) && ( (CurrentTime - EnteredCurrentStatus) > 60 ) + +1. Submit the new file. Do the long running jobs get removed? What does `condor_history` show for the cluster after all jobs are done? Which job status (i.e. idle, held, running) do you think `JobStatus == 2` corresponds to? + +Bonus Exercise +-------------- + +If you have time, edit your submit file so that instead of removing long running jobs, +HTCondor will automatically put the long-running job on hold, +and then automatically release it. 
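+
+Try it on your own first. If you want to check your approach afterwards, one possible sketch uses the `periodic_hold` and `periodic_release` submit file commands; the time limits below are arbitrary values chosen for this exercise:
+
+``` file
+# hold a job once it has been running for more than one minute
+periodic_hold = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 60)
+# release a held job after it has been on hold for 30 seconds, so it can try again
+periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 30)
+```
+
+In practice you would usually also cap the number of retries (for example, by adding `&& (NumJobStarts < 5)` to the `periodic_release` expression) so that a permanently broken job does not cycle forever.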
+ diff --git a/docs/materials/welcome/files/osgus23-day1-part1-welcome-timc.pdf b/docs/materials/welcome/files/osgus23-day1-part1-welcome-timc.pdf new file mode 100644 index 0000000..855d78f Binary files /dev/null and b/docs/materials/welcome/files/osgus23-day1-part1-welcome-timc.pdf differ diff --git a/docs/materials/workflows/files/osgus23-dagman.pdf b/docs/materials/workflows/files/osgus23-dagman.pdf new file mode 100644 index 0000000..9ad8084 Binary files /dev/null and b/docs/materials/workflows/files/osgus23-dagman.pdf differ diff --git a/docs/materials/workflows/files/osgus23-dagman.pptx b/docs/materials/workflows/files/osgus23-dagman.pptx new file mode 100644 index 0000000..8ae288d Binary files /dev/null and b/docs/materials/workflows/files/osgus23-dagman.pptx differ diff --git a/docs/materials/workflows/files/osgvsp20-workflows-part1-ex1-simple-dag.gif b/docs/materials/workflows/files/osgvsp20-workflows-part1-ex1-simple-dag.gif new file mode 100644 index 0000000..0cd63e5 Binary files /dev/null and b/docs/materials/workflows/files/osgvsp20-workflows-part1-ex1-simple-dag.gif differ diff --git a/docs/materials/workflows/part1-ex1-simple-dag.md b/docs/materials/workflows/part1-ex1-simple-dag.md new file mode 100644 index 0000000..e65710e --- /dev/null +++ b/docs/materials/workflows/part1-ex1-simple-dag.md @@ -0,0 +1,313 @@ +--- +status: testing +--- + + + +# Workflows Exercise 1.1: Coordinating a Set of Jobs With a Simple DAG + +The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job. + +## What is DAGMan? + +In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acylic graph. +For example, you may wish to run a large parameter sweep but before the sweep run you need to prepare your data. After +the sweep runs, you need to collate the results. This might look like this, assuming you want to sweep over five +parameters: + +![simple DAG](files/osgvsp20-workflows-part1-ex1-simple-dag.gif) + +DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can +be found at [in the HTCondor manual](https://htcondor.readthedocs.io/en/latest/automated-workflows/index.html). + +## Submitting a Simple DAG + +For our job, we will return briefly to the `sleep` program, name it `job.sub` + +``` file +executable = /bin/sleep +arguments = 4 +log = simple.log +output = simple.out +error = simple.error +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, +you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does. + +First we will create the most minimal DAG that can be created: a DAG with just one node. Put this into a file named +`simple.dag`. + +``` file +JOB Simple job.sub +``` + +In your first window, submit the DAG: + +``` console +username@ap40 $ condor_submit_dag simple.dag +----------------------------------------------------------------------- +File for submitting this DAG to Condor : simple.dag.condor.sub +Log of DAGMan debugging messages : simple.dag.dagman.out +Log of Condor library output : simple.dag.lib.out +Log of Condor library error messages : simple.dag.lib.err +Log of the life of condor_dagman itself : simple.dag.dagman.log + +Submitting job(s). +1 job(s) submitted to cluster 61. 
+----------------------------------------------------------------------- +``` + +In the second window, check the queue (what you see may be slightly different): + +``` console +username@ap40 $ condor_q -nobatch -wide:80 + +-- Submitter: learn.chtc.wisc.edu : <128.104.100.55:9618?sock=28867_10e4_2> : learn.chtc.wisc.edu + ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD + 61.0 roy 6/21 22:51 0+00:03:47 R 0 0.3 condor_dagman + 62.0 roy 6/21 22:51 0+00:00:03 R 0 0.7 simple 4 10 + +2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended +``` + +In the third window, watch what DAGMan does (what you see may be slightly different): + +``` console +username@ap40 $ tail -f --lines=500 simple.dag.dagman.out +08/02/23 15:44:57 ****************************************************** +08/02/23 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP +08/02/23 15:44:57 ** /usr/bin/condor_dagman +08/02/23 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) +08/02/23 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT +08/02/23 15:44:57 ** $CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ +08/02/23 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ +08/02/23 15:44:57 ** PID = 2340103 +08/02/23 15:44:57 ** Log last touched time unavailable (No such file or directory) +08/02/23 15:44:57 ****************************************************** +08/02/23 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS +08/02/23 15:44:57 DaemonCore: No command port requested. +08/02/23 15:44:57 DAGMAN_USE_STRICT setting: 1 +08/02/23 15:44:57 DAGMAN_VERBOSITY setting: 3 +08/02/23 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 +08/02/23 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False +08/02/23 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 +08/02/23 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 +08/02/23 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False +08/02/23 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 +08/02/23 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False +08/02/23 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 +08/02/23 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 +08/02/23 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 +08/02/23 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True +08/02/23 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 +08/02/23 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True +08/02/23 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False +08/02/23 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 +08/02/23 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 +08/02/23 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 +08/02/23 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 +08/02/23 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 +08/02/23 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True +08/02/23 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False +08/02/23 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False +08/02/23 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False +08/02/23 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit +08/02/23 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True +08/02/23 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False +08/02/23 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True +08/02/23 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True +08/02/23 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 +08/02/23 15:44:57 DAGMAN_AUTO_RESCUE setting: True +08/02/23 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 +08/02/23 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True +08/02/23 15:44:57 
DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log +08/02/23 15:44:57 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True +08/02/23 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 +08/02/23 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 +08/02/23 15:44:57 ALL_DEBUG setting: +08/02/23 15:44:57 DAGMAN_DEBUG setting: +08/02/23 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False +08/02/23 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True +08/02/23 15:44:57 DAGMAN will adjust edges after parsing +08/02/23 15:44:57 argv[0] == "condor_scheduniv_exec.271100.0" +08/02/23 15:44:57 argv[1] == "-Lockfile" +08/02/23 15:44:57 argv[2] == "simple.dag.lock" +08/02/23 15:44:57 argv[3] == "-AutoRescue" +08/02/23 15:44:57 argv[4] == "1" +08/02/23 15:44:57 argv[5] == "-DoRescueFrom" +08/02/23 15:44:57 argv[6] == "0" +08/02/23 15:44:57 argv[7] == "-Dag" +08/02/23 15:44:57 argv[8] == "simple.dag" +08/02/23 15:44:57 argv[9] == "-Suppress_notification" +08/02/23 15:44:57 argv[10] == "-CsdVersion" +08/02/23 15:44:57 argv[11] == "$CondorVersion: 10.7.0 2023-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $" +08/02/23 15:44:57 argv[12] == "-Dagman" +08/02/23 15:44:57 argv[13] == "/usr/bin/condor_dagman" +08/02/23 15:44:57 Default node log file is: +08/02/23 15:44:57 DAG Lockfile will be written to simple.dag.lock +08/02/23 15:44:57 DAG Input file is simple.dag +08/02/23 15:44:57 Parsing 1 dagfiles +08/02/23 15:44:57 Parsing simple.dag ... +08/02/23 15:44:57 Adjusting edges +08/02/23 15:44:57 Dag contains 1 total jobs +08/02/23 15:44:57 Bootstrapping... +08/02/23 15:44:57 Number of pre-completed nodes: 0 +08/02/23 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log +08/02/23 15:44:57 DAG status: 0 (DAG_STATUS_OK) +08/02/23 15:44:57 Of 1 nodes total: +08/02/23 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile +08/02/23 15:44:57 === === === === === === === === +08/02/23 15:44:57 0 0 0 0 1 0 0 0 +08/02/23 15:44:57 0 job proc(s) currently held +08/02/23 15:44:57 Registering condor_event_timer... +08/02/23 15:44:58 Submitting HTCondor Node Simple job(s)... +``` + +**Here's where the job is submitted** + +```file +08/02/23 15:44:58 Submitting HTCondor Node Simple job(s)... +08/02/23 15:44:58 Submitting node Simple from file job.sub using direct job submission +08/02/23 15:44:58 assigned HTCondor ID (271101.0.0) +08/02/23 15:44:58 Just submitted 1 job this cycle... +``` + +**Here's where DAGMan noticed that the job is running** + +```file +08/02/23 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/23 15:45:14} +08/02/23 15:45:18 Number of idle job procs: 0 +``` + +**Here's where DAGMan noticed that the job finished.** + +```file +08/02/23 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/23 15:45:19} +08/02/23 15:45:23 Number of idle job procs: 0 +08/02/23 15:45:23 Node Simple job proc (271101.0.0) completed successfully. +08/02/23 15:45:23 Node Simple job completed +08/02/23 15:45:23 DAG status: 0 (DAG_STATUS_OK) +08/02/23 15:45:23 Of 1 nodes total: +08/02/23 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile +08/02/23 15:45:23 === === === === === === === === +08/02/23 15:45:23 1 0 0 0 0 0 0 0 +``` + +**Here's where DAGMan noticed that all the work is done.** + +```file +08/02/23 15:45:23 All jobs Completed! 
+08/02/23 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) +08/02/23 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) +08/02/23 15:45:23 Note: 0 total job deferrals because of node category throttles +08/02/23 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER +08/02/23 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER +08/02/23 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER +``` + +Now verify your results: + +``` console +username@ap40 $ cat simple.log +000 (271101.000.000) 2023-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> + DAG Node: Simple +... +040 (271101.000.000) 2023-08-02 15:45:13 Started transferring input files + Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> +... +040 (271101.000.000) 2023-08-02 15:45:13 Finished transferring input files +... +021 (271101.000.000) 2023-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: + PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 +... +001 (271101.000.000) 2023-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> + SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu + CondorScratchDir = "/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113" + Cpus = 1 + Disk = 2699079 + GLIDEIN_ResourceName = "PSU-LIGO" + Memory = 1024 +... +006 (271101.000.000) 2023-08-02 15:45:19 Image size of job updated: 2296464 + 47 - MemoryUsage of job (MB) + 47684 - ResidentSetSize of job (KB) +... +040 (271101.000.000) 2023-08-02 15:45:19 Started transferring output files +... +040 (271101.000.000) 2023-08-02 15:45:19 Finished transferring output files +... +005 (271101.000.000) 2023-08-02 15:45:19 Job terminated. + (1) Normal termination (return value 0) + Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage + Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage + 0 - Run Bytes Sent By Job + 38416 - Run Bytes Received By Job + 0 - Total Bytes Sent By Job + 38416 - Total Bytes Received By Job + Partitionable Resources : Usage Request Allocated + Cpus : 1 1 + Disk (KB) : 149 1048576 2699079 + Memory (MB) : 47 1024 1024 + + Job terminated of its own accord at 2023-08-02T20:45:19Z with exit-code 0. +... 
+``` + +Looking at DAGMan's various files, we see that DAGMan itself ran as a job (specifically, a "scheduler" universe job). + +``` console +username@ap40 $ ls simple.dag.* +simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out + +username@ap40 $ cat simple.dag.condor.sub +# Filename: simple.dag.condor.sub +# Generated by condor_submit_dag simple.dag +universe = scheduler +executable = /usr/bin/condor_dagman +getenv = CONDOR_CONFIG,_CONDOR_*,PATH,PYTHONPATH,PERL*,PEGASUS_*,TZ,HOME,USER,LANG,LC_ALL +output = simple.dag.lib.out +error = simple.dag.lib.err +log = simple.dag.dagman.log +remove_kill_sig = SIGUSR1 ++OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)" +# Note: default on_exit_remove expression: +# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) +# attempts to ensure that DAGMan is automatically +# requeued by the schedd if it exits abnormally or +# is killed (e.g., during a reboot). +on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) +copy_to_spool = False +arguments = "-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2023-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman" +environment = "_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad" +queue +``` + +If you want to clean up some of these files (you may not want to, at least not yet), run: + +``` console +username@ap40 $ rm simple.dag.* +``` + +## Challenge + +- What is the scheduler universe? Why does DAGMan use it? + +??? "Show hint" + + 1. HTCondor has several [universes](https://htcondor.readthedocs.io/en/latest/users-manual/choosing-an-htcondor-universe.html) + + 1. What would happen to your DAGMan workflow if the access point has to be rebooted? + + 1. Jobs in the HTCondor queue are "managed" - they are always tracked, and restarted automatically if needed + + diff --git a/docs/materials/workflows/part1-ex2-mandelbrot.md b/docs/materials/workflows/part1-ex2-mandelbrot.md new file mode 100644 index 0000000..49bcfa8 --- /dev/null +++ b/docs/materials/workflows/part1-ex2-mandelbrot.md @@ -0,0 +1,76 @@ +--- +status: testing +--- + + + +Workflows Exercise 1.2: A Brief Detour Through the Mandelbrot Set +============================================================== + +Before we explore using DAGs to implement workflows, let’s get a more interesting job. Let’s make pretty pictures! + +We have a small program that draws pictures of the Mandelbrot set. You can [read about the Mandelbrot set on Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set), or you can simply appreciate the pretty pictures. It’s a fractal. + +We have a simple program that can draw the Mandelbrot set. It's called `goatbrot`. + +Before beginning, ensure that you are connected to `ap40.uw.osg-htc.org`. Create a directory for this exercise and cd into it. + +Running goatbrot From the Command Line +-------------------------------------- + +You can generate the Mandelbrot set as a quick test with two simple commands. + +1. Download the goatbrot executable: + + :::console + username@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/goatbrot + username@ap40 $ chmod a+x goatbrot + + +1. 
Generate a PPM image of the Mandelbrot set:
+
+    :::console
+    username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c 0,0 -w 3 -s 1000,1000
+
+    The `goatbrot` program takes several parameters. Let's break them down:
+
+    - `-i 1000` The number of iterations. Bigger numbers generate more accurate images but are slower to run.
+    - `-o tile_000000_000000.ppm` The output file to generate.
+    - `-c 0,0` The center point of the image. Here it is the point (0,0).
+    - `-w 3` The width of the image. Here it is 3.
+    - `-s 1000,1000` The size of the final image. Here we generate a picture that is 1000 pixels wide and 1000 pixels tall.
+
+1. Convert the image to the JPEG format (using a built-in program called `convert`):
+
+    :::console
+    username@ap40 $ convert tile_000000_000000.ppm mandel.jpg
+
+Dividing the Work into Smaller Pieces
+-------------------------------------
+
+The Mandelbrot set can take a while to create, particularly if you make the iterations large or the image size large. What if we broke the creation of the image into multiple invocations (an HTC approach!) and then stitched them together? Once we do that, we can run each `goatbrot` in parallel in our cluster. Here's an example you can run by hand.
+
+1. Run goatbrot 4 times:
+
+    :::console
+    username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c -0.75,0.75 -w 1.5 -s 500,500
+    username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000001.ppm -c 0.75,0.75 -w 1.5 -s 500,500
+    username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000000.ppm -c -0.75,-0.75 -w 1.5 -s 500,500
+    username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000001.ppm -c 0.75,-0.75 -w 1.5 -s 500,500
+
+1. Stitch the small images together into the complete image (in JPEG format):
+
+    :::console
+    username@ap40 $ montage tile_000000_000000.ppm tile_000000_000001.ppm tile_000001_000000.ppm tile_000001_000001.ppm -mode Concatenate -tile 2x2 mandel.jpg
+
+This will produce the same image as above. We divided the image space into a 2×2 grid and ran `goatbrot` on each section of the grid. The built-in `montage` program stitches the files together and writes out the final image in JPEG format.
+
+View the Image!
+---------------
+
+Run the commands above so that you have the Mandelbrot image.
+When you create the image, you might wonder how you can view it.
+Use `scp` or `sftp` to copy the `mandel.jpg` back to your computer to view it.
+
+
+
diff --git a/docs/materials/workflows/part1-ex3-complex-dag.md b/docs/materials/workflows/part1-ex3-complex-dag.md
new file mode 100644
index 0000000..08df972
--- /dev/null
+++ b/docs/materials/workflows/part1-ex3-complex-dag.md
@@ -0,0 +1,211 @@
+---
+status: testing
+---
+
+
+
+Workflows Exercise 1.3: A More Complex DAG
+=======================================
+
+The objective of this exercise is to run a real set of jobs with DAGMan.
+
+Make Your Job Submission Files
+------------------------------
+
+We'll run our `goatbrot` example. If you didn't read about it yet, [please do so now](../part1-ex2-mandelbrot). We are going to make a DAG with four simultaneous jobs (`goatbrot`) and one final node to stitch them together (`montage`). This means we have five jobs. We're going to run `goatbrot` with more iterations (100,000) so each job will take longer to run.
+
+Now create your five jobs. The goatbrot jobs are very similar to each other, but they have slightly different parameters and output files. 
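+
+As an aside (not needed for this exercise): because the four goatbrot jobs differ only in their arguments and output names, DAGMan's `VARS` feature could drive a single, parameterized submit file instead of four copies. A sketch of that idea is below; the file and macro names are chosen just for illustration. For now, though, write out the five submit files as shown in the following sections.
+
+``` file
+# goatbrot-all.sub (hypothetical): one submit file reused by every goatbrot node
+executable = goatbrot
+arguments = -i 100000 -c $(center) -w 1.5 -s 500,500 -o tile_$(tile).ppm
+log = goatbrot.log
+output = goatbrot.out.$(tile)
+error = goatbrot.err.$(tile)
+request_memory = 1GB
+request_disk = 1GB
+request_cpus = 1
+queue
+
+# and in the DAG file, each node would set its own macro values, e.g.:
+#   JOB g1 goatbrot-all.sub
+#   VARS g1 center="-0.75,0.75" tile="0_0"
+```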
+ +### goatbrot1.sub + +``` file +executable = goatbrot +arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm +log = goatbrot.log +output = goatbrot.out.0.0 +error = goatbrot.err.0.0 +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +### goatbrot2.sub + +``` file +executable = goatbrot +arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm +log = goatbrot.log +output = goatbrot.out.0.1 +error = goatbrot.err.0.1 +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +### goatbrot3.sub + +``` file +executable = goatbrot +arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm +log = goatbrot.log +output = goatbrot.out.1.0 +error = goatbrot.err.1.0 +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +### goatbrot4.sub + +``` file +executable = goatbrot +arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm +log = goatbrot.log +output = goatbrot.out.1.1 +error = goatbrot.err.1.1 +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +### montage.sub + +You should notice that the `transfer_input_files` statement refers to the files created by the other jobs. + +``` file +executable = /usr/bin/montage +arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandel-from-dag.jpg +transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm +output = montage.out +error = montage.err +log = montage.log +requirements = OSG_OS_STRING == "RHEL 8" +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +Make your DAG +------------- + +In a file called `goatbrot.dag`, you have your DAG specification: + +``` file +JOB g1 goatbrot1.sub +JOB g2 goatbrot2.sub +JOB g3 goatbrot3.sub +JOB g4 goatbrot4.sub +JOB montage montage.sub +PARENT g1 g2 g3 g4 CHILD montage +``` + +Ask yourself: do you know how we ensure that all the `goatbrot` commands can run simultaneously and all of them will complete before we run the montage job? + +Running the DAG +--------------- + +Submit your DAG: + +``` console +username@learn $ condor_submit_dag goatbrot.dag +----------------------------------------------------------------------- +File for submitting this DAG to Condor : goatbrot.dag.condor.sub +Log of DAGMan debugging messages : goatbrot.dag.dagman.out +Log of Condor library output : goatbrot.dag.lib.out +Log of Condor library error messages : goatbrot.dag.lib.err +Log of the life of condor_dagman itself : goatbrot.dag.dagman.log + +Submitting job(s). +1 job(s) submitted to cluster 71. + +----------------------------------------------------------------------- +``` + +Watch Your DAG +-------------- + +Let’s follow the progress of the whole DAG: + +1. Use the `condor_watch_q` command to keep an eye on the running jobs. See more information about this tool [here](https://htcondor.readthedocs.io/en/latest/man-pages/condor_watch_q.html). + + :::console + username@learn $ condor_watch_q + + **If you're quick enough, you may have seen DAGMan running as the lone job, before it submitted additional job nodes:** + + :::console + BATCH IDLE RUN DONE TOTAL JOB_IDS + goatbrot.dag+222059 - 1 - 1 222059.0 + + [=============================================================================] + + Total: 1 jobs; 1 running + + Updated at 2021-07-28 13:52:57 + + **DAGMan has submitted the goatbrot jobs, but they haven't started running yet** + + :::console + BATCH IDLE RUN DONE TOTAL JOB_IDS + goatbrot.dag+222059 4 1 - 5 222059.0 ... 
222063.0 + + [===============--------------------------------------------------------------] + + Total: 5 jobs; 4 idle, 1 running + + Updated at 2021-07-28 13:53:53 + + + **They're running** + + :::console + BATCH IDLE RUN DONE TOTAL JOB_IDS + goatbrot.dag+222059 - 5 - 5 222059.0 ... 222063.0 + [=============================================================================] + + Total: 5 jobs; 5 running + + Updated at 2021-07-28 13:54:33 + + **They finished, but DAGMan hasn't noticed yet. It only checks periodically:** + + :::console + BATCH IDLE RUN DONE TOTAL JOB_IDS + goatbrot.dag+222059 - 1 4 - 5 222059.0 ... 222063.0 + + [##############################################################===============] + + Total: 5 jobs; 4 completed, 1 running + + Updated at 2021-07-28 13:55:13 + + Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. + +1. Examine your results. For some reason, goatbrot prints everything to stderr, not stdout. + + :::console + username@learn $ cat goatbrot.err.0.0 + Complex image: Center: -0.75 + 0.75i Width: 1.5 Height: 1.5 Upper Left: -1.5 + 1.5i Lower Right: 0 + 0i + + Output image: Filename: tile_0_0.ppm Width, Height: 500, 500 Theme: beej Antialiased: no + + Mandelbrot: Max Iterations: 100000 Continuous: no + + Goatbrot: Multithreading: not supported in this build + + Completed: 100.0% + +1. Examine your log files (`goatbrot.log` and `montage.log`) and DAGMan output file (`goatbrot.dag.dagman.out`). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file? +1. As you did earlier, transfer the resulting `mandel-from-dag.jpg` to your computer so that you can view the image. Does the image look correct? +1. Clean up your results by removing all of the `goatbrot.dag.*` files if you like. Be careful to not delete the `goatbrot.dag` file. + +Bonus Challenge +--------------- + +- Re-run your DAG. When jobs are running, try `condor_q -nobatch -dag`. What does it do differently? +- Challenge, if you have time: Make a bigger DAG by making more tiles in the same area. diff --git a/docs/materials/workflows/part1-ex4-failed-dag.md b/docs/materials/workflows/part1-ex4-failed-dag.md new file mode 100644 index 0000000..0011d12 --- /dev/null +++ b/docs/materials/workflows/part1-ex4-failed-dag.md @@ -0,0 +1,252 @@ +--- +status: testing +--- + +Workflows Exercise 1.4: Handling a DAG That Fails +========================= + +The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures. + +Background +---------- + +DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed. + +Breaking Things +--------------- + +Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. Work in the same directory where you did the last DAG. Edit montage.sub to add a `-h` to the arguments. 
It will look like this, with the -h at the beginning of the highlighted line: + +```hl_lines="2" +executable = /usr/bin/montage +arguments = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg +transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm +output = montage.out +error = montage.err +log = montage.log +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +Submit the DAG again: + +``` console +username@learn $ condor_submit_dag goatbrot.dag +----------------------------------------------------------------------- +File for submitting this DAG to Condor : goatbrot.dag.condor.sub +Log of DAGMan debugging messages : goatbrot.dag.dagman.out +Log of Condor library output : goatbrot.dag.lib.out +Log of Condor library error messages : goatbrot.dag.lib.err +Log of the life of condor_dagman itself : goatbrot.dag.dagman.log + +Submitting job(s). +1 job(s) submitted to cluster 77. +----------------------------------------------------------------------- +``` + +Use watch to watch the jobs until they finish. In a separate window, use `tail --lines=500 -f goatbrot.dag.dagman.out` to watch what DAGMan does. + +``` console +06/22/12 17:57:41 Setting maximum accepts per cycle 8. +06/22/12 17:57:41 ****************************************************** +06/22/12 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP +06/22/12 17:57:41 ** /usr/bin/condor_dagman +06/22/12 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) +06/22/12 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON +06/22/12 17:57:41 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $ +06/22/12 17:57:41 ** $CondorPlatform: x86_64_rhap_5.7 $ +06/22/12 17:57:41 ** PID = 26867 +06/22/12 17:57:41 ** Log last touched time unavailable (No such file or directory) +06/22/12 17:57:41 ****************************************************** +06/22/12 17:57:41 Using config source: /etc/condor/condor_config +06/22/12 17:57:41 Using local config sources: +06/22/12 17:57:41 /etc/condor/config.d/00-chtc-global.conf +06/22/12 17:57:41 /etc/condor/config.d/01-chtc-submit.conf +06/22/12 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf +06/22/12 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf +06/22/12 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf +06/22/12 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf +06/22/12 17:57:41 /etc/condor/config.d/99-roy-extras.conf +06/22/12 17:57:41 /etc/condor/condor_config.local +``` +Below is where DAGMan realizes that the montage node failed: + +```console +06/22/12 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) +06/22/12 18:08:42 Number of idle job procs: 0 +06/22/12 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) +06/22/12 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) +06/22/12 18:08:42 Node montage job proc (82.0.0) failed with status 1. +06/22/12 18:08:42 Number of idle job procs: 0 +06/22/12 18:08:42 Of 5 nodes total: +06/22/12 18:08:42 Done Pre Queued Post Ready Un-Ready Failed +06/22/12 18:08:42 === === === === === === === +06/22/12 18:08:42 4 0 0 0 0 0 1 +06/22/12 18:08:42 0 job proc(s) currently held +06/22/12 18:08:42 Aborting DAG... +06/22/12 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 
+06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) +06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) +06/22/12 18:08:42 Note: 0 total job deferrals because of node category throttles +06/22/12 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) +06/22/12 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) +06/22/12 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 +``` + +DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? + +Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. + +``` console +username@learn $ cat goatbrot.dag.rescue001 +# Rescue DAG file, created after running +# the goatbrot.dag DAG file +# Created 6/22/2012 23:08:42 UTC +# Rescue DAG version: 2.0.1 (partial) +# +# Total number of Nodes: 5 +# Nodes premarked DONE: 4 +# Nodes that failed: 1 +# montage, + +DONE g1 +DONE g2 +DONE g3 +DONE g4 +``` + +From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending `-h` argument. Change montage.sub to look like: + +``` file +executable = /usr/bin/montage +arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg +transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm +output = montage.out +error = montage.err +log = montage.log +request_memory = 1GB +request_disk = 1GB +request_cpus = 1 +queue +``` + +Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. + +``` console +username@learn $ condor_submit_dag goatbrot.dag +Running rescue DAG 1 +----------------------------------------------------------------------- +File for submitting this DAG to Condor : goatbrot.dag.condor.sub +Log of DAGMan debugging messages : goatbrot.dag.dagman.out +Log of Condor library output : goatbrot.dag.lib.out +Log of Condor library error messages : goatbrot.dag.lib.err +Log of the life of condor_dagman itself : goatbrot.dag.dagman.log + +Submitting job(s). +1 job(s) submitted to cluster 83. +----------------------------------------------------------------------- + +username@learn $ tail -f goatbrot.dag.dagman.out +06/23/12 11:30:53 ****************************************************** +06/23/12 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP +06/23/12 11:30:53 ** /usr/bin/condor_dagman +06/23/12 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) +06/23/12 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON +06/23/12 11:30:53 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $ +06/23/12 11:30:53 ** $CondorPlatform: x86_64_rhap_5.7 $ +06/23/12 11:30:53 ** PID = 28576 +06/23/12 11:30:53 ** Log last touched 6/22 18:08:42 +06/23/12 11:30:53 ****************************************************** +06/23/12 11:30:53 Using config source: /etc/condor/condor_config +... +``` + +**Here is where DAGMAN notices that there is a rescue DAG** + +```hl_lines="3" +06/23/12 11:30:53 Parsing 1 dagfiles +06/23/12 11:30:53 Parsing goatbrot.dag ... 
+06/23/12 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file +06/23/12 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +06/23/12 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 +06/23/12 11:30:53 Dag contains 5 total jobs +``` + +**Shortly thereafter it sees that four jobs have already finished.** + +```console +06/23/12 11:31:05 Bootstrapping... +06/23/12 11:31:05 Number of pre-completed nodes: 4 +06/23/12 11:31:05 Registering condor_event_timer... +06/23/12 11:31:06 Sleeping for one second for log file consistency +06/23/12 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log +``` + +**Here is where DAGMan resubmits the montage job and waits for it to complete.** + +```console +06/23/12 11:31:07 Submitting Condor Node montage job(s)... +06/23/12 11:31:07 submitting: condor_submit + -a dag_node_name' '=' 'montage + -a +DAGManJobId' '=' '83 + -a DAGManJobId' '=' '83 + -a submit_event_notes' '=' 'DAG' 'Node:' 'montage + -a DAG_STATUS' '=' '0 + -a FAILED_COUNT' '=' '0 + -a +DAGParentNodeNames' '=' '"g1,g2,g3,g4" + montage.sub +06/23/12 11:31:07 From submit: Submitting job(s). +06/23/12 11:31:07 From submit: 1 job(s) submitted to cluster 84. +06/23/12 11:31:07 assigned Condor ID (84.0.0) +06/23/12 11:31:07 Just submitted 1 job this cycle... +06/23/12 11:31:07 Currently monitoring 1 Condor log file(s) +06/23/12 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) +06/23/12 11:31:07 Number of idle job procs: 1 +06/23/12 11:31:07 Of 5 nodes total: +06/23/12 11:31:07 Done Pre Queued Post Ready Un-Ready Failed +06/23/12 11:31:07 === === === === === === === +06/23/12 11:31:07 4 0 1 0 0 0 0 +06/23/12 11:31:07 0 job proc(s) currently held +06/23/12 11:40:22 Currently monitoring 1 Condor log file(s) +06/23/12 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) +06/23/12 11:40:22 Number of idle job procs: 0 +06/23/12 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) +06/23/12 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) +``` + +**This is where the montage finished.** + +```console +06/23/12 11:40:22 Node montage job proc (84.0.0) completed successfully. +06/23/12 11:40:22 Node montage job completed +06/23/12 11:40:22 Number of idle job procs: 0 +06/23/12 11:40:22 Of 5 nodes total: +06/23/12 11:40:22 Done Pre Queued Post Ready Un-Ready Failed +06/23/12 11:40:22 === === === === === === === +06/23/12 11:40:22 5 0 0 0 0 0 0 +06/23/12 11:40:22 0 job proc(s) currently held +``` + +**And here DAGMan decides that the work is all done.** + +```console +06/23/12 11:40:22 All jobs Completed! +06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) +06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) +06/23/12 11:40:22 Note: 0 total job deferrals because of node category throttles +06/23/12 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) +06/23/12 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) +06/23/12 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 +``` + +Success! Now go ahead and clean up. + +Bonus Challenge +--------------- + +If you have time, add an extra node to the DAG. Copy our original "simple" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. 
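One way to do that is with a POST script: when a node has a POST script, DAGMan decides whether the node succeeded from the POST script's exit code rather than from the job's. Below is a minimal sketch, not a full solution; the node name `simple`, the submit file name `simple.sub`, the script name `post-check.sh`, and the choice to treat a job exit code of 1 as success are all assumptions for illustration.

``` file
# Added to the DAG file: a new node plus a POST script for it.
# DAGMan's $RETURN macro passes the node job's exit code to the script.
JOB simple simple.sub
SCRIPT POST simple post-check.sh $RETURN
```

``` file
#!/bin/sh
# post-check.sh: DAGMan marks the node as successful only if this
# script exits 0, so map the job's "success" code (1) to 0 here.
if [ "$1" -eq 1 ]; then
    exit 0
fi
exit 1
```

Remember to make the script executable (for example, `chmod 0755 post-check.sh`) before submitting the DAG.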
Treating a job exit code of 1 as a success is reasonable: many real-world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure.
+
+Write a POST script that checks the return value (the sketch above is one possible starting point). Check [the HTCondor manual](https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-scripts.html#pre-and-post-scripts) to see how to describe your POST script.
+
diff --git a/docs/materials/workflows/part1-ex5-challenges.md b/docs/materials/workflows/part1-ex5-challenges.md new file mode 100644 index 0000000..16512bd --- /dev/null +++ b/docs/materials/workflows/part1-ex5-challenges.md @@ -0,0 +1,51 @@
+---
+status: testing
+---
+
+
+
+# Bonus Workflows Exercise 1.5: YOUR Jobs and More on Workflows
+
+The objective of this exercise is to go beyond the guided exercises: run some of your own work, generate more Mandelbrot images, and explore other workflow tools.
+
+Challenge 1
+-----------
+
+Do you have any extra computation that needs to be done? Real work, from your life outside this summer school? If so, try it out on our HTCondor pool. Can't think of something? How about one of the existing distributed computing programs like [distributed.net](http://www.distributed.net), [SETI@home](http://setiathome.ssl.berkeley.edu/), [Einstein@Home](http://www.einsteinathome.org/), or others that you know? We prefer that you do your own work rather than one of these projects, but they are options.
+
+Challenge 2
+-----------
+
+Try to generate other Mandelbrot images. Some possible locations to look at with goatbrot:
+
+``` console
+goatbrot -i 1000 -o ex1.ppm -c 0.0016437219722,-0.8224676332988 -w 2e-11 -s 1000,1000
+goatbrot -i 1000 -o ex2.ppm -c 0.3958608398437499,-0.13431445312500012 -w 0.0002197265625 -s 1000,1000
+goatbrot -i 1000 -o ex3.ppm -c 0.3965859374999999,-0.13378125000000013 -w 0.003515625 -s 1000,1000
+```
+
+You can convert ppm files with `convert`, like so:
+
+``` console
+convert ex1.ppm ex1.jpg
+```
+
+Now make a movie! Make a series of images where you zoom into a point in the Mandelbrot set gradually. (Those points above may work well.) Assemble these images with the `convert` tool, which can turn a set of JPEG files into an MPEG movie. One possible way to script the zoom frames is sketched at the end of this page.
+
+Challenge 3
+-----------
+
+Try out Pegasus. Pegasus is a workflow manager that uses DAGMan and can work in a grid environment or across different types of clusters (with other queueing software). It creates the DAGs from abstract DAG descriptions and ensures they are appropriate for the location of the data and computation.
+
+Links to more information:
+
+- [Pegasus Website](https://pegasus.isi.edu)
+- [Pegasus Documentation](https://pegasus.isi.edu/documentation)
+- [Pegasus on OSG](https://portal.osg-htc.org/documentation/htc_workloads/automated_workflows/tutorial-pegasus/)
+
+If you have any questions or problems, please feel free to contact the Pegasus team by email.
+
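Returning to Challenge 2 above: here is one possible way to generate the series of zoom frames for a movie. This is only a sketch, and it assumes that `goatbrot` and ImageMagick's `convert` are available wherever you run it; the center point, frame count, image size, and iteration count are arbitrary choices, and the script simply halves the view width on every frame. For the School, a natural follow-up is to turn each `goatbrot` command into its own HTCondor job (for example, nodes in a DAG) rather than running them in a loop.

``` file
#!/bin/bash
# make-frames.sh: generate Mandelbrot frames that zoom in on one point
# by halving the view width on every frame.
center="0.0016437219722,-0.8224676332988"   # one of the points from Challenge 2
width="1.5"                                  # starting view width
for frame in $(seq -w 1 20); do
    goatbrot -i 1000 -o frame_${frame}.ppm -c ${center} -w ${width} -s 500,500
    convert frame_${frame}.ppm frame_${frame}.jpg
    width=$(echo "${width} / 2" | bc -l)      # zoom in by a factor of 2 each frame
done
```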