Merge pull request #6 from osg-htc/copy-materials
Copy over materials
---
status: testing
---

# Self-Checkpointing Exercise 1.1: Trying It Out

The goal of this exercise is to practice writing a submit file for self-checkpointing
and to see the process in action.

## Calculating Fibonacci numbers … slowly

The sample code for this exercise calculates
[the Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number)
resulting from a given number of iterations.
Because this is a trivial computation,
the code includes a delay in each iteration through the main loop;
this simulates a more intensive computation.

To get set up:
1. Log in to `ap40.uw.osg-htc.org`
   (`ap1` is fine, too)

1. Create and change into a new directory for this exercise

1. Download the Python script that is the main executable for this exercise:

        :::console
        user@server $ wget https://raw.githubusercontent.com/osg-htc/user-school-2022/main/src/checkpointing/fibonacci.py

1. If you want to run the script directly, make it executable first:

        :::console
        user@server $ chmod 0755 fibonacci.py
Take a look at the code, if you like.
It is not very elegant, but it gets the job done.

A few notes:

* The script takes a single argument, the number of iterations to run.
  To minimize computing time while leaving time to explore, `10` is a good number of iterations.

* The script checkpoints every other iteration through the main loop.
  The exit status code for a checkpoint is 85.

* It prints some output to standard out along the way, to let you know what is going on.

* The final result is written to a separate file named `fibonacci.result`.
  This file does not exist until the very end of the complete run.

* It is safe to run from the command line on an access point:

        :::console
        user@server $ ./fibonacci.py 10

If you run it, what happens? (Due to the 30-second delay in each iteration, be patient.)
Can you explain its behavior?
What happens if you run it again, without changing any files in between? Why?
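The notes above describe the pattern without showing it. The sketch below is *not* the real `fibonacci.py` (download that with `wget` as shown); it only illustrates the exit-driven self-checkpointing loop: resume from a saved state file if one exists, exit with code 85 every other iteration after saving state, and write `fibonacci.result` only at the very end. The checkpoint filename `fibonacci.ckpt` and the state layout are illustrative assumptions.

```python
#!/usr/bin/env python3
# Sketch of exit-driven self-checkpointing (NOT the real fibonacci.py).
# State file name and layout are assumptions for illustration.
import json
import os
import sys

CHECKPOINT = "fibonacci.ckpt"  # assumed name; the real script may differ

def load_state():
    """Resume from the checkpoint file, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"i": 0, "a": 0, "b": 1}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def main(iterations):
    state = load_state()
    while state["i"] < iterations:
        state["a"], state["b"] = state["b"], state["a"] + state["b"]
        state["i"] += 1
        # (the real script sleeps ~30 seconds here to simulate heavy work)
        if state["i"] % 2 == 0 and state["i"] < iterations:
            save_state(state)
            sys.exit(85)  # tell HTCondor: this is a checkpoint, not a failure
    with open("fibonacci.result", "w") as f:
        f.write(f"{state['a']}\n")
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # so a later run starts over from scratch

if __name__ == "__main__" and len(sys.argv) > 1:
    main(int(sys.argv[1]))
```

Each `sys.exit(85)` is the moment HTCondor can save the state file and restart the script, which then picks up where `load_state()` left off. This also explains what you see on the command line: each invocation advances two iterations and exits, until the final one writes the result.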
## Preparing to run

Now you have an executable and you know how to run it.
It is time to prepare it for submission to HTCondor!

Using what you know about the script (above)
and the information in the slides from today,
try writing a submit file that runs this software and
implements exit-driven self-checkpointing.
The Python code itself is ready and should not need any changes.

Just use a plain `queue` statement; one job is enough to experiment on.

**Before you submit,** read the next section first!
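If you get stuck, the sketch below shows one possible shape for such a submit file, assembled from standard HTCondor submit commands (`checkpoint_exit_code` and `transfer_checkpoint_files` are documented in the HTCondor Manual). Do try writing your own first. The checkpoint filename here is an assumption: name whatever state file `fibonacci.py` actually writes (look inside the script), and the resource requests are guesses that should be plenty for this tiny job.

```
# Sketch only -- try writing your own submit file first!
executable              = fibonacci.py
arguments               = 10

# exit code 85 means "I checkpointed"; HTCondor saves files and restarts
checkpoint_exit_code    = 85
# assumption: replace with the state file the script really writes
transfer_checkpoint_files = fibonacci.ckpt

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output = fibonacci.out
error  = fibonacci.err
log    = fibonacci.log

request_cpus   = 1
request_memory = 512MB
request_disk   = 512MB

queue
```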
## Running and monitoring

With the 30-second delay per iteration in the code and the suggested 10 iterations,
once the script starts running you have about 5 minutes of runtime in which to see what is going on.
So it may help to read through this section first, *then* return here and submit your job.

If your job has problems or finishes before you have the chance to do all the steps below,
just remove the extra files (besides the Python script and your submit file) and try again!

### Submission and first checkpoint

1. Submit the job
1. Look at the contents of the submit directory — what changed?
1. Start watching the log file: `tail -n 100 -f YOUR-LOG-FILENAME.log`

Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically.
Thus, nothing much will happen until HTCondor starts running your job.
When it does, you will quickly see three sets of messages in the log file:

* `Started transferring input files`
* `Finished transferring input files`
* `Job executing on host:`

(Of course, each message will contain a lot of other characters!)

Now wait about 1 minute, and you should see two more messages appear:

* `Started transferring output files`
* `Finished transferring output files`

That is the first checkpoint happening!
### Forcing your job to stop running

Now, assuming that your job is still running (check `condor_q` again),
you can force HTCondor to remove (*evict*) your job before it finishes:

1. Run `condor_q` to get the job ID of the running job
1. Run `condor_vacate_job JOB_ID`, replacing `JOB_ID` with your job ID from above
1. Monitor the action again by running `tail -n 100 -f YOUR-LOG-FILENAME.log`

### Finishing the job and wrap-up

Be patient again!
You removed your running job, and so HTCondor put it back in the queue as idle.
If you wait a minute or two, you should see HTCondor start running the job again.

1. In the log file, look carefully for the two `Job executing on host:` messages.
   Does it seem like your job ran on the same computer again or on a different one?
   Both are possible!

1. Let your job finish running this time.
   There should be a `Job terminated of its own accord` message near the end.

1. Did you get results? Go through all the files and see what they contain.
   The log and output files are probably the most interesting.
   But did you get a result file, too?

Did the output file —
that is, whatever file you named in the `output` line of your submit file —
contain *everything* that you expected it to?

## Conclusion

This has been a brief and simple tour of self-checkpointing.
If you would like to learn more, please read
[the Self-Checkpointing Applications section](https://htcondor.readthedocs.io/en/latest/users-manual/self-checkpointing-applications.html)
of the HTCondor Manual.
Or talk to School staff about it,
or contact [email protected] for further help at any time.
---
status: testing
---

Data Exercise 1.1: Understanding Data Requirements
==================================================

Exercise Goal
-------------

This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a
large batch of jobs or using tools for delivering large data to jobs.
In this exercise we will attempt to understand the input and output of the bioinformatics application
[BLAST](http://blast.ncbi.nlm.nih.gov/).

Setup
-----

For this exercise, we will use the `ap40.uw.osg-htc.org` access point. Log in:

``` hl_lines="1"
$ ssh <USERNAME>@ap40.uw.osg-htc.org
```

Create a directory for this exercise named `blast-data` and change into it.
### Copy the Input Files ###

To run BLAST, we need the executable, input file, and reference
database. For this example, we'll use the "pdbaa" database, which
contains sequences for protein structures from the Protein Data Bank.
For our input file, we'll use an abbreviated FASTA file with mouse
genome information.

1. Copy the BLAST executables:

        :::console
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.12.0+-x64-linux.tar.gz
        user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz

1. Download these files to your current directory:

        :::console
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/pdbaa.tar.gz
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse.fa

1. Untar the `pdbaa` database:

        :::console
        user@ap40 $ tar -xzvf pdbaa.tar.gz
Understanding BLAST
-------------------

Remember that `blastx` is executed with a command like the following:

``` console
user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db <DATABASE ROOTNAME> -query <INPUT FILE> -out <RESULTS FILE>
```

In the above, `<INPUT FILE>` is the name of a file containing a number of genetic sequences (e.g. `mouse.fa`), and
the database that these are compared against is made up of several files that begin with the same `<DATABASE ROOTNAME>`
(e.g. `pdbaa/pdbaa`).
The output from this analysis will be written to the `<RESULTS FILE>` named in the command.
Calculating Data Needs
----------------------

Using the files that you prepared in `blast-data`, we will calculate how much disk space is needed to
run a hypothetical BLAST job with a wrapper script, where the job:

- Transfers all of its input files (including the executable) as tarballs
- Untars the input file tarballs on the execute host
- Runs `blastx` using the untarred input files

Here are some commands that will be useful for calculating your job's storage needs:

- List the size of a specific file:

        :::console
        user@ap40 $ ls -lh <FILE NAME>

- List the sizes of all files in the current directory:

        :::console
        user@ap40 $ ls -lh

- Sum the size of all files in a specific directory:

        :::console
        user@ap40 $ du -sh <DIRECTORY>
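If you would rather script the totals than add up `ls` output by hand, the same arithmetic can be done with Python's standard library. This is an optional sketch, not part of the exercise; the names at the bottom assume you run it from your `blast-data` directory after the downloads above.

```python
# Optional sketch: total file sizes the way `du -s` does, using only the
# standard library. The item names below are from this exercise.
import os

def total_bytes(path):
    """Return the size of a file, or the summed size of a directory tree."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

if __name__ == "__main__":
    for item in ("pdbaa.tar.gz", "mouse.fa", "pdbaa"):
        if os.path.exists(item):
            print(f"{item}: {total_bytes(item) / 1e6:.1f} MB")
```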
### Input requirements

Total up the amount of data in all of the files necessary to run the `blastx` wrapper job, including the executable itself.
Write down this number.
Also take note of how much total data is in the `pdbaa` directory.

!!! note "Compressed Files"
    Remember, `blastx` reads the *uncompressed* `pdbaa` files.

### Output requirements

The output that we care about from `blastx` is saved in the file whose name is given after the `-out` argument to
`blastx`.
Remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too.
Are there any other files?
Total all of these together, as well.
<!--
#removed for 2020 Virtual school since (I assume) we won't have a group discussion forum

Talk about this as a group!
---------------------------

Once you have completed the above tasks, we'll talk about the totals as a group.

- How much disk space is required on the submit server for one blastx run with the input files you used before?
  (Input data)
- How much disk space is required on the worker node? (uncompressed + output data)
- How *many* files are needed and created for each run? (Output data)
- How much total disk space would be necessary on the submit server to run 10 jobs?
  (remember that some of the files will be shared by all 10 jobs, and will not be multiplied)

Answers
-------

- Submit server: Only compressed files needed. Don't need uncompressed files on the submit server.
    - pdbaa.tar.gz: 22MB
    - blastx.tar.gz: 14MB
    - mouse.fa.tar.gz: 104K
    - Total: ~36MB
- Worker node: Compressed files + uncompressed files
    - pdbaa: 97MB
    - blastx: 39MB
    - mouse.fa: 389KB
    - results: 11MB
    - stdout: 0
    - stderr: 0
    - Compressed files: ~36MB
    - Total: ~183MB
- How many files are needed and created for each run?
    - files in pdbaa: 12
    - blastx: 1
    - mouse.fa: 1
    - results: 1
    - stdout + stderr: 2
    - total: 17
- Submit server with 10 jobs
    - Only need multiple queries, because that is what is different.
    - So pdbaa (22MB) + blastx (14MB) + 10 * mouse.fa (104K) = ~37MB
-->

<!--
## Removed 2019, not sure how users are supposed to reasonably get this info

- Assuming that each file is read completely by BLAST, and since you know how long blastx runs (time it):
    - At what rate are files read in?
    - How many MB/s?
- Rates:
    - my run, and this can vary: 198 seconds
    - 17 / 198 = 0.086 files per second (low)
    - 149 / 198 = 0.75 MB per second
-->
Up next!
--------

Next you will create an HTCondor submit file to transfer the BLAST input files in order to run BLAST on a worker node.

[Next Exercise](../part1-ex2-file-transfer)