Merge pull request #6 from osg-htc/copy-materials
Copy over materials
---
status: testing
---

# Self-Checkpointing Exercise 1.1: Trying It Out

The goal of this exercise is to practice writing a submit file for self-checkpointing
and to see the process in action.

## Calculating Fibonacci numbers … slowly

The sample code for this exercise calculates
[the Fibonacci number](https://en.wikipedia.org/wiki/Fibonacci_number)
resulting from a given number of iterations.
Because this is a trivial computation,
the code includes a delay in each iteration through the main loop;
this simulates a more intensive computation.

To get set up:
1. Log in to `ap40.uw.osg-htc.org`
   (`ap1` is fine, too)

1. Create and change into a new directory for this exercise

1. Download the Python script that is the main executable for this exercise:

        :::console
        user@server $ wget https://raw.githubusercontent.com/osg-htc/user-school-2022/main/src/checkpointing/fibonacci.py

1. If you want to run the script directly, make it executable first:

        :::console
        user@server $ chmod 0755 fibonacci.py
Take a look at the code, if you like.
It is not very elegant, but it gets the job done.

A few notes:

* The script takes a single argument, the number of iterations to run.
  To minimize computing time while leaving time to explore, `10` is a good number of iterations.

* The script checkpoints every other iteration through the main loop.
  The exit status code for a checkpoint is 85.

* It prints some output to standard out along the way, to let you know what is going on.

* The final result is written to a separate file named `fibonacci.result`.
  This file does not exist until the very end of the complete run.

* It is safe to run from the command line on an access point:

        :::console
        user@server $ ./fibonacci.py 10

If you run it, what happens? (Due to the 30-second delay in each iteration, be patient.)
Can you explain its behavior?
What happens if you run it again, without changing any files in between? Why?
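The notes above describe the pattern without showing it. The sketch below is *not* the real `fibonacci.py` (download that with `wget` as shown); it only illustrates the exit-driven self-checkpointing loop: resume from a saved state file if one exists, exit with code 85 every other iteration after saving state, and write `fibonacci.result` only at the very end. The checkpoint filename `fibonacci.ckpt` and the state layout are illustrative assumptions.

```python
#!/usr/bin/env python3
# Sketch of exit-driven self-checkpointing (NOT the real fibonacci.py).
# State file name and layout are assumptions for illustration.
import json
import os
import sys

CHECKPOINT = "fibonacci.ckpt"  # assumed name; the real script may differ

def load_state():
    """Resume from the checkpoint file, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"i": 0, "a": 0, "b": 1}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def main(iterations):
    state = load_state()
    while state["i"] < iterations:
        state["a"], state["b"] = state["b"], state["a"] + state["b"]
        state["i"] += 1
        # (the real script sleeps ~30 seconds here to simulate heavy work)
        if state["i"] % 2 == 0 and state["i"] < iterations:
            save_state(state)
            sys.exit(85)  # tell HTCondor: this is a checkpoint, not a failure
    with open("fibonacci.result", "w") as f:
        f.write(f"{state['a']}\n")
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # so a later run starts over from scratch

if __name__ == "__main__" and len(sys.argv) > 1:
    main(int(sys.argv[1]))
```

Each `sys.exit(85)` is the moment HTCondor can save the state file and restart the script, which then picks up where `load_state()` left off. This also explains what you see on the command line: each invocation advances two iterations and exits, until the final one writes the result.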
## Preparing to run

Now you have an executable and you know how to run it.
It is time to prepare it for submission to HTCondor!

Using what you know about the script (above)
and the information in the slides from today,
try writing a submit file that runs this software and
implements exit-driven self-checkpointing.
The Python code itself is ready and should not need any changes.

Just use a plain `queue` statement; one job is enough to experiment on.

**Before you submit,** read the next section first!
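If you get stuck, the sketch below shows one possible shape for such a submit file, assembled from standard HTCondor submit commands (`checkpoint_exit_code` and `transfer_checkpoint_files` are documented in the HTCondor Manual). Do try writing your own first. The checkpoint filename here is an assumption: name whatever state file `fibonacci.py` actually writes (look inside the script), and the resource requests are guesses that should be plenty for this tiny job.

```
# Sketch only -- try writing your own submit file first!
executable              = fibonacci.py
arguments               = 10

# exit code 85 means "I checkpointed"; HTCondor saves files and restarts
checkpoint_exit_code    = 85
# assumption: replace with the state file the script really writes
transfer_checkpoint_files = fibonacci.ckpt

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output = fibonacci.out
error  = fibonacci.err
log    = fibonacci.log

request_cpus   = 1
request_memory = 512MB
request_disk   = 512MB

queue
```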
## Running and monitoring

With the 30-second delay per iteration in the code and the suggested 10 iterations,
once the script starts running you have about 5 minutes of runtime in which to see what is going on.
So it may help to read through this section first, *then* return here and submit your job.

If your job has problems or finishes before you have the chance to do all the steps below,
just remove the extra files (besides the Python script and your submit file) and try again!

### Submission and first checkpoint

1. Submit the job
1. Look at the contents of the submit directory — what changed?
1. Start watching the log file: `tail -n 100 -f YOUR-LOG-FILENAME.log`

Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically.
Thus, nothing much will happen until HTCondor starts running your job.
When it does, you will quickly see three sets of messages in the log file:

* `Started transferring input files`
* `Finished transferring input files`
* `Job executing on host:`

(Of course, each message will contain a lot of other characters!)

Now wait about 1 minute, and you should see two more messages appear:

* `Started transferring output files`
* `Finished transferring output files`

That is the first checkpoint happening!
### Forcing your job to stop running

Now, assuming that your job is still running (check `condor_q` again),
you can force HTCondor to remove (*evict*) your job before it finishes:

1. Run `condor_q` to get the job ID of the running job
1. Run `condor_vacate_job JOB_ID`, replacing `JOB_ID` with your job ID from above
1. Monitor the action again by running `tail -n 100 -f YOUR-LOG-FILENAME.log`

### Finishing the job and wrap-up

Be patient again!
You removed your running job, and so HTCondor put it back in the queue as idle.
If you wait a minute or two, you should see HTCondor start running the job again.

1. In the log file, look carefully for the two `Job executing on host:` messages.
   Does it seem like your job ran on the same computer again or on a different one?
   Both are possible!

1. Let your job finish running this time.
   There should be a `Job terminated of its own accord` message near the end.

1. Did you get results? Go through all the files and see what they contain.
   The log and output files are probably the most interesting.
   But did you get a result file, too?

Did the output file —
that is, whatever file you named in the `output` line of your submit file —
contain *everything* that you expected it to?

## Conclusion

This has been a brief and simple tour of self-checkpointing.
If you would like to learn more, please read
[the Self-Checkpointing Applications section](https://htcondor.readthedocs.io/en/latest/users-manual/self-checkpointing-applications.html)
of the HTCondor Manual.
Or talk to School staff about it,
or contact [email protected] for further help at any time.
---
status: testing
---

Data Exercise 1.1: Understanding Data Requirements
==================================================

Exercise Goal
-------------

This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a
large batch of jobs or using tools for delivering large data to jobs.
In this exercise we will attempt to understand the input and output of the bioinformatics application
[BLAST](http://blast.ncbi.nlm.nih.gov/).

Setup
-----

For this exercise, we will use the `ap40.uw.osg-htc.org` access point. Log in:

``` hl_lines="1"
$ ssh <USERNAME>@ap40.uw.osg-htc.org
```

Create a directory for this exercise named `blast-data` and change into it.
### Copy the Input Files ###

To run BLAST, we need the executable, input file, and reference
database. For this example, we'll use the "pdbaa" database, which
contains sequences for protein structures from the Protein Data Bank.
For our input file, we'll use an abbreviated FASTA file with mouse
genome information.

1. Copy the BLAST executables:

        :::console
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.12.0+-x64-linux.tar.gz
        user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz

1. Download these files to your current directory:

        :::console
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/pdbaa.tar.gz
        user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse.fa

1. Untar the `pdbaa` database:

        :::console
        user@ap40 $ tar -xzvf pdbaa.tar.gz
Understanding BLAST
-------------------

Remember that `blastx` is executed with a command like the following:

``` console
user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db <DATABASE ROOTNAME> -query <INPUT FILE> -out <RESULTS FILE>
```

In the above, `<INPUT FILE>` is the name of a file containing a number of genetic sequences (e.g. `mouse.fa`), and
the database that these are compared against is made up of several files that begin with the same `<DATABASE ROOTNAME>`
(e.g. `pdbaa/pdbaa`).
The output from this analysis will be written to the `<RESULTS FILE>` named in the command.
Calculating Data Needs
----------------------

Using the files that you prepared in `blast-data`, we will calculate how much disk space is needed to
run a hypothetical BLAST job with a wrapper script, where the job:

- Transfers all of its input files (including the executable) as tarballs
- Untars the input file tarballs on the execute host
- Runs `blastx` using the untarred input files

Here are some commands that will be useful for calculating your job's storage needs:

- List the size of a specific file:

        :::console
        user@ap40 $ ls -lh <FILE NAME>

- List the sizes of all files in the current directory:

        :::console
        user@ap40 $ ls -lh

- Sum the size of all files in a specific directory:

        :::console
        user@ap40 $ du -sh <DIRECTORY>
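If you would rather script the totals than add up `ls` output by hand, the same arithmetic can be done with Python's standard library. This is an optional sketch, not part of the exercise; the names at the bottom assume you run it from your `blast-data` directory after the downloads above.

```python
# Optional sketch: total file sizes the way `du -s` does, using only the
# standard library. The item names below are from this exercise.
import os

def total_bytes(path):
    """Return the size of a file, or the summed size of a directory tree."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

if __name__ == "__main__":
    for item in ("pdbaa.tar.gz", "mouse.fa", "pdbaa"):
        if os.path.exists(item):
            print(f"{item}: {total_bytes(item) / 1e6:.1f} MB")
```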
### Input requirements

Total up the amount of data in all of the files necessary to run the `blastx` wrapper job, including the executable itself.
Write down this number.
Also take note of how much total data is in the `pdbaa` directory.

!!! note "Compressed Files"
    Remember, `blastx` reads the *uncompressed* `pdbaa` files.

### Output requirements

The output that we care about from `blastx` is saved in the file whose name is given after the `-out` argument to
`blastx`.
Remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too.
Are there any other files?
Total all of these together, as well.
<!--
#removed for 2020 Virtual school since (I assume) we won't have a group discussion forum

Talk about this as a group!
---------------------------

Once you have completed the above tasks, we'll talk about the totals as a group.

- How much disk space is required on the submit server for one blastx run with the input files you used before?
  (Input data)
- How much disk space is required on the worker node? (uncompressed + output data)
- How *many* files are needed and created for each run? (Output data)
- How much total disk space would be necessary on the submit server to run 10 jobs?
  (remember that some of the files will be shared by all 10 jobs, and will not be multiplied)

Answers
-------

- Submit server: Only compressed files needed. Don't need uncompressed files on the submit server.
    - pdbaa.tar.gz: 22MB
    - blastx.tar.gz: 14MB
    - mouse.fa.tar.gz: 104K
    - Total: ~36MB
- Worker node: Compressed files + uncompressed files
    - pdbaa: 97MB
    - blastx: 39MB
    - mouse.fa: 389KB
    - results: 11MB
    - stdout: 0
    - stderr: 0
    - Compressed files: ~36MB
    - Total: ~183MB
- How many files are needed and created for each run?
    - files in pdbaa: 12
    - blastx: 1
    - mouse.fa: 1
    - results: 1
    - stdout + stderr: 2
    - total: 17
- Submit server with 10 jobs
    - Only need multiple queries, because that is what is different.
    - So pdbaa (22MB) + blastx (14MB) + 10 * mouse.fa (104K) = ~37MB
-->

<!--
## Removed 2019, not sure how users are supposed to reasonably get this info

- Assuming that each file is read completely by BLAST, and since you know how long blastx runs (time it):
    - At what rate are files read in?
    - How many MB/s?
- Rates:
    - my run, and this can vary: 198 seconds
    - 17 / 198 = 0.086 files per second (low)
    - 149 / 198 = 0.75 MB per second
-->
Up next!
--------

Next you will create an HTCondor submit file to transfer the BLAST input files in order to run BLAST on a worker node.

[Next Exercise](../part1-ex2-file-transfer)