From b11d06be84f45e09cf2074638c1f5207d25eb28e Mon Sep 17 00:00:00 2001
From: <>
Date: Sat, 3 Aug 2024 02:30:40 +0000
Subject: [PATCH] Deployed 1610cb0 with MkDocs version: 1.1.2

---
 .../part1-ex2-job-retry/index.html | 2 +-
 search/search_index.json | 2 +-
 sitemap.xml.gz | Bin 816 -> 816 bytes
 3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/materials/troubleshooting/part1-ex2-job-retry/index.html b/materials/troubleshooting/part1-ex2-job-retry/index.html
index 2846801..11ac3e1 100644
--- a/materials/troubleshooting/part1-ex2-job-retry/index.html
+++ b/materials/troubleshooting/part1-ex2-job-retry/index.html
@@ -1685,7 +1685,7 @@

Bad Job

Retrying Failed Jobs

Now let’s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption.

-

From the lecture materials, implement the max_retries feature to retry any job with a non-zero exit code up to 5 times, then resubmit the jobs. Did your change work?

+

HTCondor has a feature named max_retries that allows you to retry any job with a non-zero exit code up to 5 times. Try implementing this feature, then resubmit the jobs. Did your change work?
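
For reference, below is a minimal sketch of a submit file that uses max_retries; the executable name, the filenames, and the number of queued jobs are placeholders for illustration, not part of the exercise files.

    # Minimal sketch of a submit file using HTCondor's max_retries feature.
    # The executable and filenames here are placeholders.
    executable = bad-job.py

    log    = job.$(Cluster).$(Process).log
    output = job.$(Cluster).$(Process).out
    error  = job.$(Cluster).$(Process).err

    # Retry a job up to 5 times if it exits with a non-zero exit code
    # (with no success_exit_code set, only an exit code of 0 counts as success).
    max_retries = 5

    queue 5

After editing your own submit file along these lines, resubmit with condor_submit and watch the jobs with condor_q or the job log.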

After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the NumJobStarts attribute of a completed job with the condor_history command, in the same way you looked at the ExitCode attribute earlier. Does the number of retries seem correct? For those jobs that did need to be retried, what is their ExitCode, and what about the ExitCode from earlier execution attempts?
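
For example, assuming your completed jobs were in cluster 1234 (a placeholder ID), something like the following prints both attributes for every job in that cluster:

    # 1234 is a placeholder; substitute the cluster ID of your completed jobs.
    condor_history 1234 -af ProcId NumJobStarts ExitCode

Note that for a retried job the ExitCode attribute reflects only the final execution attempt; the exit codes of earlier attempts can be found in the job log.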

A (Too) Long Running Job

Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file:

Go to your airline website and look for the \u201cCheck In\u201d section, then follow the steps.","title":"Checking In Early"},{"location":"logistics/travel-advice/#finding-flight-status","text":"Be sure to check your flight status often, starting the day before travel begins. While you can check the status of each flight individually on the airline website (or a third-party site), you may be able to view your entire trip at once. Go to your airline website, find their section for \u201cMy Trips\u201d or something similar, and use the six-character \u201cConfirmation Number\u201d on your itinerary plus your last name to access your full itinerary, including flight status for each segment. Definitely check your flight status before leaving for the airport!","title":"Finding Flight Status"},{"location":"logistics/travel-advice/#if-your-arrival-in-madison-is-delayed","text":"If your flights change and you will arrive in Madison later than planned, think about what effect that will have: If you will arrive before Sunday, 6 p.m. (or so), you should be fine. If there is time, you can still go to the hotel first; if it is after 5:30 p.m. (or so), it may be best to go straight to the Fluno Center for the welcome dinner at 6:30 p.m. If you will arrive on Sunday but after 6 p.m. (or so), you will miss the welcome dinner. Go straight to the hotel and check in, then find dinner on your own; we can reimburse you in this case. Try to let us know about the situation, when you can. If you will arrive later than Sunday, just do your best to get here. Try to let us know about your situation as soon as you can. We can help deal things like the hotel and may be able to suggest travel options. If you need to make flight changes, see below.","title":"If Your Arrival in Madison Is Delayed"},{"location":"logistics/travel-advice/#if-your-arrival-back-home-is-delayed","text":"If you flights back home are delayed, there is not as much that we can do. For example, it is not clear whether we can pay for changes on return flights. Contact your airline to find out how they will get you home.","title":"If Your Arrival Back Home Is Delayed"},{"location":"logistics/travel-advice/#if-you-must-make-flight-changes","text":"If one or more flights are cancelled, or if we approve flight changes and their fees in advance , you will need to make new plans with your airline. If you are at an airport, it is a good idea to get in line at your airline\u2019s service counter right away. Also, you can try calling their service number while waiting in line! For any change that requires extra payment, you must get our approval and make the change through Fox World Travel , UW\u2013Madison\u2019s only approved travel agency. If you pay for a change any other way, we cannot reimburse you. Fox World Travel phone number: +1 (844) 630-3853 Note: If you call Fox World Travel on the weekend or outside of 7am\u20137:30pm (Central), they will charge us $20 just for calling. So please use this option only when you must pay for approved flight changes. If there are significant changes to your travel plans, when you have time, please email us with your news or reach out to us on Slack.","title":"If You Must Make Flight Changes"},{"location":"logistics/travel-planning/","text":"Travel To and From Madison \u00b6 Please wait to begin making travel arrangements until we email you about it. We plan to email everyone about travel in early June, but are starting with a small group to find and fix issues. 
Whether we offered to pay your travel costs or not, please make sure that we get a copy of your travel plans so that we know when to expect you here and can plan accurately. (If we offered to pay for your hotel room, we will reserve your hotel room for you.) Find the numbered section below that applies to you: 1. We Offered to Pay for Your Travel \u00b6 We want to find reasonable and comfortable travel options for you. At the same time, we must stay within budget and follow University rules about arranging and paying for your travel costs. Let\u2019s work together to find something that makes sense for everyone. Here are ideas that have helped some School travelers in past years: If you are near Madison, consider driving; we can reimburse mileage and tolls up to a point, plus parking. Or look into bus routes, especially from larger cities like Chicago. The buses are very comfortable, have wi-fi, and run frequently. If you fly, try to get flights to and from Madison (MSN) itself. In some cases, we may ask you to consider flying to Milwaukee (1\u00bd hours away) or Chicago (2\u00bd hours away), then taking a direct bus to Madison; we do this only when the costs or itinerary options to Madison are terrible. If you fly, be flexible about departure times \u2014 early and late flights are often the least expensive. We do not like very early or very late flights any more than you do, so we will work hard to find reasonable flight times. Note: Please try to complete your travel plans before about July 4th, when rates may go up. Travel by Airplane \u00b6 Do NOT buy your own airline tickets . University rules say that our travel agency, Travel Incorporated, must buy your tickets. Note: The University is changing travel agencies on 1 July 2024. Please try to complete air travel arrangements by Thursday, 27 June 2024. Use the following information to get air travel tickets: In the travel email that we sent you, click the link to Travel Incorporated\u2019s \u201cUWS Traveler Booking Form\u201d (on smartsheet.com); on that form: Group Number: Copy and paste this: UWMSN061523 Traveler Type: Select \u201cGuest\u201d Concur Profile? Select \u201cNo\u201d Destination Type: Select \u201cDomestic\u201d Will a rental be needed? Select \u201cNo\u201d \u2014 we cannot pay for a rental car Are Hotel Accommodations needed? Select \u201cNo\u201d \u2014 we will arrange your hotel room separately Guest Information: Please contact us first to bring guests We must review and approve some itineraries. Travel Inc can purchase tickets directly in many cases. But if the Travel Inc agent says that your trip must be reviewed, do not worry! It just means that we need to check the budget, options, and UW rules. We hope to approve your first choice, or we will work with you and Travel Inc to find another reasonable one. Common reasons for a trip needing review are: total trip cost over $800, travel starting and ending at different locations, and travel on dates other than August 4 and 10. Approval takes time, so it may take 1\u20132 days to get confirmation. Airplane tickets cannot be held without purchase over a weekend, so avoid contacting Travel Inc late on Fridays. Please be considerate of the Travel Inc agent(s) you work with. They work hard to find good options for you, but they must also follow our rules. If you feel that they are not providing the options that you want, you should email us . We will help resolve any issues. 
Do not argue with the Travel Inc agents, especially about options you find online \u2014 there are many reasons why that option might not be available to us. Travel by Bus \u00b6 For some nearby locations, or in addition to air travel to Chicago or Milwaukee, it may be helpful to take a bus to Madison. Bus companies that School travelers have used often in the past are: Van Galder Bus , especially from Chicago Badger Bus , especially from Milwaukee To get bus tickets, pick one method: Ask us to buy bus tickets for you in advance. This is the easiest option all around. Just email us at school@osg-htc.org ; include your desired travel dates (tickets are not specific by time), and start and end bus stations or stops. Buy bus tickets for yourself. You may purchase bus tickets yourself before or on the day of travel. If you purchase your own tickets, you must get approval from the School for the estimated cost first, then request reimbursement from us after the School. If you purchase your own tickets, save the original receipt (even if by email). It is best to have a detailed receipt (including your name, itinerary, date of purchase, and total amount paid), but a regular ticket stub (e.g., without your name or date) should work fine. Just get what you can! Be sure to email us with your bus plans, including: Transportation provider(s) (e.g., Van Galder bus) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison Actual or estimated cost (indicate which) Travel by Personal Car \u00b6 If you are driving to Madison, you will be reimbursed the mileage rate of $0.670 per mile for the shortest round-trip distance (as calculated by Google Maps), plus tolls. Also, we will pay for parking costs for the week at the hotel in Madison (but not elsewhere). We recommend keeping your receipts for tolls. Note: Due to the high mileage reimbursement rate, driving can be an expensive option! We reserve the right to limit your total driving reimbursement, so work with us on the details. To travel by personal car, please check with us first. We may search for comparable flight options, to make sure that driving is the least expensive method. Be sure to email us with your travel plans as soon as possible. Try to include: Departure date from home, location (for mileage calculation), and approximate time of arrival in Madison Departure date and approximate time from Madison, and return location (for mileage calculation) if different than above 2. We Are Not Paying for Your Travel \u00b6 If you are paying for your own travel or if someone else is paying for it, go ahead and make your travel arrangements now! Just remember to arrive on Sunday, August 4, before about 5:00 pm and depart on Saturday, August 10, or whatever dates we suggested directly to you. For other travel dates, check with us first, please! Be sure to email us with your travel plans as soon as possible. Try to include: Transportation provider(s) (e.g., airline) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison (e.g., airport, bus station, etc.)","title":"Travel planning"},{"location":"logistics/travel-planning/#travel-to-and-from-madison","text":"Please wait to begin making travel arrangements until we email you about it. We plan to email everyone about travel in early June, but are starting with a small group to find and fix issues. 
Whether we offered to pay your travel costs or not, please make sure that we get a copy of your travel plans so that we know when to expect you here and can plan accurately. (If we offered to pay for your hotel room, we will reserve your hotel room for you.) Find the numbered section below that applies to you:","title":"Travel To and From Madison"},{"location":"logistics/travel-planning/#1-we-offered-to-pay-for-your-travel","text":"We want to find reasonable and comfortable travel options for you. At the same time, we must stay within budget and follow University rules about arranging and paying for your travel costs. Let\u2019s work together to find something that makes sense for everyone. Here are ideas that have helped some School travelers in past years: If you are near Madison, consider driving; we can reimburse mileage and tolls up to a point, plus parking. Or look into bus routes, especially from larger cities like Chicago. The buses are very comfortable, have wi-fi, and run frequently. If you fly, try to get flights to and from Madison (MSN) itself. In some cases, we may ask you to consider flying to Milwaukee (1\u00bd hours away) or Chicago (2\u00bd hours away), then taking a direct bus to Madison; we do this only when the costs or itinerary options to Madison are terrible. If you fly, be flexible about departure times \u2014 early and late flights are often the least expensive. We do not like very early or very late flights any more than you do, so we will work hard to find reasonable flight times. Note: Please try to complete your travel plans before about July 4th, when rates may go up.","title":"1. We Offered to Pay for Your Travel"},{"location":"logistics/travel-planning/#travel-by-airplane","text":"Do NOT buy your own airline tickets . University rules say that our travel agency, Travel Incorporated, must buy your tickets. Note: The University is changing travel agencies on 1 July 2024. Please try to complete air travel arrangements by Thursday, 27 June 2024. Use the following information to get air travel tickets: In the travel email that we sent you, click the link to Travel Incorporated\u2019s \u201cUWS Traveler Booking Form\u201d (on smartsheet.com); on that form: Group Number: Copy and paste this: UWMSN061523 Traveler Type: Select \u201cGuest\u201d Concur Profile? Select \u201cNo\u201d Destination Type: Select \u201cDomestic\u201d Will a rental be needed? Select \u201cNo\u201d \u2014 we cannot pay for a rental car Are Hotel Accommodations needed? Select \u201cNo\u201d \u2014 we will arrange your hotel room separately Guest Information: Please contact us first to bring guests We must review and approve some itineraries. Travel Inc can purchase tickets directly in many cases. But if the Travel Inc agent says that your trip must be reviewed, do not worry! It just means that we need to check the budget, options, and UW rules. We hope to approve your first choice, or we will work with you and Travel Inc to find another reasonable one. Common reasons for a trip needing review are: total trip cost over $800, travel starting and ending at different locations, and travel on dates other than August 4 and 10. Approval takes time, so it may take 1\u20132 days to get confirmation. Airplane tickets cannot be held without purchase over a weekend, so avoid contacting Travel Inc late on Fridays. Please be considerate of the Travel Inc agent(s) you work with. They work hard to find good options for you, but they must also follow our rules. 
If you feel that they are not providing the options that you want, you should email us . We will help resolve any issues. Do not argue with the Travel Inc agents, especially about options you find online \u2014 there are many reasons why that option might not be available to us.","title":"Travel by Airplane"},{"location":"logistics/travel-planning/#travel-by-bus","text":"For some nearby locations, or in addition to air travel to Chicago or Milwaukee, it may be helpful to take a bus to Madison. Bus companies that School travelers have used often in the past are: Van Galder Bus , especially from Chicago Badger Bus , especially from Milwaukee To get bus tickets, pick one method: Ask us to buy bus tickets for you in advance. This is the easiest option all around. Just email us at school@osg-htc.org ; include your desired travel dates (tickets are not specific by time), and start and end bus stations or stops. Buy bus tickets for yourself. You may purchase bus tickets yourself before or on the day of travel. If you purchase your own tickets, you must get approval from the School for the estimated cost first, then request reimbursement from us after the School. If you purchase your own tickets, save the original receipt (even if by email). It is best to have a detailed receipt (including your name, itinerary, date of purchase, and total amount paid), but a regular ticket stub (e.g., without your name or date) should work fine. Just get what you can! Be sure to email us with your bus plans, including: Transportation provider(s) (e.g., Van Galder bus) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison Actual or estimated cost (indicate which)","title":"Travel by Bus"},{"location":"logistics/travel-planning/#travel-by-personal-car","text":"If you are driving to Madison, you will be reimbursed the mileage rate of $0.670 per mile for the shortest round-trip distance (as calculated by Google Maps), plus tolls. Also, we will pay for parking costs for the week at the hotel in Madison (but not elsewhere). We recommend keeping your receipts for tolls. Note: Due to the high mileage reimbursement rate, driving can be an expensive option! We reserve the right to limit your total driving reimbursement, so work with us on the details. To travel by personal car, please check with us first. We may search for comparable flight options, to make sure that driving is the least expensive method. Be sure to email us with your travel plans as soon as possible. Try to include: Departure date from home, location (for mileage calculation), and approximate time of arrival in Madison Departure date and approximate time from Madison, and return location (for mileage calculation) if different than above","title":"Travel by Personal Car"},{"location":"logistics/travel-planning/#2-we-are-not-paying-for-your-travel","text":"If you are paying for your own travel or if someone else is paying for it, go ahead and make your travel arrangements now! Just remember to arrive on Sunday, August 4, before about 5:00 pm and depart on Saturday, August 10, or whatever dates we suggested directly to you. For other travel dates, check with us first, please! Be sure to email us with your travel plans as soon as possible. Try to include: Transportation provider(s) (e.g., airline) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison (e.g., airport, bus station, etc.)","title":"2. 
We Are Not Paying for Your Travel"},{"location":"logistics/visas/","text":"Documentation Requirements for Non-Resident Aliens \u00b6 This page is for Non-Resident Aliens only. If you are a United States citizen or permanent resident or member of the UW\u2013Madison community, this page does not apply to you. For the University of Wisconsin to pay for your travel, hotel, or meal expenses, we must have certain personal information from you. We collect as little information as possible and do not share it except with University staff who need it. Most of what we need comes from the online form you completed after accepting our invitation to attend. When you come to the School in Madison, we will need to look at and verify your travel documents. Please bring all travel documents to the School! See below for details. Tasks To Do Now \u00b6 Please check your passport and visa for travel in the United States now. Make sure that all documents are valid from now and until after the School ends. If any documents are expired or will expire before the end of the School: Tell us immediately, so that we can help you Begin the process for updating your documents immediately Do whatever you can to expedite the update process The University of Wisconsin cannot pay for or reimburse you for costs without valid travel documents. We have no control over this policy and there are no exceptions. If you are in the United States on a J-1 Scholar visa, there are extra steps needed to make the University and Federal government happy. If you have a J-1 visa and have not heard from us about it already, please email us immediately so that we can help. Documents to Bring to the School \u00b6 When you come to Madison, you must bring: Passport U.S. visa U.S. Customs and Border Protection form I-94 If you entered the U.S. before 30 April 2013, the I-94 should be stapled into your passport \u2014 do not remove it! If you entered the U.S. after 30 April 2013, the I-94 is stored electronically; you can request a copy to print from CBP If you are Canadian, you may use a second form of picture ID instead of the I-94 if you did not obtain an I-94. Additional forms specified in the table below: If you have this visa We will also need F-1 (Student) Form I-20 (original document, not a copy) J-1 (Visitor) Form DS-2019 (original document, not a copy) Visa Waiver Program Paper copy of ESTA Authorization Please bring all required information and documents to the School, especially on Tuesday, August 6. School staff will make copies of the documents and return them to you as quickly as possible. We will announce further details in class.","title":"Visa requirements"},{"location":"logistics/visas/#documentation-requirements-for-non-resident-aliens","text":"This page is for Non-Resident Aliens only. If you are a United States citizen or permanent resident or member of the UW\u2013Madison community, this page does not apply to you. For the University of Wisconsin to pay for your travel, hotel, or meal expenses, we must have certain personal information from you. We collect as little information as possible and do not share it except with University staff who need it. Most of what we need comes from the online form you completed after accepting our invitation to attend. When you come to the School in Madison, we will need to look at and verify your travel documents. Please bring all travel documents to the School! 
See below for details.","title":"Documentation Requirements for Non-Resident Aliens"},{"location":"logistics/visas/#tasks-to-do-now","text":"Please check your passport and visa for travel in the United States now. Make sure that all documents are valid from now and until after the School ends. If any documents are expired or will expire before the end of the School: Tell us immediately, so that we can help you Begin the process for updating your documents immediately Do whatever you can to expedite the update process The University of Wisconsin cannot pay for or reimburse you for costs without valid travel documents. We have no control over this policy and there are no exceptions. If you are in the United States on a J-1 Scholar visa, there are extra steps needed to make the University and Federal government happy. If you have a J-1 visa and have not heard from us about it already, please email us immediately so that we can help.","title":"Tasks To Do Now"},{"location":"logistics/visas/#documents-to-bring-to-the-school","text":"When you come to Madison, you must bring: Passport U.S. visa U.S. Customs and Border Protection form I-94 If you entered the U.S. before 30 April 2013, the I-94 should be stapled into your passport \u2014 do not remove it! If you entered the U.S. after 30 April 2013, the I-94 is stored electronically; you can request a copy to print from CBP If you are Canadian, you may use a second form of picture ID instead of the I-94 if you did not obtain an I-94. Additional forms specified in the table below: If you have this visa We will also need F-1 (Student) Form I-20 (original document, not a copy) J-1 (Visitor) Form DS-2019 (original document, not a copy) Visa Waiver Program Paper copy of ESTA Authorization Please bring all required information and documents to the School, especially on Tuesday, August 6. School staff will make copies of the documents and return them to you as quickly as possible. We will announce further details in class.","title":"Documents to Bring to the School"},{"location":"materials/","text":"OSG School Materials \u00b6 School Overview and Intro \u00b6 View the slides: [Slides coming soon] Intro to HTC and HTCondor Job Execution \u00b6 Intro to HTC Slides \u00b6 Intro to HTC: [Slides coming soon] Worksheet: [Slides coming soon] Intro to HTCondor Slides \u00b6 View the slides: pdf Intro Exercises 1: Running and Viewing Simple Jobs (Strongly Recommended) \u00b6 Exercise 1.1: Log in to the local submit machine and look around Exercise 1.2: Experiment with HTCondor commands Exercise 1.3: Run jobs! 
Exercise 1.4: Read and interpret log files Exercise 1.5: Determining Resource Needs Exercise 1.6: Remove jobs from the queue Bonus Exercises: Job Attributes and Handling \u00b6 Bonus Exercise 1.7: Compile and run some C code Bonus Exercise 1.8: Explore condor_q Bonus Exercise 1.9: Explore condor_status Intro to HTCondor Multiple Job Execution \u00b6 View the Slides: [Slides coming soon] Intro Exercises 2: Running Many HTC Jobs (Strongly Recommended) \u00b6 Exercise 2.1: Work with input and output files Exercise 2.2: Use queue N , $(Cluster) , and $(Process) Exercise 2.3: Use queue from with custom variables Bonus Exercise 2.4: Use queue matching with a custom variable OSG \u00b6 View the slides: [Slides coming soon] OSG Exercises: Comparing PATh and OSG (Strongly Recommended) \u00b6 Exercise 1.1: Log in to the OSPool Access Point Exercise 1.2: Running jobs in the OSPool Exercise 1.3: Hardware differences between PATh and OSG Exercise 1.4: Software differences in OSPool Troubleshooting \u00b6 Slides: [Slides coming soon] Troubleshooting Exercises: \u00b6 Exercise 1.1: Troubleshooting Jobs Exercise 1.2: Job Retry Software \u00b6 Slides: [Slides coming soon] Software Exercises 1: Exploring Containers \u00b6 Exercise 1.1: Run and Explore Apptainer Containers Exercise 1.2: Use Apptainer Containers in OSPool Jobs Exercise 1.3: Use Docker Containers in OSPool Jobs Exercise 1.4: Build, Test, and Deploy an Apptainer Container Exercise 1.5: Choose Software Options Software Exercises 2: Preparing Scripts \u00b6 Exercise 2.1: Build an HTC-Friendly Executable Software Exercises 3: Container Examples (Optional) \u00b6 Exercise 3.1: Create an Apptainer Definition Files Exercise 3.2: Build Your Own Docker Container Software Exercises 4: Exploring Compiled Software (Optional) \u00b6 Exercise 4.1: Download and Use Compiled Software Exercise 4.2: Use a Wrapper Script To Run Software Exercise 4.3: Using Arguments With Wrapper Scripts Software Exercises 5: Compiled Software Examples (Optional) \u00b6 Exercise 5.1: Compiling a Research Software Exercise 5.2: Compiling Python and Running Jobs Exercise 5.3: Using Conda Environments Exercise 5.4: Compiling and Running a Simple Code Data \u00b6 View the slides: [Slides coming soon] Data Exercises 1: HTCondor File Transfer (Strongly Recommended) \u00b6 Exercise 1.1: Understanding a job's data needs Exercise 1.2: transfer_input_files, transfer_output_files, and remaps Exercise 1.3: Splitting input Data Exercises 2: Using OSDF (Strongly Recommended) \u00b6 Exercise 2.1: OSDF for inputs Exercise 2.2: OSDF for outputs Scaling Up \u00b6 View the slides: [Slides coming soon] Scaling Up Exercises \u00b6 Exercise 1.1: Organizing HTC workloads Exercise 1.2: Investigating Job Attributes Exercise 1.3: Getting Job Information from Log Files Workflows with DAGMan \u00b6 View the slides: [Slides coming soon] DAGMan Exercises 1 \u00b6 Exercise 1.1: Coordinating set of jobs: A simple DAG Exercise 1.2: A brief detour through the Mandelbrot set Exercise 1.3: A more complex DAG Exercise 1.4: Handling jobs that fail with DAGMan Exercise 1.5: Workflow Challenges Extra Topics \u00b6 Self-checkpointing for long-running jobs \u00b6 View the slides: [Slides coming soon] Exercise 1.1: Trying out self-checkpointing Special Environments \u00b6 View the slides: [Slides coming soon] Special Environments Exercises 1 \u00b6 Exercise 1.1: GPUs Introduction to Research Computing Facilitation \u00b6 View the slides: [Slides coming soon] Final Talks \u00b6 Philosophy: [Slides coming soon] Final 
thoughts: [Slides coming soon]","title":"Overview"},{"location":"materials/#osg-school-materials","text":"","title":"OSG School Materials"},{"location":"materials/#school-overview-and-intro","text":"View the slides: [Slides coming soon]","title":"School Overview and Intro"},{"location":"materials/#intro-to-htc-and-htcondor-job-execution","text":"","title":"Intro to HTC and HTCondor Job Execution"},{"location":"materials/#intro-to-htc-slides","text":"Intro to HTC: [Slides coming soon] Worksheet: [Slides coming soon]","title":"Intro to HTC Slides"},{"location":"materials/#intro-to-htcondor-slides","text":"View the slides: pdf","title":"Intro to HTCondor Slides"},{"location":"materials/#intro-exercises-1-running-and-viewing-simple-jobs-strongly-recommended","text":"Exercise 1.1: Log in to the local submit machine and look around Exercise 1.2: Experiment with HTCondor commands Exercise 1.3: Run jobs! Exercise 1.4: Read and interpret log files Exercise 1.5: Determining Resource Needs Exercise 1.6: Remove jobs from the queue","title":"Intro Exercises 1: Running and Viewing Simple Jobs (Strongly Recommended)"},{"location":"materials/#bonus-exercises-job-attributes-and-handling","text":"Bonus Exercise 1.7: Compile and run some C code Bonus Exercise 1.8: Explore condor_q Bonus Exercise 1.9: Explore condor_status","title":"Bonus Exercises: Job Attributes and Handling"},{"location":"materials/#intro-to-htcondor-multiple-job-execution","text":"View the Slides: [Slides coming soon]","title":"Intro to HTCondor Multiple Job Execution"},{"location":"materials/#intro-exercises-2-running-many-htc-jobs-strongly-recommended","text":"Exercise 2.1: Work with input and output files Exercise 2.2: Use queue N , $(Cluster) , and $(Process) Exercise 2.3: Use queue from with custom variables Bonus Exercise 2.4: Use queue matching with a custom variable","title":"Intro Exercises 2: Running Many HTC Jobs (Strongly Recommended)"},{"location":"materials/#osg","text":"View the slides: [Slides coming soon]","title":"OSG"},{"location":"materials/#osg-exercises-comparing-path-and-osg-strongly-recommended","text":"Exercise 1.1: Log in to the OSPool Access Point Exercise 1.2: Running jobs in the OSPool Exercise 1.3: Hardware differences between PATh and OSG Exercise 1.4: Software differences in OSPool","title":"OSG Exercises: Comparing PATh and OSG (Strongly Recommended)"},{"location":"materials/#troubleshooting","text":"Slides: [Slides coming soon]","title":"Troubleshooting"},{"location":"materials/#troubleshooting-exercises","text":"Exercise 1.1: Troubleshooting Jobs Exercise 1.2: Job Retry","title":"Troubleshooting Exercises:"},{"location":"materials/#software","text":"Slides: [Slides coming soon]","title":"Software"},{"location":"materials/#software-exercises-1-exploring-containers","text":"Exercise 1.1: Run and Explore Apptainer Containers Exercise 1.2: Use Apptainer Containers in OSPool Jobs Exercise 1.3: Use Docker Containers in OSPool Jobs Exercise 1.4: Build, Test, and Deploy an Apptainer Container Exercise 1.5: Choose Software Options","title":"Software Exercises 1: Exploring Containers"},{"location":"materials/#software-exercises-2-preparing-scripts","text":"Exercise 2.1: Build an HTC-Friendly Executable","title":"Software Exercises 2: Preparing Scripts"},{"location":"materials/#software-exercises-3-container-examples-optional","text":"Exercise 3.1: Create an Apptainer Definition Files Exercise 3.2: Build Your Own Docker Container","title":"Software Exercises 3: Container Examples 
(Optional)"},{"location":"materials/#software-exercises-4-exploring-compiled-software-optional","text":"Exercise 4.1: Download and Use Compiled Software Exercise 4.2: Use a Wrapper Script To Run Software Exercise 4.3: Using Arguments With Wrapper Scripts","title":"Software Exercises 4: Exploring Compiled Software (Optional)"},{"location":"materials/#software-exercises-5-compiled-software-examples-optional","text":"Exercise 5.1: Compiling a Research Software Exercise 5.2: Compiling Python and Running Jobs Exercise 5.3: Using Conda Environments Exercise 5.4: Compiling and Running a Simple Code","title":"Software Exercises 5: Compiled Software Examples (Optional)"},{"location":"materials/#data","text":"View the slides: [Slides coming soon]","title":"Data"},{"location":"materials/#data-exercises-1-htcondor-file-transfer-strongly-recommended","text":"Exercise 1.1: Understanding a job's data needs Exercise 1.2: transfer_input_files, transfer_output_files, and remaps Exercise 1.3: Splitting input","title":"Data Exercises 1: HTCondor File Transfer (Strongly Recommended)"},{"location":"materials/#data-exercises-2-using-osdf-strongly-recommended","text":"Exercise 2.1: OSDF for inputs Exercise 2.2: OSDF for outputs","title":"Data Exercises 2: Using OSDF (Strongly Recommended)"},{"location":"materials/#scaling-up","text":"View the slides: [Slides coming soon]","title":"Scaling Up"},{"location":"materials/#scaling-up-exercises","text":"Exercise 1.1: Organizing HTC workloads Exercise 1.2: Investigating Job Attributes Exercise 1.3: Getting Job Information from Log Files","title":"Scaling Up Exercises"},{"location":"materials/#workflows-with-dagman","text":"View the slides: [Slides coming soon]","title":"Workflows with DAGMan"},{"location":"materials/#dagman-exercises-1","text":"Exercise 1.1: Coordinating set of jobs: A simple DAG Exercise 1.2: A brief detour through the Mandelbrot set Exercise 1.3: A more complex DAG Exercise 1.4: Handling jobs that fail with DAGMan Exercise 1.5: Workflow Challenges","title":"DAGMan Exercises 1"},{"location":"materials/#extra-topics","text":"","title":"Extra Topics"},{"location":"materials/#self-checkpointing-for-long-running-jobs","text":"View the slides: [Slides coming soon] Exercise 1.1: Trying out self-checkpointing","title":"Self-checkpointing for long-running jobs"},{"location":"materials/#special-environments","text":"View the slides: [Slides coming soon]","title":"Special Environments"},{"location":"materials/#special-environments-exercises-1","text":"Exercise 1.1: GPUs","title":"Special Environments Exercises 1"},{"location":"materials/#introduction-to-research-computing-facilitation","text":"View the slides: [Slides coming soon]","title":"Introduction to Research Computing Facilitation"},{"location":"materials/#final-talks","text":"Philosophy: [Slides coming soon] Final thoughts: [Slides coming soon]","title":"Final Talks"},{"location":"materials/checkpoint/part1-ex1-checkpointing/","text":"Self-Checkpointing Exercise 1.1: Trying It Out \u00b6 The goal of this exercise is to practice writing a submit file for self-checkpointing, and to see the process in action. Calculating Fibonacci numbers \u2026 slowly \u00b6 The sample code for this exercise calculates the Fibonacci number resulting from a given set of iterations. Because this is a trival computation, the code includes a delay in each iteration through the main loop; this simulates a more intensive computation. 
To get set up: Log in to ap40.uw.osg-htc.org ( ap1 is fine, too) Create and change into a new directory for this exercise Download the Python script that is the main executable for this exercise: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/fibonacci.py If you want to run the script directly, make it executable first: user@server $ chmod 0755 fibonacci.py Take a look at the code, if you like. It is not very elegant, but it gets the job done. A few notes: The script takes a single argument, the number of iterations to run. To minimize computing time while leaving time to explore, 10 is a good number of iterations. The script checkpoints every other iteration through the main loop. The exit status code for a checkpoint is 85. It prints some output to standard out along the way, to let you know what is going on. The final result is written to a separate file named fibonacci.result . This file does not exist until the very end of the complete run. It is safe to run from the command line on an access point: user@server $ ./fibonacci.py 10 If you run it, what happens? (Due to the 30-second delay, be patient.) Can you explain its behavior? What happens if you run it again, without changing any files in between? Why? Preparing to run \u00b6 Now you have an executable and you know how to run it. It is time to prepare it for submission to HTCondor! Using what you know about the script (above), and using information in the slides from today, try writing a submit file that runs this software and implements exit-driven self-checkpointing. The Python code itself is ready and should not need any changes. Just use a plain queue statement, one job is enough to experiment on. Before you submit, read the next section first! Running and monitoring \u00b6 With the 30-second delay per iteration in the code and the suggested 10 iterations, once the script starts running you have about 5 minutes of runtime in which to see what is going on. So it may help to read through this section and then return here and submit your job. If your job has problems or finishes before you have the chance to do all the steps below, just remove the extra files (besides the Python script and your submit file) and try again! Submission and first checkpoint \u00b6 Submit the job Look at the contents of the submit directory \u2014 what changed? Start watching the log file: tail -n 100 -f YOUR-LOG-FILENAME.log Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically. Thus, nothing much will happen until HTCondor starts running your job. When it does, you will see three sets of messages in the log file quickly: Started transferring input files Finished transferring input files Job executing on host: (Of course, each message will contain a lot of other characters!) Now wait about 1 minute, and you should see two more messages appear: Started transferring output files Finished transferring output files That is the first checkpoint happening! Forcing your job to stop running \u00b6 Now, assuming that your job is still running (check condor_q again), you can force HTCondor to remove ( evict ) your job before it finishes: Run condor_q to get the job ID of the running job Run condor_vacate_job JOB_ID , where you replace JOB_ID with your job ID from above Monitor the action again by running tail -n 100 -f YOUR-LOG-FILENAME.log Finishing the job and wrap-up \u00b6 Be patient again! You removed your running job, and so HTCondor put it back in the queue as idle. 
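If you would like to watch the state change while you wait, run condor_q every so often: user@server $ condor_q The job should appear as idle and then return to running. Afterward, one simple way to count how many times the job has started is user@server $ grep -c 'Job executing on host' YOUR-LOG-FILENAME.log (replace the file name with your own log file), since that message is written to the log once per execution attempt.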
If you wait a minute or two, you should see that HTCondor starts running the job again. In the log file, look carefully for the two Job executing on host: messages. Does it seem like you ran on the same computer again or on a different one? Both are possible! Let your job finish running this time. There should be a Job terminated of its own accord message near the end. Did you get results? Go through all the files and see what they contain. The log and output files are probably the most interesting. But did you get a result file, too? Did the output file \u2014 that is, whatever file you named in the output line of your submit file \u2014 contain everything that you expected it to? Conclusion \u00b6 This has been a brief and simple tour of self-checkpointing. If you would like to learn more, please read the Self-Checkpointing Applications section of the HTCondor Manual. Or talk to School staff about it. Or contact support@osg-htc.org for further help at any time.","title":"1.1 - Trying out self-checkpointing"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#self-checkpointing-exercise-11-trying-it-out","text":"The goal of this exercise is to practice writing a submit file for self-checkpointing, and to see the process in action.","title":"Self-Checkpointing Exercise 1.1: Trying It Out"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#calculating-fibonacci-numbers-slowly","text":"The sample code for this exercise calculates the Fibonacci number resulting from a given set of iterations. Because this is a trival computation, the code includes a delay in each iteration through the main loop; this simulates a more intensive computation. To get set up: Log in to ap40.uw.osg-htc.org ( ap1 is fine, too) Create and change into a new directory for this exercise Download the Python script that is the main executable for this exercise: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/fibonacci.py If you want to run the script directly, make it executable first: user@server $ chmod 0755 fibonacci.py Take a look at the code, if you like. It is not very elegant, but it gets the job done. A few notes: The script takes a single argument, the number of iterations to run. To minimize computing time while leaving time to explore, 10 is a good number of iterations. The script checkpoints every other iteration through the main loop. The exit status code for a checkpoint is 85. It prints some output to standard out along the way, to let you know what is going on. The final result is written to a separate file named fibonacci.result . This file does not exist until the very end of the complete run. It is safe to run from the command line on an access point: user@server $ ./fibonacci.py 10 If you run it, what happens? (Due to the 30-second delay, be patient.) Can you explain its behavior? What happens if you run it again, without changing any files in between? Why?","title":"Calculating Fibonacci numbers … slowly"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#preparing-to-run","text":"Now you have an executable and you know how to run it. It is time to prepare it for submission to HTCondor! Using what you know about the script (above), and using information in the slides from today, try writing a submit file that runs this software and implements exit-driven self-checkpointing. The Python code itself is ready and should not need any changes. Just use a plain queue statement, one job is enough to experiment on. 
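If you get stuck, here is one minimal sketch to compare against; treat it as an example rather than the required answer, and note that the output, error, and log file names and the resource requests below are placeholder values: executable = fibonacci.py arguments = 10 checkpoint_exit_code = 85 output = fibonacci.out error = fibonacci.err log = fibonacci.log request_cpus = 1 request_memory = 256MB request_disk = 256MB queue The important line is checkpoint_exit_code = 85 : it tells HTCondor to treat exit code 85 (the code this script uses when it checkpoints) as a checkpoint rather than as the end of the job, so the sandbox files are transferred back and the script is started again.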
Before you submit, read the next section first!","title":"Preparing to run"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#running-and-monitoring","text":"With the 30-second delay per iteration in the code and the suggested 10 iterations, once the script starts running you have about 5 minutes of runtime in which to see what is going on. So it may help to read through this section and then return here and submit your job. If your job has problems or finishes before you have the chance to do all the steps below, just remove the extra files (besides the Python script and your submit file) and try again!","title":"Running and monitoring"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#submission-and-first-checkpoint","text":"Submit the job Look at the contents of the submit directory \u2014 what changed? Start watching the log file: tail -n 100 -f YOUR-LOG-FILENAME.log Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically. Thus, nothing much will happen until HTCondor starts running your job. When it does, you will see three sets of messages in the log file quickly: Started transferring input files Finished transferring input files Job executing on host: (Of course, each message will contain a lot of other characters!) Now wait about 1 minute, and you should see two more messages appear: Started transferring output files Finished transferring output files That is the first checkpoint happening!","title":"Submission and first checkpoint"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#forcing-your-job-to-stop-running","text":"Now, assuming that your job is still running (check condor_q again), you can force HTCondor to remove ( evict ) your job before it finishes: Run condor_q to get the job ID of the running job Run condor_vacate_job JOB_ID , where you replace JOB_ID with your job ID from above Monitor the action again by running tail -n 100 -f YOUR-LOG-FILENAME.log","title":"Forcing your job to stop running"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#finishing-the-job-and-wrap-up","text":"Be patient again! You removed your running job, and so HTCondor put it back in the queue as idle. If you wait a minute or two, you should see that HTCondor starts running the job again. In the log file, look carefully for the two Job executing on host: messages. Does it seem like you ran on the same computer again or on a different one? Both are possible! Let your job finish running this time. There should be a Job terminated of its own accord message near the end. Did you get results? Go through all the files and see what they contain. The log and output files are probably the most interesting. But did you get a result file, too? Did the output file \u2014 that is, whatever file you named in the output line of your submit file \u2014 contain everything that you expected it to?","title":"Finishing the job and wrap-up"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#conclusion","text":"This has been a brief and simple tour of self-checkpointing. If you would like to learn more, please read the Self-Checkpointing Applications section of the HTCondor Manual. Or talk to School staff about it. 
Or contact support@osg-htc.org for further help at any time.","title":"Conclusion"},{"location":"materials/data/part1-ex1-data-needs/","text":"Data Exercise 1.1: Understanding Data Requirements \u00b6 Exercise Goal \u00b6 This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools for delivering large data to jobs. In this exercise we will attempt to understand the input and output of the bioinformatics application BLAST . Setup \u00b6 For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Create a directory for this exercise named blast-data and change into it Copy the Input Files \u00b6 To run BLAST, we need the executable, input file, and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Copy the BLAST executables: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ncbi-blast-2.12.0+-x64-linux.tar.gz user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz Download these files to your current directory: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: user@ap40 $ tar -xzvf pdbaa.tar.gz Understanding BLAST \u00b6 Remember that blastx is executed in a command like the following: user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db -query -out In the above, the is the name of a file containing a number of genetic sequences (e.g. mouse.fa ), and the database that these are compared against is made up of several files that begin with the same , (e.g. pdbaa/pdbaa ). The output from this analysis will be printed to that is also indicated in the command. Calculating Data Needs \u00b6 Using the files that you prepared in blast-data , we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script, where the job: Transfers all of its input files (including the executable) as tarballs Untars the input files tarballs on the execute host Runs blastx using the untarred input files Here are some commands that will be useful for calculating your job's storage needs: List the size of a specific file: user@ap40 $ ls -lh List the sizes of all files in the current directory: user@ap40 $ ls -lh Sum the size of all files in a specific directory: user@ap40 $ du -sh Input requirements \u00b6 Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number. Also take note of how much total data is in the pdbaa directory. Compressed Files Remember, blastx reads the un-compressed pdbaa files. Output requirements \u00b6 The output that we care about from blastx is saved in the file whose name is indicated after the -out argument to blastx . Also, remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together, as well. Up next! \u00b6 Next you will create a HTCondor submit script to transfer the Blast input files in order to run Blast on a worker nodes. 
Next Exercise","title":"1.1 - Understanding a job's data needs"},{"location":"materials/data/part1-ex1-data-needs/#data-exercise-11-understanding-data-requirements","text":"","title":"Data Exercise 1.1: Understanding Data Requirements"},{"location":"materials/data/part1-ex1-data-needs/#exercise-goal","text":"This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools for delivering large data to jobs. In this exercise we will attempt to understand the input and output of the bioinformatics application BLAST .","title":"Exercise Goal"},{"location":"materials/data/part1-ex1-data-needs/#setup","text":"For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Create a directory for this exercise named blast-data and change into it","title":"Setup"},{"location":"materials/data/part1-ex1-data-needs/#copy-the-input-files","text":"To run BLAST, we need the executable, input file, and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Copy the BLAST executables: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ncbi-blast-2.12.0+-x64-linux.tar.gz user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz Download these files to your current directory: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: user@ap40 $ tar -xzvf pdbaa.tar.gz","title":"Copy the Input Files"},{"location":"materials/data/part1-ex1-data-needs/#understanding-blast","text":"Remember that blastx is executed in a command like the following: user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db -query -out In the above, the is the name of a file containing a number of genetic sequences (e.g. mouse.fa ), and the database that these are compared against is made up of several files that begin with the same , (e.g. pdbaa/pdbaa ). The output from this analysis will be printed to that is also indicated in the command.","title":"Understanding BLAST"},{"location":"materials/data/part1-ex1-data-needs/#calculating-data-needs","text":"Using the files that you prepared in blast-data , we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script, where the job: Transfers all of its input files (including the executable) as tarballs Untars the input files tarballs on the execute host Runs blastx using the untarred input files Here are some commands that will be useful for calculating your job's storage needs: List the size of a specific file: user@ap40 $ ls -lh List the sizes of all files in the current directory: user@ap40 $ ls -lh Sum the size of all files in a specific directory: user@ap40 $ du -sh ","title":"Calculating Data Needs"},{"location":"materials/data/part1-ex1-data-needs/#input-requirements","text":"Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number. Also take note of how much total data is in the pdbaa directory. 
Compressed Files Remember, blastx reads the un-compressed pdbaa files.","title":"Input requirements"},{"location":"materials/data/part1-ex1-data-needs/#output-requirements","text":"The output that we care about from blastx is saved in the file whose name is indicated after the -out argument to blastx . Also, remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together, as well.","title":"Output requirements"},{"location":"materials/data/part1-ex1-data-needs/#up-next","text":"Next you will create a HTCondor submit script to transfer the Blast input files in order to run Blast on a worker nodes. Next Exercise","title":"Up next!"},{"location":"materials/data/part1-ex2-file-transfer/","text":"Data Exercise 1.2: transfer_input_files, transfer_output_files, and remaps \u00b6 Exercise Goal \u00b6 The objective of this exercise is to refresh yourself on HTCondor file transfer, to implement file compression, and to begin examining the memory and disk space used by your jobs in order to plan larger batches. We will also explore ways to deal with output data. Setup \u00b6 The executable we'll use in this exercise and later today is the same blastx executable from previous exercises. Log in to ap40: $ ssh @ap40.uw.osg-htc.org Then change into the blast-data folder that you created in the previous exercise. Review: HTCondor File Transfer \u00b6 Recall that OSG does NOT have a shared filesystem! Instead, HTCondor transfers your executable and input files (specified with the executable and transfer_input_files submit file directives, respectively) to a working directory on the execute node, regardless of how these files were arranged on the submit node. In this exercise we'll use the same blastx example job that we used previously, but modify the submit file and test how much memory and disk space it uses on the execute node. Start with a test submit file \u00b6 We've started a submit file for you, below, which you'll add to in the remaining steps. executable = transfer_input_files = output = test.out error = test.err log = test.log request_memory = request_disk = request_cpus = 1 requirements = (OSGVO_OS_STRING == \"RHEL 9\") queue Implement file compression \u00b6 In our first blast job from the Software exercises ( 1.1 ), the database files in the pdbaa directory were all transferred, as is, but we could instead transfer them as a single, compressed file using tar . For this version of the job, let's compress our blast database files to send them to the submit node as a single tar.gz file (otherwise known as a tarball), by following the below steps: Change into the pdbaa directory and compress the database files into a single file called pdbaa_files.tar.gz using the tar command. Note that this file will be different from the pdbaa.tar.gz file that you used earlier, because it will only contain the pdbaa files, and not the pdbaa directory, itself.) Remember, a typical command for creating a tar file is: user@ap40 $ tar -cvzf Replacing with the name of the tarball that you would like to create and with a space-separated list of files and/or directories that you want inside pdbaa_files.tar.gz. Move the resulting tarball to the blast-data directory. Create a wrapper script that will first decompress the pdbaa_files.tar.gz file, and then run blast. Because this file will now be our executable in the submit file, we'll also end up transferring the blastx executable with transfer_input_files . 
In the blast-data directory, create a new file, called blast_wrapper.sh , with the following contents: #!/bin/bash tar -xzvf pdbaa_files.tar.gz ./blastx -db pdbaa -query mouse.fa -out mouse.fa.result rm pdbaa.* Also remember to make the script executable: chmod +x blast_wrapper.sh Extra Files! The last line removes the resulting database files that came from pdbaa_files.tar.gz , as these files would otherwise be copied back to the submit server as perceived output since they're \"new\" files that HTCondor didn't transfer over as input. List the executable and input files \u00b6 Make sure to update the submit file with the following: Add the new executable (the wrapper script you created above) In transfer_input_files , list the blastx binary, the pdbaa_files.tar.gz file, and the input query file. Commas, commas everywhere! Remember that transfer_input_files accepts a comma separated list of files, and that you need to list the full location of the blastx executable ( blastx ). There will be no arguments, since the arguments to the blastx command are now captured in the wrapper script. Predict memory and disk requests from your data \u00b6 Also, think about how much memory and disk to request for this job. It's good to start with values that are a little higher than you think a test job will need, but think about: How much memory blastx would use if it loaded all of the database files and the query input file into memory. How much disk space will be necessary on the execute server for the executable, all input files, and all output files (hint: the log file only exists on the submit node). Whether you'd like to request some extra memory or disk space, just in case Look at the log file for your blastx job from Software exercise ( 1.1 ), and compare the memory and disk \"Usage\" to what you predicted from the files. Make sure to update the submit file with more accurate memory and disk requests. You may still want to request slightly more than the job actually used. Run the test job \u00b6 Once you have finished editing the submit file, go ahead and submit the job. It should take a few minutes to complete, and then you can check to make sure that no unwanted files (especially the pdbaa database files) were copied back at the end of the job. Run a du -sh on the directory with this job's input. How does it compare to the directory from Software exercise ( 1.1 ), and why? transfer_output_files \u00b6 So far, we have used HTCondor's new file detection to transfer back the newly created files. An alternative is to be explicit, using the transfer_output_files attribute in the submit file. The upside to this approach is that you can pick to only transfer back a subset of the created files. The downside is that you have to know which files are created. The first exercise is to modify the submit file from the previous example, and add a line like (remember, before the queue ): transfer_output_files = mouse.fa.result You may also remove the last line in the blast_wrapper.sh , the rm pdbaa.* as extra files are no longer an issue - those files will be ignored because we used transfer_output_files . Submit the job, and make sure everything works. Did you get any pdbaa.* files back? The next thing we should try is to see what happens if the file we specify does not exist. Modify your submit file, and change the transfer_output_files to: transfer_output_files = elephant.fa.result Submit the job and see how it behaves. Did it finish successfully? 
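Before moving on to remaps, here is a sketch of what the assembled submit file might look like at this point, with transfer_output_files set back to the real output. The blastx path assumes you copied the binary into blast-data, and the request values are placeholders to replace with the usage you measured from your own log files:
executable = blast_wrapper.sh
transfer_input_files = blastx, pdbaa_files.tar.gz, mouse.fa
transfer_output_files = mouse.fa.result
output = test.out
error = test.err
log = test.log
request_memory = 1GB
request_disk = 1GB
request_cpus = 1
requirements = (OSGVO_OS_STRING == "RHEL 9")
queue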
transfer_output_remaps \u00b6 Related to transfer_output_files is transfer_output_remaps , which allows us to rename outputs, or map the outputs to a different storage system (will be explored in the next module). The format of the transfer_output_remaps attribute is a list of remaps, each remap taking the form of src=dst . The destination can be a local path, or a URL. For example: transfer_output_remaps = \"myresults.dat = s3://destination-server.com/myresults.dat\" If you have more than one remap, you can separate them with ; By now, your blast-data directory is probably starting to look messy with a mix of submit files, input data, log file and output data all intermingled. One improvement could be to map our outputs to a separate directory. Create a new directory named science-results . Add a transfer_output_remaps line to the submit file. It is common to place this line right after the transfer_output_files line. Change the transfer_output_files back to mouse.fa.result . Example: transfer_output_files = mouse.fa.result transfer_output_remaps = Fill out the remap line, mapping mouse.fa.result to the destination science-results/mouse.fa.result . Remember that the transfer_output_remaps value requires double quotes around it. Submit the job, and wait for it to complete. Are there any errors? Can you find mouse.fa.result? Conclusions \u00b6 In this exercise, you: Used your data requirements knowledge from the previous exercise to write a job. Executed the job on a remote worker node and took note of the data usage. Used transfer_input_files to transfer inputs Used transfer_output_files to transfer outputs Used transfer_output_remaps to map outputs to a different destination When you've completed the above, continue with the next exercise .","title":"1.2 - transfer_input_files, transfer_output_files, and remaps"},{"location":"materials/data/part1-ex2-file-transfer/#data-exercise-12-transfer_input_files-transfer_output_files-and-remaps","text":"","title":"Data Exercise 1.2: transfer_input_files, transfer_output_files, and remaps"},{"location":"materials/data/part1-ex2-file-transfer/#exercise-goal","text":"The objective of this exercise is to refresh yourself on HTCondor file transfer, to implement file compression, and to begin examining the memory and disk space used by your jobs in order to plan larger batches. We will also explore ways to deal with output data.","title":"Exercise Goal"},{"location":"materials/data/part1-ex2-file-transfer/#setup","text":"The executable we'll use in this exercise and later today is the same blastx executable from previous exercises. Log in to ap40: $ ssh @ap40.uw.osg-htc.org Then change into the blast-data folder that you created in the previous exercise.","title":"Setup"},{"location":"materials/data/part1-ex2-file-transfer/#review-htcondor-file-transfer","text":"Recall that OSG does NOT have a shared filesystem! Instead, HTCondor transfers your executable and input files (specified with the executable and transfer_input_files submit file directives, respectively) to a working directory on the execute node, regardless of how these files were arranged on the submit node. 
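Returning briefly to the transfer_output_remaps syntax above: if a job produced more than one output, the remaps are chained inside a single quoted value and separated by semicolons. A sketch with hypothetical file names:
transfer_output_files = result1.dat, result2.dat
transfer_output_remaps = "result1.dat = science-results/result1.dat; result2.dat = science-results/result2.dat"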
In this exercise we'll use the same blastx example job that we used previously, but modify the submit file and test how much memory and disk space it uses on the execute node.","title":"Review: HTCondor File Transfer"},{"location":"materials/data/part1-ex2-file-transfer/#start-with-a-test-submit-file","text":"We've started a submit file for you, below, which you'll add to in the remaining steps. executable = transfer_input_files = output = test.out error = test.err log = test.log request_memory = request_disk = request_cpus = 1 requirements = (OSGVO_OS_STRING == \"RHEL 9\") queue","title":"Start with a test submit file"},{"location":"materials/data/part1-ex2-file-transfer/#implement-file-compression","text":"In our first blast job from the Software exercises ( 1.1 ), the database files in the pdbaa directory were all transferred, as is, but we could instead transfer them as a single, compressed file using tar . For this version of the job, let's compress our blast database files to send them to the submit node as a single tar.gz file (otherwise known as a tarball), by following the below steps: Change into the pdbaa directory and compress the database files into a single file called pdbaa_files.tar.gz using the tar command. Note that this file will be different from the pdbaa.tar.gz file that you used earlier, because it will only contain the pdbaa files, and not the pdbaa directory, itself.) Remember, a typical command for creating a tar file is: user@ap40 $ tar -cvzf Replacing with the name of the tarball that you would like to create and with a space-separated list of files and/or directories that you want inside pdbaa_files.tar.gz. Move the resulting tarball to the blast-data directory. Create a wrapper script that will first decompress the pdbaa_files.tar.gz file, and then run blast. Because this file will now be our executable in the submit file, we'll also end up transferring the blastx executable with transfer_input_files . In the blast-data directory, create a new file, called blast_wrapper.sh , with the following contents: #!/bin/bash tar -xzvf pdbaa_files.tar.gz ./blastx -db pdbaa -query mouse.fa -out mouse.fa.result rm pdbaa.* Also remember to make the script executable: chmod +x blast_wrapper.sh Extra Files! The last line removes the resulting database files that came from pdbaa_files.tar.gz , as these files would otherwise be copied back to the submit server as perceived output since they're \"new\" files that HTCondor didn't transfer over as input.","title":"Implement file compression"},{"location":"materials/data/part1-ex2-file-transfer/#list-the-executable-and-input-files","text":"Make sure to update the submit file with the following: Add the new executable (the wrapper script you created above) In transfer_input_files , list the blastx binary, the pdbaa_files.tar.gz file, and the input query file. Commas, commas everywhere! Remember that transfer_input_files accepts a comma separated list of files, and that you need to list the full location of the blastx executable ( blastx ). There will be no arguments, since the arguments to the blastx command are now captured in the wrapper script.","title":"List the executable and input files"},{"location":"materials/data/part1-ex2-file-transfer/#predict-memory-and-disk-requests-from-your-data","text":"Also, think about how much memory and disk to request for this job. 
It's good to start with values that are a little higher than you think a test job will need, but think about: How much memory blastx would use if it loaded all of the database files and the query input file into memory. How much disk space will be necessary on the execute server for the executable, all input files, and all output files (hint: the log file only exists on the submit node). Whether you'd like to request some extra memory or disk space, just in case Look at the log file for your blastx job from Software exercise ( 1.1 ), and compare the memory and disk \"Usage\" to what you predicted from the files. Make sure to update the submit file with more accurate memory and disk requests. You may still want to request slightly more than the job actually used.","title":"Predict memory and disk requests from your data"},{"location":"materials/data/part1-ex2-file-transfer/#run-the-test-job","text":"Once you have finished editing the submit file, go ahead and submit the job. It should take a few minutes to complete, and then you can check to make sure that no unwanted files (especially the pdbaa database files) were copied back at the end of the job. Run a du -sh on the directory with this job's input. How does it compare to the directory from Software exercise ( 1.1 ), and why?","title":"Run the test job"},{"location":"materials/data/part1-ex2-file-transfer/#transfer_output_files","text":"So far, we have used HTCondor's new file detection to transfer back the newly created files. An alternative is to be explicit, using the transfer_output_files attribute in the submit file. The upside to this approach is that you can pick to only transfer back a subset of the created files. The downside is that you have to know which files are created. The first exercise is to modify the submit file from the previous example, and add a line like (remember, before the queue ): transfer_output_files = mouse.fa.result You may also remove the last line in the blast_wrapper.sh , the rm pdbaa.* as extra files are no longer an issue - those files will be ignored because we used transfer_output_files . Submit the job, and make sure everything works. Did you get any pdbaa.* files back? The next thing we should try is to see what happens if the file we specify does not exist. Modify your submit file, and change the transfer_output_files to: transfer_output_files = elephant.fa.result Submit the job and see how it behaves. Did it finish successfully?","title":"transfer_output_files"},{"location":"materials/data/part1-ex2-file-transfer/#transfer_output_remaps","text":"Related to transfer_output_files is transfer_output_remaps , which allows us to rename outputs, or map the outputs to a different storage system (will be explored in the next module). The format of the transfer_output_remaps attribute is a list of remaps, each remap taking the form of src=dst . The destination can be a local path, or a URL. For example: transfer_output_remaps = \"myresults.dat = s3://destination-server.com/myresults.dat\" If you have more than one remap, you can separate them with ; By now, your blast-data directory is probably starting to look messy with a mix of submit files, input data, log file and output data all intermingled. One improvement could be to map our outputs to a separate directory. Create a new directory named science-results . Add a transfer_output_remaps line to the submit file. It is common to place this line right after the transfer_output_files line. Change the transfer_output_files back to mouse.fa.result . 
Example: transfer_output_files = mouse.fa.result transfer_output_remaps = Fill out the remap line, mapping mouse.fa.result to the destination science-results/mouse.fa.result . Remember that the transfer_output_remaps value requires double quotes around it. Submit the job, and wait for it to complete. Are there any errors? Can you find mouse.fa.result?","title":"transfer_output_remaps"},{"location":"materials/data/part1-ex2-file-transfer/#conclusions","text":"In this exercise, you: Used your data requirements knowledge from the previous exercise to write a job. Executed the job on a remote worker node and took note of the data usage. Used transfer_input_files to transfer inputs Used transfer_output_files to transfer outputs Used transfer_output_remaps to map outputs to a different destination When you've completed the above, continue with the next exercise .","title":"Conclusions"},{"location":"materials/data/part1-ex3-blast-split/","text":"Data Exercise 1.3: Splitting Large Input for Better Throughput \u00b6 The objective of this exercise is to prepare for blasting a much larger input query file by splitting the input for greater throughput and lower memory and disk requirements. Splitting the input will also mean that we don't have to rely on additional large-data measures for the input query files. Setup \u00b6 Log in to ap40.uw.osg-htc.org Create a directory for this exercise named blast-split and change into it. Copy over the following files from the previous exercise : Your submit file blastx pdbaa_files.tar.gz blast_wrapper.sh Remember to modify the submit file for the new locations of the above files. Obtain the large input \u00b6 We've previously used blastx to analyze a relatively small input file of test data, mouse.fa , but let's imagine that you now need to blast a much larger dataset for your research. This dataset can be downloaded with the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse_rna.tar.gz After un-tar'ing ( tar xzf mouse_rna.tar.gz ) the file, you should be able to confirm that its size is roughly 100 MB. Not only is this near the size cutoff for HTCondor file transfer, it would take hours to complete a single blastx analysis for it and the resulting output file would be huge. Split the input file \u00b6 For blast , it's scientifically valid to split up the input query file, analyze the pieces, and then put the results back together at the end! On the other hand, BLAST databases should not be split, because the blast output includes a score value for each sequence that is calculated relative to the entire length of the database. Because genetic sequence data is used heavily across the life sciences, there are also tools for splitting up the data into smaller files. One of these is called genome tools , and you can download a package of precompiled binaries (just like BLAST) using the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/gt-1.5.10-Linux_x86_64-64bit-complete.tar.gz Un-tar the gt package ( tar -xzvf ... ), then run its sequence file splitter as follows, with the target file size of 1 MB: user@ap40 $ ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 1 mouse_rna.fa You'll notice that the result is a set of 100 files, all about the size of 1 MB, and numbered 1 through 100. Run Jobs on Split Input \u00b6 Now, you'll submit jobs on the split input files, where each job will use a different piece of the large original input file.
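Before editing the submit file, it can be worth a quick sanity check that the split produced what you expect; this is a sketch, assuming you ran the splitter in your blast-split directory:
user@ap40 $ ls mouse_rna.fa.* | wc -l     # should report 100 pieces
user@ap40 $ du -sh mouse_rna.fa.1         # each piece should be roughly 1 MB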
Modify the submit file \u00b6 First, you'll create a new submit file that passes the input filename as an argument and use a list of applicable filenames. Follow the below steps: Copy the submit file from the previous exercise to a new file called blast_split.sub and modify the \"queue\" line of the submit file to the following: queue inputfile matching mouse_rna.fa.* Replace the mouse.fa instances in the submit file with $(inputfile) , and rename the output, log, and error files to use the same inputfile variable: output = $(inputfile).out error = $(inputfile).err log = $(inputfile).log Add an arguments line to the submit file so it will pass the name of the input file to the wrapper script arguments = $(inputfile) Add the $(inputfile) to the end of your list of transfer_input_files : transfer_input_files = ... , $(inputfile) Remove or comment out transfer_output_files and transfer_output_remaps . Update the memory and disk requests, since the new input file is larger and will also produce larger output. It may be best to overestimate to something like 1 GB for each. Modify the wrapper file \u00b6 Replace instances of the input file name in the blast_wrapper.sh script so that it will insert the first argument in place of the input filename, like so: ./blastx -db pdbaa -query $1 -out $1.result Note Bash shell scripts will use the first argument in place of $1 , the second argument as $2 , etc. Submit the jobs \u00b6 This job will take a bit longer than the job in the last exercise, since the input file is larger (by about 3-fold). Again, make sure that only the desired output , error , and result files come back at the end of the job. In our tests, the jobs ran for ~15 minutes. Jobs on jobs! Be careful to not submit the job again. Why? Our queue statement says ... matching mouse_rna.fa.* , and look at the current directory. There are new files named mouse_rna.fa.X.log and other files. Submitting again, the queue statement would see these new files, and try to run blast on them! If you want to remove all of the extra files, you can try: user@ap40 $ rm *.err *.log *.out *.result Update the resource requests \u00b6 After the job finishes successfully, examine the log file for memory and disk usage, and update the requests in the submit file.","title":"1.3- Splitting input"},{"location":"materials/data/part1-ex3-blast-split/#data-exercise-13-splitting-large-input-for-better-throughput","text":"The objective of this exercise is to prepare for blasting a much larger input query file by splitting the input for greater throughput and lower memory and disk requirements. Splitting the input will also mean that we don't have to rely on additional large-data measures for the input query files.","title":"Data Exercise 1.3: Splitting Large Input for Better Throughput"},{"location":"materials/data/part1-ex3-blast-split/#setup","text":"Log in to ap40.uw.osg-htc.org Create a directory for this exercise named blast-split and change into it. Copy over the following files from the previous exercise : Your submit file blastx pdbaa_files.tar.gz blast_wrapper.sh Remember to modify the submit file for the new locations of the above files.","title":"Setup"},{"location":"materials/data/part1-ex3-blast-split/#obtain-the-large-input","text":"We've previously used blastx to analyze a relatively small input file of test data, mouse.fa , but let's imagine that you now need to blast a much larger dataset for your research. 
This dataset can be downloaded with the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse_rna.tar.gz After un-tar'ing ( tar xzf mouse_rna.tar.gz ) the file, you should be able to confirm that it's size is roughly 100 MB. Not only is this near the size cutoff for HTCondor file transfer, it would take hours to complete a single blastx analysis for it and the resulting output file would be huge.","title":"Obtain the large input"},{"location":"materials/data/part1-ex3-blast-split/#split-the-input-file","text":"For blast , it's scientifically valid to split up the input query file, analyze the pieces, and then put the results back together at the end! On the other hand, BLAST databases should not be split, because the blast output includes a score value for each sequence that is calculated relative to the entire length of the database. Because genetic sequence data is used heavily across the life sciences, there are also tools for splitting up the data into smaller files. One of these is called genome tools , and you can download a package of precompiled binaries (just like BLAST) using the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/gt-1.5.10-Linux_x86_64-64bit-complete.tar.gz Un-tar the gt package ( tar -xzvf ... ), then run its sequence file splitter as follows, with the target file size of 1MB: user@ap40 $ ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 1 mouse_rna.fa You'll notice that the result is a set of 100 files, all about the size of 1 MB, and numbered 1 through 100.","title":"Split the input file"},{"location":"materials/data/part1-ex3-blast-split/#run-a-jobs-on-split-input","text":"Now, you'll submit jobs on the split input files, where each job will use a different piece of the large original input file.","title":"Run a Jobs on Split Input"},{"location":"materials/data/part1-ex3-blast-split/#modify-the-submit-file","text":"First, you'll create a new submit file that passes the input filename as an argument and use a list of applicable filenames. Follow the below steps: Copy the submit file from the previous exercise to a new file called blast_split.sub and modify the \"queue\" line of the submit file to the following: queue inputfile matching mouse_rna.fa.* Replace the mouse.fa instances in the submit file with $(inputfile) , and rename the output, log, and error files to use the same inputfile variable: output = $(inputfile).out error = $(inputfile).err log = $(inputfile).log Add an arguments line to the submit file so it will pass the name of the input file to the wrapper script arguments = $(inputfile) Add the $(inputfile) to the end of your list of transfer_input_files : transfer_input_files = ... , $(inputfile) Remove or comment out transfer_output_files and transfer_output_remaps . Update the memory and disk requests, since the new input file is larger and will also produce larger output. 
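Pulling together the submit-file changes listed above, a sketch of what blast_split.sub might end up looking like is shown below; the resource requests are placeholders to adjust based on your own measurements:
executable = blast_wrapper.sh
arguments = $(inputfile)
transfer_input_files = blastx, pdbaa_files.tar.gz, $(inputfile)
output = $(inputfile).out
error = $(inputfile).err
log = $(inputfile).log
request_memory = 1GB
request_disk = 1GB
request_cpus = 1
requirements = (OSGVO_OS_STRING == "RHEL 9")
queue inputfile matching mouse_rna.fa.*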
It may be best to overestimate to something like 1 GB for each.","title":"Modify the submit file"},{"location":"materials/data/part1-ex3-blast-split/#modify-the-wrapper-file","text":"Replace instances of the input file name in the blast_wrapper.sh script so that it will insert the first argument in place of the input filename, like so: ./blastx -db pdbaa -query $1 -out $1.result Note Bash shell scripts will use the first argument in place of $1 , the second argument as $2 , etc.","title":"Modify the wrapper file"},{"location":"materials/data/part1-ex3-blast-split/#submit-the-jobs","text":"This job will take a bit longer than the job in the last exercise, since the input file is larger (by about 3-fold). Again, make sure that only the desired output , error , and result files come back at the end of the job. In our tests, the jobs ran for ~15 minutes. Jobs on jobs! Be careful not to submit the job again. Why? Our queue statement says ... matching mouse_rna.fa.* , and look at the current directory. There are new files named mouse_rna.fa.X.log and other files. Submitting again, the queue statement would see these new files, and try to run blast on them! If you want to remove all of the extra files, you can try: user@ap40 $ rm *.err *.log *.out *.result","title":"Submit the jobs"},{"location":"materials/data/part1-ex3-blast-split/#update-the-resource-requests","text":"After the job finishes successfully, examine the log file for memory and disk usage, and update the requests in the submit file.","title":"Update the resource requests"},{"location":"materials/data/part2-ex1-osdf-inputs/","text":"Data Exercise 2.1: Using OSDF for Large Shared Data \u00b6 This exercise will use a BLAST workflow to demonstrate the functionality of OSDF for transferring input files to jobs on OSG. Because our individual blast jobs from previous exercises would take a bit longer with a larger database (too long for a workable exercise), we'll imagine for this exercise that our pdbaa_files.tar.gz file is too large for transfer_input_files (larger than ~1 GB). For this exercise, we will use the same inputs, but instead of using transfer_input_files for the pdbaa database, we will place it in OSDF and have the jobs download from there. OSDF is connected to a distributed set of caches spread across the U.S. They are connected with high-bandwidth connections to each other, and to the data origin servers, where your data is originally placed. Setup \u00b6 Make sure you're logged in to ap40.uw.osg-htc.org Copy the following files from the previous Blast exercises to a new directory in /home/ called osdf-shared : blast_wrapper.sh blastx mouse_rna.fa.1 mouse_rna.fa.2 mouse_rna.fa.3 Your most recent submit file (probably named blast_split.sub ) Place the Database in OSDF \u00b6 Copy your data to the OSDF space \u00b6 OSDF provides a directory for you to store data which can be accessed through the caching servers. First, you need to move your BLAST database ( pdbaa_files.tar.gz ) into this directory. For ap40.uw.osg-htc.org , the directory to use is /ospool/ap40/data/[USERNAME]/ Note that files placed in the /ospool/ap40/data/[USERNAME]/ directory will only be accessible by your own jobs. Modify the Submit File and Wrapper \u00b6 You will have to modify the wrapper and submit file to use OSDF: HTCondor knows how to do OSDF transfers, so you just have to provide the correct URL in transfer_input_files .
Note there is no servername (3 slashes in :///); the path is instead based on the namespace ( /ospool/ap40 in this case): transfer_input_files = blastx, $(inputfile), osdf:///ospool/ap40/data/[USERNAME]/pdbaa_files.tar.gz Confirm that your queue statement is correct for the current directory. It should be something like: queue inputfile matching mouse_rna.fa.* And that mouse_rna.fa.* files exist in the current directory (you should have copied a few of them from the previous exercise directory). Submit the Job \u00b6 Now submit and monitor the job! If your 100 jobs from the previous exercise haven't started running yet, this job will not yet start. However, after it has been running for ~2 minutes, you're safe to continue to the next exercise! Considerations \u00b6 Why did we not place all files in OSDF (for example, blastx and mouse_rna.fa.* )? What do you think will happen if you make changes to pdbaa_files.tar.gz ? Will the caches be updated automatically, or is there a possibility that the old version of pdbaa_files.tar.gz will be served up to jobs? What is the solution to this problem? (Hint: OSDF only considers the filename when caching data) Note: Keeping OSDF 'Clean' \u00b6 Just as for any data directory, it is VERY important to remove old files from OSDF when you no longer need them, especially so that you'll have plenty of space for such files in the future. For example, you would delete ( rm ) files from /ospool/ap40/data/[USERNAME]/ when you don't need them there anymore, but only after all jobs have finished. The next time you use OSDF after the school, remember to first check for old files that you can delete. Next exercise \u00b6 Once completed, move on to the next exercise: Using OSDF for outputs","title":"2.1 - OSDF for inputs"},{"location":"materials/data/part2-ex1-osdf-inputs/#data-exercise-21-using-osdf-for-large-shared-data","text":"This exercise will use a BLAST workflow to demonstrate the functionality of OSDF for transferring input files to jobs on OSG. Because our individual blast jobs from previous exercises would take a bit longer with a larger database (too long for a workable exercise), we'll imagine for this exercise that our pdbaa_files.tar.gz file is too large for transfer_input_files (larger than ~1 GB). For this exercise, we will use the same inputs, but instead of using transfer_input_files for the pdbaa database, we will place it in OSDF and have the jobs download from there. OSDF is connected to a distributed set of caches spread across the U.S. They are connected with high bandwidth connections to each other, and to the data origin servers, where your data is originally placed.","title":"Data Exercise 2.1: Using OSDF for Large Shared Data"},{"location":"materials/data/part2-ex1-osdf-inputs/#setup","text":"Make sure you're logged in to ap40.uw.osg-htc.org Copy the following files from the previous Blast exercises to a new directory in /home/ called osdf-shared : blast_wrapper.sh blastx mouse_rna.fa.1 mouse_rna.fa.2 mouse_rna.fa.3 Your most recent submit file (probably named blast_split.sub )","title":"Setup"},{"location":"materials/data/part2-ex1-osdf-inputs/#place-the-database-in-osdf","text":"","title":"Place the Database in OSDF"},{"location":"materials/data/part2-ex1-osdf-inputs/#copy-to-your-data-to-the-osdf-space","text":"OSDF provides a directory for you to store data which can be accessed through the caching servers. First, you need to move your BLAST database ( pdbaa_files.tar.gz ) into this directory.
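One way to do that copy is a plain cp into the OSDF namespace directory described above; this is a sketch that assumes pdbaa_files.tar.gz is still in your blast-data directory (adjust the source path to wherever your copy lives):
user@ap40 $ cp ~/blast-data/pdbaa_files.tar.gz /ospool/ap40/data/[USERNAME]/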
For ap40.uw.osg-htc.org , the directory to use is /ospool/ap40/data/[USERNAME]/ Note that files placed in the /ospool/ap40/data/[USERNAME]/ directory will only be accessible by your own jobs.","title":"Copy to your data to the OSDF space"},{"location":"materials/data/part2-ex1-osdf-inputs/#modify-the-submit-file-and-wrapper","text":"You will have to modify the wrapper and submit file to use OSDF: HTCondor knows how to do OSDF transfers, so you just have to provide the correct URL in transfer_input_files . Note there is no servername (3 slashes in :///) and we instead is is just based on namespace ( /ospool/ap40 in this case): transfer_input_files = blastx, $(inputfile), osdf:///ospool/ap40/data/[USERNAME]/pdbaa_files.tar.gz Confirm that your queue statement is correct for the current directory. It should be something like: queue inputfile matching mouse_rna.fa.* And that mouse_rna.fa.* files exist in the current directory (you should have copied a few them from the previous exercise directory).","title":"Modify the Submit File and Wrapper"},{"location":"materials/data/part2-ex1-osdf-inputs/#submit-the-job","text":"Now submit and monitor the job! If your 100 jobs from the previous exercise haven't started running yet, this job will not yet start. However, after it has been running for ~2 minutes, you're safe to continue to the next exercise!","title":"Submit the Job"},{"location":"materials/data/part2-ex1-osdf-inputs/#considerations","text":"Why did we not place all files in OSDF (for example, blastx and mouse_rna.fa.* )? What do you think will happen if you make changes to pdbaa_files.tar.gz ? Will the caches be updated automatically, or is there a possiblility that the old version of pdbaa_files.tar.gz will be served up to jobs? What is the solution to this problem? (Hint: OSDF only considers the filename when caching data)","title":"Considerations"},{"location":"materials/data/part2-ex1-osdf-inputs/#note-keeping-osdf-clean","text":"Just as for any data directory, it is VERY important to remove old files from OSDF when you no longer need them, especially so that you'll have plenty of space for such files in the future. For example, you would delete ( rm ) files from /ospool/ap40/data/[USERNAME]/ on when you don't need them there anymore, but only after all jobs have finished. The next time you use OSDF after the school, remember to first check for old files that you can delete.","title":"Note: Keeping OSDF 'Clean'"},{"location":"materials/data/part2-ex1-osdf-inputs/#next-exercise","text":"Once completed, move onto the next exercise: Using OSDF for outputs","title":"Next exercise"},{"location":"materials/data/part2-ex2-osdf-outputs/","text":"Data Exercise 2.2: Using OSDF for outputs \u00b6 In this exercise, we will run a multimedia program that converts and manipulates video files. In particular, we want to convert large .mov files to smaller (10-100s of MB) mp4 files. Just like the Blast database in the previous exercise , these video files are potentially too large to send to jobs using HTCondor's default file transfer for inputs/outputs, so we will use OSDF. Data \u00b6 To get the exercise set up: Log into ap40.uw.osg-htc.org Create a directory for this exercise named osdf-outputs and change into it. 
Download the input data and store it under the OSDF directory ( cd to that directory first): user@ap40 $ cd /ospool/ap40/data/ [ USERNAME ] / user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ducks.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/teaching.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/test_open_terminal.mov We're going to need a list of these files later. Below is the final list of movie files. cd back to your osdf-outputs directory and create a file named movie_list.txt , with the following content: ducks.mov teaching.mov test_open_terminal.mov Software \u00b6 We'll be using a multi-purpose media tool called ffmpeg to convert video formats. The basic command to convert a file looks like this: user@ap40 $ ./ffmpeg -i input.mov output.mp4 In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting file is smaller. user@ap40 $ ./ffmpeg -i input.mp4 -b:v 400k -s 640x360 output.mp4 To get the ffmpeg binary do the following: We'll be downloading the ffmpeg pre-built static binary originally from this page: http://johnvansickle.com/ffmpeg/ . user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ffmpeg-release-64bit-static.tar.xz Once the binary is downloaded, un-tar it, and then copy the main ffmpeg program into your current directory: user@ap40 $ tar -xf ffmpeg-release-64bit-static.tar.xz user@ap40 $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./ Script \u00b6 We want to write a script that runs on the worker node that uses ffmpeg to convert a .mov file to a smaller format. Our script will need to run the proper executable. Create a file called run_ffmpeg.sh , that does the steps described above. Use the name of the smallest .mov file in the ffmpeg command. An example of that script is below: #!/bin/bash ./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4 Ultimately we'll want to submit several jobs (one for each .mov file), but to start with, we'll run one job to make sure that everything works. Remember to chmod +x run_ffmpeg.sh to make the script executable. Submit File \u00b6 Create a submit file for this job, based on other submit files from the school. Things to consider: We'll be copying the video file into the job's working directory from OSDF, so make sure to request enough disk space for the input mov file and the output mp4 file. If you're aren't sure how much to request, ask a helper. Add the same requirements as the previous exercise: requirements = (OSGVO_OS_STRING == \"RHEL 9\") We need to transfer the ffmpeg program that we downloaded above, and the movie from OSDF: transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mov Transfer outputs via OSDF. This requires a transfer remap: transfer_output_files = test_open_terminal.mp4 transfer_output_remaps = \"test_open_terminal.mp4 = osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mp4\" Initial Job \u00b6 With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected: Check the OSDF directory. Did the output .mp4 file return? Check file sizes. How big is the returned .mp4 file? How does that compare to the original .mov input? If your job successfully returned the converted .mp4 file and did not transfer the .mov file to the submit server, and the .mp4 file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded to OSDF. 
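For reference, a sketch of what that single-job submit file might have looked like, combining the pieces listed above, is shown here; the output, error, and log file names and the resource requests are only examples to adapt to your own files:
executable = run_ffmpeg.sh
transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mov
transfer_output_files = test_open_terminal.mp4
transfer_output_remaps = "test_open_terminal.mp4 = osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mp4"
output = ffmpeg.out
error = ffmpeg.err
log = ffmpeg.log
request_memory = 1GB
request_disk = 2GB
request_cpus = 1
requirements = (OSGVO_OS_STRING == "RHEL 9")
queue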
Multiple jobs \u00b6 We wrote the name of the .mov file into our run_ffmpeg.sh executable script. To submit a set of jobs for all of our .mov files, what will we need to change in: The script? The submit file? Once you've thought about it, check your reasoning against the instructions below. Add an argument to your script \u00b6 Look at your run_ffmpeg.sh script. What values will change for every job? The input file will change with every job - and don't forget that the output file will too! Let's make them both into arguments. To add arguments to a bash script, we use the notation $1 for the first argument (our input file) and $2 for the second argument (our output file name). The final script should look like this: #!/bin/bash ./ffmpeg -i $1 -b:v 400k -s 640x360 $2 Modify your submit file \u00b6 We now need to tell each job what arguments to use. We will do this by adding an arguments line to our submit file. Because we'll only have the input file name, the \"output\" file name will be the input file name with the mp4 extension. That should look like this: arguments = $(mov) $(mov).mp4 Update the transfer_input_files to have $(mov) : transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/$(mov) Similarly, update the output/remap with $(mov).mp4 : transfer_output_files = $(mov).mp4 transfer_output_remaps = \"$(mov).mp4 = osdf:///ospool/ap40/data/[USERNAME]/$(mov).mp4\" To set these arguments, we will use the queue .. from syntax. In our submit file, we can then change our queue statement to: queue mov from movie_list.txt Once you've made these changes, try submitting all the jobs! Bonus \u00b6 If you wanted to set a different output file name, bitrate and/or size for each original movie, how could you modify: movie_list.txt Your submit file run_ffmpeg.sh to do so? Show hint Here's the changes you can make to the various files: movie_list.txt ducks.mov ducks.mp4 500k 1280x720 teaching.mov teaching.mp4 400k 320x180 test_open_terminal.mov terminal.mp4 600k 640x360 Submit file arguments = $(mov) $(mp4) $(bitrate) $(size) queue mov,mp4,bitrate,size from movie_list.txt run_ffmpeg.sh 1 2 #!/bin/bash ./ffmpeg -i $1 -b:v $3 -s $4 $2","title":"2.2 - OSDF for outputs"},{"location":"materials/data/part2-ex2-osdf-outputs/#data-exercise-22-using-osdf-for-outputs","text":"In this exercise, we will run a multimedia program that converts and manipulates video files. In particular, we want to convert large .mov files to smaller (10-100s of MB) mp4 files. Just like the Blast database in the previous exercise , these video files are potentially too large to send to jobs using HTCondor's default file transfer for inputs/outputs, so we will use OSDF.","title":"Data Exercise 2.2: Using OSDF for outputs"},{"location":"materials/data/part2-ex2-osdf-outputs/#data","text":"To get the exercise set up: Log into ap40.uw.osg-htc.org Create a directory for this exercise named osdf-outputs and change into it. Download the input data and store it under the OSDF directory ( cd to that directory first): user@ap40 $ cd /ospool/ap40/data/ [ USERNAME ] / user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ducks.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/teaching.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/test_open_terminal.mov We're going to need a list of these files later. Below is the final list of movie files. 
cd back to your osdf-outputs directory and create a file named movie_list.txt , with the following content: ducks.mov teaching.mov test_open_terminal.mov","title":"Data"},{"location":"materials/data/part2-ex2-osdf-outputs/#software","text":"We'll be using a multi-purpose media tool called ffmpeg to convert video formats. The basic command to convert a file looks like this: user@ap40 $ ./ffmpeg -i input.mov output.mp4 In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting file is smaller. user@ap40 $ ./ffmpeg -i input.mp4 -b:v 400k -s 640x360 output.mp4 To get the ffmpeg binary do the following: We'll be downloading the ffmpeg pre-built static binary originally from this page: http://johnvansickle.com/ffmpeg/ . user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ffmpeg-release-64bit-static.tar.xz Once the binary is downloaded, un-tar it, and then copy the main ffmpeg program into your current directory: user@ap40 $ tar -xf ffmpeg-release-64bit-static.tar.xz user@ap40 $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./","title":"Software"},{"location":"materials/data/part2-ex2-osdf-outputs/#script","text":"We want to write a script that runs on the worker node that uses ffmpeg to convert a .mov file to a smaller format. Our script will need to run the proper executable. Create a file called run_ffmpeg.sh , that does the steps described above. Use the name of the smallest .mov file in the ffmpeg command. An example of that script is below: #!/bin/bash ./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4 Ultimately we'll want to submit several jobs (one for each .mov file), but to start with, we'll run one job to make sure that everything works. Remember to chmod +x run_ffmpeg.sh to make the script executable.","title":"Script"},{"location":"materials/data/part2-ex2-osdf-outputs/#submit-file","text":"Create a submit file for this job, based on other submit files from the school. Things to consider: We'll be copying the video file into the job's working directory from OSDF, so make sure to request enough disk space for the input mov file and the output mp4 file. If you're aren't sure how much to request, ask a helper. Add the same requirements as the previous exercise: requirements = (OSGVO_OS_STRING == \"RHEL 9\") We need to transfer the ffmpeg program that we downloaded above, and the movie from OSDF: transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mov Transfer outputs via OSDF. This requires a transfer remap: transfer_output_files = test_open_terminal.mp4 transfer_output_remaps = \"test_open_terminal.mp4 = osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mp4\"","title":"Submit File"},{"location":"materials/data/part2-ex2-osdf-outputs/#initial-job","text":"With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected: Check the OSDF directory. Did the output .mp4 file return? Check file sizes. How big is the returned .mp4 file? How does that compare to the original .mov input? If your job successfully returned the converted .mp4 file and did not transfer the .mov file to the submit server, and the .mp4 file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded to OSDF.","title":"Initial Job"},{"location":"materials/data/part2-ex2-osdf-outputs/#multiple-jobs","text":"We wrote the name of the .mov file into our run_ffmpeg.sh executable script. 
To submit a set of jobs for all of our .mov files, what will we need to change in: The script? The submit file? Once you've thought about it, check your reasoning against the instructions below.","title":"Multiple jobs"},{"location":"materials/data/part2-ex2-osdf-outputs/#add-an-argument-to-your-script","text":"Look at your run_ffmpeg.sh script. What values will change for every job? The input file will change with every job - and don't forget that the output file will too! Let's make them both into arguments. To add arguments to a bash script, we use the notation $1 for the first argument (our input file) and $2 for the second argument (our output file name). The final script should look like this: #!/bin/bash ./ffmpeg -i $1 -b:v 400k -s 640x360 $2","title":"Add an argument to your script"},{"location":"materials/data/part2-ex2-osdf-outputs/#modify-your-submit-file","text":"We now need to tell each job what arguments to use. We will do this by adding an arguments line to our submit file. Because we'll only have the input file name, the \"output\" file name will be the input file name with the mp4 extension. That should look like this: arguments = $(mov) $(mov).mp4 Update the transfer_input_files to have $(mov) : transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/$(mov) Similarly, update the output/remap with $(mov).mp4 : transfer_output_files = $(mov).mp4 transfer_output_remaps = \"$(mov).mp4 = osdf:///ospool/ap40/data/[USERNAME]/$(mov).mp4\" To set these arguments, we will use the queue .. from syntax. In our submit file, we can then change our queue statement to: queue mov from movie_list.txt Once you've made these changes, try submitting all the jobs!","title":"Modify your submit file"},{"location":"materials/data/part2-ex2-osdf-outputs/#bonus","text":"If you wanted to set a different output file name, bitrate and/or size for each original movie, how could you modify: movie_list.txt Your submit file run_ffmpeg.sh to do so? Show hint Here are the changes you can make to the various files: movie_list.txt ducks.mov ducks.mp4 500k 1280x720 teaching.mov teaching.mp4 400k 320x180 test_open_terminal.mov terminal.mp4 600k 640x360 Submit file arguments = $(mov) $(mp4) $(bitrate) $(size) queue mov,mp4,bitrate,size from movie_list.txt run_ffmpeg.sh #!/bin/bash ./ffmpeg -i $1 -b:v $3 -s $4 $2","title":"Bonus"},{"location":"materials/htcondor/part1-ex1-login/","text":"HTC Exercise 1.1: Log In and Look Around \u00b6 Background \u00b6 There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. During the OSG School, you will practice on two different HTC systems: the \" PATh Facility \" and \" OSG's Open Science Pool (OSPool) \". This will help prepare you for working on a variety of different HTC systems. PATh Facility: The PATh Facility provides researchers with dedicated HTC resources and the ability to run larger and longer jobs .
The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. OSG's Open Science Pool: The OSPool provides researchers with opportunistic resources and the ability to run many smaller and shorter jobs simultaneously . The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. Exercise Goal \u00b6 The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises. Logging In \u00b6 Today, you will use a High Throughput Computing system known as the \" PATh Facility \". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. You will log in to the access point of the PATh Facility, which is called ap1.facility.path-cc.io using the username you previously created. To log in, use a Secure Shell (SSH) client. From a Mac or Linux computer, start the Terminal app and run the below ssh command, replacing with your username: $ ssh @ap1.facility.path-cc.io On Windows, we recommend a free client called PuTTY , but any SSH client should be fine. If you need help finding or using an SSH client, ask the instructors for help right away ! Running Commands \u00b6 In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: username@ap1 $ hostname path-ap2001 Note In the first line of the example above, the username@ap1 $ part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters hostname . The second line of the example, without the prompt, shows the output of the command; you do not type this part, either. Here are a few other commands that you can try (the examples below do not show the output from each command): username@ap1 $ whoami username@ap1 $ date username@ap1 $ uname -a A suggestion for the day: try typing into the command line as many of the commands as you can. Copy-and-paste is fine, of course, but you WILL learn more if you take the time to type each command yourself. Organizing Your Workspace \u00b6 You will be doing many different exercises over the next few days, many of them on this access point. Each exercise may use many files, once finished. To avoid confusion, it may be useful to create a separate directory for each exercise. For instance, for the rest of this exercise, you may wish to create and use a directory named intro-1.1-login , or something like that. username@ap1 $ mkdir intro-1.1-login username@ap1 $ cd intro-1.1-login Showing the Version of HTCondor \u00b6 HTCondor is installed on this server. But what version? You can ask HTCondor itself: username@ap1 $ condor_version $ CondorVersion: 23.9.0 2024-06-27 BuildID: 742143 PackageID: 23.9.0-0.742143 GitSHA: 68fde429 RC $ $ CondorPlatform: x86_64_AlmaLinux8 $ As you can see from the output, we are using HTCondor 23.9.0. Reference Materials \u00b6 Here are a few links to reference materials that might be interesting after the school (or perhaps during). HTCondor manuals ; it is probably best to read the manual corresponding to the version of HTCondor that you use.
That link points to the latest version of the manual, but you can switch versions using the toggle in the lower left corner of that page.","title":"1.1 - Log in and look around"},{"location":"materials/htcondor/part1-ex1-login/#htc-exercise-11-log-in-and-look-around","text":"","title":"HTC Exercise 1.1: Log In and Look Around"},{"location":"materials/htcondor/part1-ex1-login/#background","text":"There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. Durring the OSG School, you will practice on two different HTC systems: the \" PATh Facility \" and \" OSG's Open Science Pool (OSPool) \". This will help prepare you for working on a variety of different HTC systems. PATh Facility: The PATh Facility provides researchers with dedicated HTC resources and the ability to run larger and longer jobs . The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. OSG's Open Science Pool: The OSPool provides researchers with opportunitistic resources and the ability to run many smaller and shorter jobs silmnulatinously . The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs.","title":"Background"},{"location":"materials/htcondor/part1-ex1-login/#exercise-goal","text":"The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex1-login/#logging-in","text":"Today, you will use a High Throughput Computing system known as the \" PATh Facility \". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. You will login to the access point of the PATh Facility, which is called ap1.facility.path-cc.io using the username you previously created. To log in, use a Secure Shell (SSH) client. From a Mac or Linux computer, start the Terminal app and run the below ssh command, replacing with your username: $ ssh @ap1.facility.path-cc.io On Windows, we recommend a free client called PuTTY , but any SSH client should be fine. If you need help finding or using an SSH client, ask the instructors for help right away !","title":"Logging In"},{"location":"materials/htcondor/part1-ex1-login/#running-commands","text":"In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: username@ap1 $ hostname path-ap2001 Note In the first line of the example above, the username@ap1 $ part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters hostname . The second line of the example, without the prompt, shows the output of the command; you do not type this part, either. 
Here are a few other commands that you can try (the examples below do not show the output from each command): username@ap1 $ whoami username@ap1 $ date username@ap1 $ uname -a A suggestion for the day: try typing into the command line as many of the commands as you can. Copy-and-paste is fine, of course, but you WILL learn more if you take the time to type each command yourself.","title":"Running Commands"},{"location":"materials/htcondor/part1-ex1-login/#organizing-your-workspace","text":"You will be doing many different exercises over the next few days, many of them on this access point. Each exercise may use and leave behind many files. To avoid confusion, it may be useful to create a separate directory for each exercise. For instance, for the rest of this exercise, you may wish to create and use a directory named intro-1.1-login , or something like that. username@ap1 $ mkdir intro-1.1-login username@ap1 $ cd intro-1.1-login","title":"Organizing Your Workspace"},{"location":"materials/htcondor/part1-ex1-login/#showing-the-version-of-htcondor","text":"HTCondor is installed on this server. But what version? You can ask HTCondor itself: username@ap1 $ condor_version $ CondorVersion: 23.9.0 2024-06-27 BuildID: 742143 PackageID: 23.9.0-0.742143 GitSHA: 68fde429 RC $ $ CondorPlatform: x86_64_AlmaLinux8 $ As you can see from the output, we are using HTCondor 23.9.0.","title":"Showing the Version of HTCondor"},{"location":"materials/htcondor/part1-ex1-login/#reference-materials","text":"Here are a few links to reference materials that might be interesting after the school (or perhaps during). HTCondor manuals ; it is probably best to read the manual corresponding to the version of HTCondor that you use. That link points to the latest version of the manual, but you can switch versions using the toggle in the lower left corner of that page.","title":"Reference Materials"},{"location":"materials/htcondor/part1-ex2-commands/","text":"HTC Exercise 1.2: Experiment With HTCondor Commands \u00b6 Exercise Goal \u00b6 The goal of this exercise is to learn about two very important HTCondor commands, condor_q and condor_status . They will be useful for monitoring your jobs and available execute point slots (respectively) throughout the week. This exercise should take only a few minutes. Viewing Slots \u00b6 As discussed in the lecture, the condor_status command is used to view the current state of slots in an HTCondor pool. At its most basic, the command is: username@ap1 $ condor_status When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. TIP: You can widen your terminal window, which may help you to see all details of the output better. 
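If the full listing scrolls by faster than you can read it, you can also page through the output or ask for just the summary counts; for example (one possible approach, not the only one): username@ap1 $ condor_status | less username@ap1 $ condor_status -total The -total option skips the per-slot lines and prints only the summary table described below. 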
Here is some example output (what you see will be longer): slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 This output consists of 8 columns: Col Example Meaning Name slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n Full slot name (including the hostname) OpSys LINUX Operating system Arch X86_64 Slot architecture (e.g., Intel 64 bit) State Claimed State of the slot ( Unclaimed is available, Owner is being used by the machine owner, Claimed is matched to a job) Activity Busy Is there activity on the slot? LoadAv 0.930 Load average, a measure of CPU activity on the slot Mem 1024 Memory available to the slot, in MB ActvtyTime 0+02:42:08 Amount of time spent in current activity (days + hours:minutes:seconds) At the end of the slot listing, there is a summary. Here is an example: Machines Owner Claimed Unclaimed Matched Preempting Drain X86_64/LINUX 10831 0 10194 631 0 0 6 X86_64/WINDOWS 2 2 0 0 0 0 0 Total 10833 2 10194 631 0 0 6 There is one row of summary for each machine (i.e. \"slot\") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool. Questions: \u00b6 When you run condor_status , how many 64-bit Linux slots are available? (Hint: Unclaimed = available.) What percent of the total slots are currently claimed by a job? (Note: there is a rapid turnover of slots, which is what allows users with new submission to have jobs start quickly.) How have these numbers changed (if at all) when you run the condor_status command again? Viewing Whole Machines, Only \u00b6 Also try out the -compact for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. username@ap1 $ condor_status -compact How has the column information changed? Viewing Jobs \u00b6 The condor_q command lists jobs that are on this access point machine and that are running or waiting to run. The _q part of the name is meant to suggest the word \u201cqueue\u201d, or list of job sets waiting to finish. Viewing Your Own Jobs \u00b6 The default behavior of the command lists only your jobs: username@ap1 $ condor_q The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set (\"batch\") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 09:59:31 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 This output consists of 8 (or 9) columns: Col Example Meaning OWNER alice The user ID of the user who submitted the job BATCH_NAME run_ffmpeg.sh The executable or \"jobbatchname\" specified within the submit file(s) SUBMITTED 7/12 09:58 The date and time when the job was submitted DONE _ Number of jobs in this batch that have completed RUN _ Number of jobs in this batch that are currently running IDLE 1 Number of jobs in this batch that are idle, waiting for a match HOLD _ Column will show up if there are jobs on \"hold\" because something about the submission/setup needs to be corrected by the user TOTAL 1 Total number of jobs in this batch JOB_IDS 18801.0 Job ID or range of Job IDs in this batch At the end of the job listing, there is a summary. Here is a sample: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended It shows total counts of jobs in the different possible states. Questions: For the sample above, when was the job submitted? For the sample above, was the job running or not yet? How can you tell? Viewing Everyone\u2019s Jobs \u00b6 By default, the condor_q command shows your jobs only. To see everyone\u2019s jobs that are queued on the machine, add the -all option: username@ap1 $ condor_q -all How many jobs are queued in total (i.e., running or waiting to run)? How many jobs from this submit machine are running right now? Viewing Jobs without the Default \"batch\" Mode \u00b6 The condor_q output, by default, groups \"batches\" of jobs together (if they were submitted with the same submit file or \"jobbatchname\"). To see more information for EVERY job on a separate line of output, use the -nobatch option to condor_q : username@ap1 $ condor_q -all -nobatch How has the column information changed? (Below is an example of the top of the output.) -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 11:58:44 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18801.0 alice 7/12 09:58 0+00:00:00 I 0 0.0 run_ffmpeg.sh 18997.0 s16_martincum 7/12 10:59 0+00:00:32 I 0 733.0 runR.pl 1_0 run_perm.R 1 0 10 19027.5 s16_martincum 7/12 11:06 0+00:09:20 I 0 2198.0 runR.pl 1_5 run_perm.R 1 5 1000 The -nobatch output shows a line for every job and consists of 8 columns: Col Example Meaning ID 18801.0 Job ID, which is the cluster , a dot character ( . ), and the process OWNER alice The user ID of the user who submitted the job SUBMITTED 7/12 09:58 The date and time when the job was submitted RUN_TIME 0+00:00:00 Total time spent running so far (days + hours:minutes:seconds) ST I Status of job: I is Idle (waiting to run), R is Running, H is Held, etc. PRI 0 Job priority (see next lecture) SIZE 0.0 Current run-time memory usage, in MB CMD run_ffmpeg.sh The executable command (with arguments) to be run In future exercises, you'll want to switch between condor_q and condor_q -nobatch to see different types of information about YOUR jobs. Extra Information \u00b6 Both condor_status and condor_q have many command-line options, some of which significantly change their output. You will explore a few of the most useful options in future exercises, but if you want to experiment now, go ahead! 
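For instance, two options you might experiment with (shown here as suggestions; check the built-in help to confirm they are available on your system) are: username@ap1 $ condor_q -all -run which lists running jobs along with the execute point host where each one is running, and username@ap1 $ condor_status -avail which shows only slots that are currently available to run jobs. 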
There are a few ways to learn more about the commands: Use the (brief) built-in help for the commands, e.g.: condor_q -h Read the installed man(ual) pages for the commands, e.g.: man condor_q Find the command in the online manual ; note: the text online is the same as the man text, only formatted for the web","title":"1.2 - Experiment with HTCondor commands"},{"location":"materials/htcondor/part1-ex2-commands/#htc-exercise-12-experiment-with-htcondor-commands","text":"","title":"HTC Exercise 1.2: Experiment With HTCondor Commands"},{"location":"materials/htcondor/part1-ex2-commands/#exercise-goal","text":"The goal of this exercise is to learn about two very important HTCondor commands, condor_q and condor_status . They will be useful for monitoring your jobs and available execute point slots (respectively) throughout the week. This exercise should take only a few minutes.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-slots","text":"As discussed in the lecture, the condor_status command is used to view the current state of slots in an HTCondor pool. At its most basic, the command is: username@ap1 $ condor_status When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. TIP: You can widen your terminal window, which may help you to see all details of the output better. Here is some example output (what you see will be longer): slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 This output consists of 8 columns: Col Example Meaning Name slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n Full slot name (including the hostname) OpSys LINUX Operating system Arch X86_64 Slot architecture (e.g., Intel 64 bit) State Claimed State of the slot ( Unclaimed is available, Owner is being used by the machine owner, Claimed is matched to a job) Activity Busy Is there activity on the slot? LoadAv 0.930 Load average, a measure of CPU activity on the slot Mem 1024 Memory available to the slot, in MB ActvtyTime 0+02:42:08 Amount of time spent in current activity (days + hours:minutes:seconds) At the end of the slot listing, there is a summary. Here is an example: Machines Owner Claimed Unclaimed Matched Preempting Drain X86_64/LINUX 10831 0 10194 631 0 0 6 X86_64/WINDOWS 2 2 0 0 0 0 0 Total 10833 2 10194 631 0 0 6 There is one row of summary for each machine (i.e. \"slot\") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool.","title":"Viewing Slots"},{"location":"materials/htcondor/part1-ex2-commands/#questions","text":"When you run condor_status , how many 64-bit Linux slots are available? (Hint: Unclaimed = available.) What percent of the total slots are currently claimed by a job? (Note: there is a rapid turnover of slots, which is what allows users with new submission to have jobs start quickly.) 
How have these numbers changed (if at all) when you run the condor_status command again?","title":"Questions:"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-whole-machines-only","text":"Also try out the -compact for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. username@ap1 $ condor_status -compact How has the column information changed?","title":"Viewing Whole Machines, Only"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-jobs","text":"The condor_q command lists jobs that are on this access point machine and that are running or waiting to run. The _q part of the name is meant to suggest the word \u201cqueue\u201d, or list of job sets waiting to finish.","title":"Viewing Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-your-own-jobs","text":"The default behavior of the command lists only your jobs: username@ap1 $ condor_q The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set (\"batch\") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 09:59:31 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 This output consists of 8 (or 9) columns: Col Example Meaning OWNER alice The user ID of the user who submitted the job BATCH_NAME run_ffmpeg.sh The executable or \"jobbatchname\" specified within the submit file(s) SUBMITTED 7/12 09:58 The date and time when the job was submitted DONE _ Number of jobs in this batch that have completed RUN _ Number of jobs in this batch that are currently running IDLE 1 Number of jobs in this batch that are idle, waiting for a match HOLD _ Column will show up if there are jobs on \"hold\" because something about the submission/setup needs to be corrected by the user TOTAL 1 Total number of jobs in this batch JOB_IDS 18801.0 Job ID or range of Job IDs in this batch At the end of the job listing, there is a summary. Here is a sample: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended It shows total counts of jobs in the different possible states. Questions: For the sample above, when was the job submitted? For the sample above, was the job running or not yet? How can you tell?","title":"Viewing Your Own Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-everyones-jobs","text":"By default, the condor_q command shows your jobs only. To see everyone\u2019s jobs that are queued on the machine, add the -all option: username@ap1 $ condor_q -all How many jobs are queued in total (i.e., running or waiting to run)? How many jobs from this submit machine are running right now?","title":"Viewing Everyone\u2019s Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-jobs-without-the-default-batch-mode","text":"The condor_q output, by default, groups \"batches\" of jobs together (if they were submitted with the same submit file or \"jobbatchname\"). To see more information for EVERY job on a separate line of output, use the -nobatch option to condor_q : username@ap1 $ condor_q -all -nobatch How has the column information changed? (Below is an example of the top of the output.) -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 11:58:44 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18801.0 alice 7/12 09:58 0+00:00:00 I 0 0.0 run_ffmpeg.sh 18997.0 s16_martincum 7/12 10:59 0+00:00:32 I 0 733.0 runR.pl 1_0 run_perm.R 1 0 10 19027.5 s16_martincum 7/12 11:06 0+00:09:20 I 0 2198.0 runR.pl 1_5 run_perm.R 1 5 1000 The -nobatch output shows a line for every job and consists of 8 columns: Col Example Meaning ID 18801.0 Job ID, which is the cluster , a dot character ( . ), and the process OWNER alice The user ID of the user who submitted the job SUBMITTED 7/12 09:58 The date and time when the job was submitted RUN_TIME 0+00:00:00 Total time spent running so far (days + hours:minutes:seconds) ST I Status of job: I is Idle (waiting to run), R is Running, H is Held, etc. PRI 0 Job priority (see next lecture) SIZE 0.0 Current run-time memory usage, in MB CMD run_ffmpeg.sh The executable command (with arguments) to be run In future exercises, you'll want to switch between condor_q and condor_q -nobatch to see different types of information about YOUR jobs.","title":"Viewing Jobs without the Default \"batch\" Mode"},{"location":"materials/htcondor/part1-ex2-commands/#extra-information","text":"Both condor_status and condor_q have many command-line options, some of which significantly change their output. You will explore a few of the most useful options in future exercises, but if you want to experiment now, go ahead! There are a few ways to learn more about the commands: Use the (brief) built-in help for the commands, e.g.: condor_q -h Read the installed man(ual) pages for the commands, e.g.: man condor_q Find the command in the online manual ; note: the text online is the same as the man text, only formatted for the web","title":"Extra Information"},{"location":"materials/htcondor/part1-ex3-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.3: Run Jobs! \u00b6 Exercise Goal \u00b6 The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs. Running Your First Job \u00b6 Nearly all of the time, when you want to run an HTCondor job, you first write an HTCondor submit file for it. In this section, you will run the same hostname command as in Exercise 1.1, but where this command will run within a job on one of the 'execute' servers on the PATh Facility's HTCondor pool. First, create an example submit file called hostname.sub using your favorite text editor (e.g., nano , vim ) and then transfer the following information to that file: executable = /bin/hostname output = hostname.out error = hostname.err log = hostname.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue Save your submit file using the name hostname.sub . Note You can name the HTCondor submit file using any filename. It's a good practice to always include the .sub extension, but it is not required. This is because the submit file is a simple text file that we are using to pass information to HTCondor. 
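Before moving on, it can be worth printing the file back to the screen to confirm it was saved the way you intended, for example: username@ap1 $ cat hostname.sub 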
The lines of the submit file have the following meanings: Submit Command Explanation executable The name of the program to run (relative to the directory from which you submit). output The filename where HTCondor will write the standard output from your job. error The filename where HTCondor will write the standard error from your job. This particular job is not likely to have any, but it is best to include this line for every job. log The filename where HTCondor will write information about your job run. While not required, it is a really good idea to have a log file for every job. request_* Tells HTCondor how many cpus and how much memory and disk we want, which is not much, because the 'hostname' executable is very small. queue Tells HTCondor to run your job with the settings above. Note that we are not using the arguments or transfer_input_files lines that were mentioned during lecture because the hostname program is all that needs to be transferred from the access point server, and we want to run it without any additional options. Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: username@ap1 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. The actual cluster number will be shown instead of NNNN . If, instead of the text above, there are error messages, read them carefully and then try to correct your submit file or ask for help. Notice that condor_submit returns back to the shell prompt right away. It does not wait for your job to run. Instead, as soon as it has finished submitting your job into the queue, the submit command finishes. View your job in the queue \u00b6 Now, use condor_q and condor_q -nobatch to watch for your job in the queue! You may not even catch the job in the R running state, because the hostname command runs very quickly. When the job itself is finished, it will 'leave' the queue and no longer be listed in the condor_q output. After the job finishes, check for the hostname output in hostname.out , which is where job information printed to the terminal screen will be printed for the job. username@ap1 $ cat hostname.out e171.chtc.wisc.edu The hostname.err file should be empty, unless there were issues running the hostname executable after it was transferred to the slot. The hostname.log is more complex and will be the focus of a later exercise. Running a Job With Arguments \u00b6 Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: username@ap1 $ sleep 60 In an HTCondor submit file, the program (or 'executable') name goes in the executable statement and all remaining arguments go into an arguments statement. For example, if the full command is: username@ap1 $ sleep 60 Then in the submit file, we would put the location of the \"sleep\" program (you can find it with which sleep ) as the job executable , and 60 as the job arguments : executable = /bin/sleep arguments = 60 Let\u2019s try a job submission with arguments. We will use the sleep command shown above, which does nothing (i.e., puts the job to sleep) for the specified number of seconds, then exits normally. It is convenient for simulating a job that takes a while to run. Create a new submit file and save the following text in it. 
executable = /bin/sleep arguments = 60 output = sleep.out error = sleep.err log = sleep.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue You can save the file using any name, but as a reminder, we recommend it uses the .sub file extension. Except for changing a few filenames, this submit file is nearly identical to the last one, except for the addition of the arguments line. Submit this new job to HTCondor. Again, watch for it to run using condor_q and condor_q -nobatch ; check once every 15 seconds or so. Once the job starts running, it will take about 1 minute to run (reminder: the sleep command is telling the job to do nothing for 60 seconds), so you should be able to see it running for a bit. When the job finishes, it will disappear from the queue, but there will be no output in the output or error files, because sleep does not produce any output. Running a Script Job From the Submit Directory \u00b6 So far, we have been running programs (executables) that come with the standard Linux system. More frequently, you will want to run a program that exists within your directory or perhaps a shell script of commands that you'd like to run within a job. In this example, you will write a shell script and a submit file that runs the shell script within a job: Put the following contents into a file named test-script.sh : #!/bin/sh # START echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'System: ' ` uname -spo ` echo \"Program: $0 \" echo \"Args: $* \" echo 'ls: ' ` ls ` # END Add executable permissions to the file (so that it can be run as a program): username@ap1 $ chmod +x test-script.sh Test your script from the command line: username@ap1 $ ./test-script.sh hello 42 Date: Mon Jul 1 14:03:56 CDT 2024 Host: path-ap2001 System: Linux x86_64 GNU/Linux Program: ./test-script.sh Args: hello 42 ls: hostname.err hostname.log hostname.out hostname.sub sleep.log sleep.sub test-script.sh This step is really important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. Further, debugging problems like this one is surprisingly difficult. So, if possible, test your executable and arguments as a command at the command-line first. Write the submit file (this should be getting easier by now): executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue In this example, the executable that was named in the submit file did not start with a / , so the location of the file is relative to the submit directory itself. In other words, in this format the executable must be in the same directory as the submit file. Note Blank lines between commands and spaces around the = do not matter to HTCondor. For example, this submit file is equivalent to the one above: executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus=1 request_memory=1GB request_disk=1GB queue Use whitespace to make things clear to you , the user. Submit the job, wait for it to finish, and check the standard output file (and standard error file, which should be empty). What do you notice about the lines returned for \"Program\" and \"ls\"? Remember that only files pertaining to this job will be in the job working directory on the execute point server. 
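For example, the Program line in script.out will typically show a path inside the job's scratch directory, ending in something like execute/dir_12345/condor_exec.exe rather than ./test-script.sh (the exact path and directory name will differ from run to run; this is only an illustration of the renaming described next). 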
You're also seeing the effects of HTCondor's need to standardize some filenames when running your job, though they are named as you expect in the submission directory (per the submit file contents). Extra Challenge \u00b6 Note There are Extra Challenges throughout the school curriculum. You may be better off coming back to these after you've completed all other exercises for your current working session. Below is a Python script that does something similar to the shell script above. Run this Python script using HTCondor. #!/usr/bin/env python3 \"\"\"Extra Challenge for OSG School Written by Tim Cartwright Submitted to CHTC by #YOUR_NAME# \"\"\" import getpass import os import platform import socket import sys import time arguments = None if len ( sys . argv ) > 1 : arguments = '\"' + ' ' . join ( sys . argv [ 1 :]) + '\"' print ( __doc__ , file = sys . stderr ) print ( 'Time :' , time . strftime ( '%Y-%m- %d ( %a ) %H:%M:%S %Z' )) print ( 'Host :' , getpass . getuser (), '@' , socket . gethostname ()) uname = platform . uname () print ( \"System :\" , uname [ 0 ], uname [ 2 ], uname [ 4 ]) print ( \"Version :\" , platform . python_version ()) print ( \"Program :\" , sys . executable ) print ( 'Script :' , os . path . abspath ( __file__ )) print ( 'Args :' , arguments )","title":"1.3 - Run jobs!"},{"location":"materials/htcondor/part1-ex3-jobs/#htc-exercise-13-run-jobs","text":"","title":"HTC Exercise 1.3: Run Jobs!"},{"location":"materials/htcondor/part1-ex3-jobs/#exercise-goal","text":"The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex3-jobs/#running-your-first-job","text":"Nearly all of the time, when you want to run an HTCondor job, you first write an HTCondor submit file for it. In this section, you will run the same hostname command as in Exercise 1.1, but where this command will run within a job on one of the 'execute' servers on the PATh Facility's HTCondor pool. First, create an example submit file called hostname.sub using your favorite text editor (e.g., nano , vim ) and then transfer the following information to that file: executable = /bin/hostname output = hostname.out error = hostname.err log = hostname.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue Save your submit file using the name hostname.sub . Note You can name the HTCondor submit file using any filename. It's a good practice to always include the .sub extension, but it is not required. This is because the submit file is a simple text file that we are using to pass information to HTCondor. The lines of the submit file have the following meanings: Submit Command Explanation executable The name of the program to run (relative to the directory from which you submit). output The filename where HTCondor will write the standard output from your job. error The filename where HTCondor will write the standard error from your job. This particular job is not likely to have any, but it is best to include this line for every job. log The filename where HTCondor will write information about your job run. While not required, it is a really good idea to have a log file for every job. 
request_* Tells HTCondor how many cpus and how much memory and disk we want, which is not much, because the 'hostname' executable is very small. queue Tells HTCondor to run your job with the settings above. Note that we are not using the arguments or transfer_input_files lines that were mentioned during lecture because the hostname program is all that needs to be transferred from the access point server, and we want to run it without any additional options. Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: username@ap1 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. The actual cluster number will be shown instead of NNNN . If, instead of the text above, there are error messages, read them carefully and then try to correct your submit file or ask for help. Notice that condor_submit returns back to the shell prompt right away. It does not wait for your job to run. Instead, as soon as it has finished submitting your job into the queue, the submit command finishes.","title":"Running Your First Job"},{"location":"materials/htcondor/part1-ex3-jobs/#view-your-job-in-the-queue","text":"Now, use condor_q and condor_q -nobatch to watch for your job in the queue! You may not even catch the job in the R running state, because the hostname command runs very quickly. When the job itself is finished, it will 'leave' the queue and no longer be listed in the condor_q output. After the job finishes, check for the hostname output in hostname.out , which is where job information printed to the terminal screen will be printed for the job. username@ap1 $ cat hostname.out e171.chtc.wisc.edu The hostname.err file should be empty, unless there were issues running the hostname executable after it was transferred to the slot. The hostname.log is more complex and will be the focus of a later exercise.","title":"View your job in the queue"},{"location":"materials/htcondor/part1-ex3-jobs/#running-a-job-with-arguments","text":"Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: username@ap1 $ sleep 60 In an HTCondor submit file, the program (or 'executable') name goes in the executable statement and all remaining arguments go into an arguments statement. For example, if the full command is: username@ap1 $ sleep 60 Then in the submit file, we would put the location of the \"sleep\" program (you can find it with which sleep ) as the job executable , and 60 as the job arguments : executable = /bin/sleep arguments = 60 Let\u2019s try a job submission with arguments. We will use the sleep command shown above, which does nothing (i.e., puts the job to sleep) for the specified number of seconds, then exits normally. It is convenient for simulating a job that takes a while to run. Create a new submit file and save the following text in it. executable = /bin/sleep arguments = 60 output = sleep.out error = sleep.err log = sleep.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue You can save the file using any name, but as a reminder, we recommend it uses the .sub file extension. Except for changing a few filenames, this submit file is nearly identical to the last one, except for the addition of the arguments line. Submit this new job to HTCondor. Again, watch for it to run using condor_q and condor_q -nobatch ; check once every 15 seconds or so. 
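If you would rather not re-run condor_q by hand, recent versions of HTCondor also include a condor_watch_q command that keeps refreshing the display until you press Control-C (a convenience, not a requirement; plain condor_q works fine): username@ap1 $ condor_watch_q 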
Once the job starts running, it will take about 1 minute to run (reminder: the sleep command is telling the job to do nothing for 60 seconds), so you should be able to see it running for a bit. When the job finishes, it will disappear from the queue, but there will be no output in the output or error files, because sleep does not produce any output.","title":"Running a Job With Arguments"},{"location":"materials/htcondor/part1-ex3-jobs/#running-a-script-job-from-the-submit-directory","text":"So far, we have been running programs (executables) that come with the standard Linux system. More frequently, you will want to run a program that exists within your directory or perhaps a shell script of commands that you'd like to run within a job. In this example, you will write a shell script and a submit file that runs the shell script within a job: Put the following contents into a file named test-script.sh : #!/bin/sh # START echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'System: ' ` uname -spo ` echo \"Program: $0 \" echo \"Args: $* \" echo 'ls: ' ` ls ` # END Add executable permissions to the file (so that it can be run as a program): username@ap1 $ chmod +x test-script.sh Test your script from the command line: username@ap1 $ ./test-script.sh hello 42 Date: Mon Jul 1 14:03:56 CDT 2024 Host: path-ap2001 System: Linux x86_64 GNU/Linux Program: ./test-script.sh Args: hello 42 ls: hostname.err hostname.log hostname.out hostname.sub sleep.log sleep.sub test-script.sh This step is really important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. Further, debugging problems like this one is surprisingly difficult. So, if possible, test your executable and arguments as a command at the command-line first. Write the submit file (this should be getting easier by now): executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue In this example, the executable that was named in the submit file did not start with a / , so the location of the file is relative to the submit directory itself. In other words, in this format the executable must be in the same directory as the submit file. Note Blank lines between commands and spaces around the = do not matter to HTCondor. For example, this submit file is equivalent to the one above: executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus=1 request_memory=1GB request_disk=1GB queue Use whitespace to make things clear to you , the user. Submit the job, wait for it to finish, and check the standard output file (and standard error file, which should be empty). What do you notice about the lines returned for \"Program\" and \"ls\"? Remember that only files pertaining to this job will be in the job working directory on the execute point server. You're also seeing the effects of HTCondor's need to standardize some filenames when running your job, though they are named as you expect in the submission directory (per the submit file contents).","title":"Running a Script Job From the Submit Directory"},{"location":"materials/htcondor/part1-ex3-jobs/#extra-challenge","text":"Note There are Extra Challenges throughout the school curriculum. You may be better off coming back to these after you've completed all other exercises for your current working session. 
Below is a Python script that does something similar to the shell script above. Run this Python script using HTCondor. #!/usr/bin/env python3 \"\"\"Extra Challenge for OSG School Written by Tim Cartwright Submitted to CHTC by #YOUR_NAME# \"\"\" import getpass import os import platform import socket import sys import time arguments = None if len ( sys . argv ) > 1 : arguments = '\"' + ' ' . join ( sys . argv [ 1 :]) + '\"' print ( __doc__ , file = sys . stderr ) print ( 'Time :' , time . strftime ( '%Y-%m- %d ( %a ) %H:%M:%S %Z' )) print ( 'Host :' , getpass . getuser (), '@' , socket . gethostname ()) uname = platform . uname () print ( \"System :\" , uname [ 0 ], uname [ 2 ], uname [ 4 ]) print ( \"Version :\" , platform . python_version ()) print ( \"Program :\" , sys . executable ) print ( 'Script :' , os . path . abspath ( __file__ )) print ( 'Args :' , arguments )","title":"Extra Challenge"},{"location":"materials/htcondor/part1-ex4-logs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.4: Read and Interpret Log Files \u00b6 Exercise Goal \u00b6 The goal of this exercise is to learn how to understand the contents of a job's log file, which is essentially a \"history\" of the steps HTCondor took to run your job. If you suspect something has gone wrong with your job, the log is the a great place to start looking for indications of whether things might have gone wrong (in addition to the .err file). This exercise is short, but you'll want to at least read over it before moving on. Reading a Log File \u00b6 For this exercise, we can examine a log file for any previous job that you have run. The example output below is based on the sleep 60 job. A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are all of the event headings from the sleep job log (detailed output in between headings has been omitted here): 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> 040 (5739.000.000) 2024-07-10 10:45:10 Started transferring input files 040 (5739.000.000) 2024-07-10 10:45:10 Finished transferring input files 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 040 (5739.000.000) 2024-07-10 10:45:20 Started transferring output files 040 (5739.000.000) 2024-07-10 10:45:20 Finished transferring output files 006 (5739.000.000) 2024-07-10 10:46:11 Image size of job updated: 4072 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. There is a lot of extra information in those lines, but you can see: The job ID: cluster 5739, process 0 (written 000 ) The date and local time of each event A brief description of the event: submission, execution, some information updates, and termination Some events provide no information in addition to the heading. For example: 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> ... Note Each event ends with a line that contains only 3 dots: ... However, some lines have additional information to help you quickly understand where and how your job is running. 
For example: 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w CondorScratchDir = \"/pilot/osgvo-pilot-2q71K9/execute/dir_9316\" Cpus = 1 Disk = 174321444 GLIDEIN_ResourceName = \"WISC-PATH-IDPL-EP\" GPUs = 0 Memory = 8192 ... The SlotName is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in GLIDEIN_ResourceName The CondorScratchDir is the name of the scratch directory that was created by HTCondor for your job to run inside The Cpu , GPUs , Disk , and Memory values provide the maximum amount of each resource your job has used while running Another example of is the periodic update: 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 1 - MemoryUsage of job (MB) 72 - ResidentSetSize of job (KB) ... These updates record the amount of memory that the job is using on the execute machine. This can be helpful information, so that in future runs of the job, you can tell HTCondor how much memory you will need. The job termination event includes a lot of very useful information: 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 27848 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 27848 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 40 30 4203309 Memory (MB) : 1 1 1 Job terminated of its own accord at 2024-07-10 10:46:11 with exit-code 0. ... Probably the most interesting information is: The return value or exit code ( 0 here, means the executable completed and didn't indicate any internal errors; non-zero usually means failure) The total number of bytes transferred each way, which could be useful if your network is slow The Partitionable Resources table, especially disk and memory usage, which will inform larger submissions. There are many other kinds of events, but the ones above will occur in almost every job log. Understanding When Job Log Events Are Written \u00b6 When are events written to the job log file? Let\u2019s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive. Change the sleep job submit file, so that the job sleeps for 2 minutes (= 120 seconds) Submit the updated sleep job As soon as the condor_submit command finishes, hit the return key a few times, to create some blank lines Right away, run a command to show the log file and keep showing updates as they occur: username@ap1 $ tail -f sleep.log Watch the output carefully. When do events appear in the log file? After the termination event appears, press Control-C to end the tail command and return to the shell prompt. Understanding How HTCondor Writes Files \u00b6 When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let\u2019s find out! For this exercise, we can use the hostname job from earlier. Edit the hostname submit file so that it uses new and unique filenames for output, error, and log files. Alternatively, delete any existing output, error, and log files from previous runs of the hostname job. 
Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture) Wait for all three jobs to finish Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines? Examine the log file\u2026 carefully: What happened there? Pay close attention to the times and job IDs of the events. For further clarification about how HTCondor handles these files, reach out to your mentor or one of the other school staff.","title":"1.4 - Read and interpret log files"},{"location":"materials/htcondor/part1-ex4-logs/#htc-exercise-14-read-and-interpret-log-files","text":"","title":"HTC Exercise 1.4: Read and Interpret Log Files"},{"location":"materials/htcondor/part1-ex4-logs/#exercise-goal","text":"The goal of this exercise is to learn how to understand the contents of a job's log file, which is essentially a \"history\" of the steps HTCondor took to run your job. If you suspect something has gone wrong with your job, the log is the a great place to start looking for indications of whether things might have gone wrong (in addition to the .err file). This exercise is short, but you'll want to at least read over it before moving on.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex4-logs/#reading-a-log-file","text":"For this exercise, we can examine a log file for any previous job that you have run. The example output below is based on the sleep 60 job. A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are all of the event headings from the sleep job log (detailed output in between headings has been omitted here): 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> 040 (5739.000.000) 2024-07-10 10:45:10 Started transferring input files 040 (5739.000.000) 2024-07-10 10:45:10 Finished transferring input files 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 040 (5739.000.000) 2024-07-10 10:45:20 Started transferring output files 040 (5739.000.000) 2024-07-10 10:45:20 Finished transferring output files 006 (5739.000.000) 2024-07-10 10:46:11 Image size of job updated: 4072 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. There is a lot of extra information in those lines, but you can see: The job ID: cluster 5739, process 0 (written 000 ) The date and local time of each event A brief description of the event: submission, execution, some information updates, and termination Some events provide no information in addition to the heading. For example: 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> ... Note Each event ends with a line that contains only 3 dots: ... However, some lines have additional information to help you quickly understand where and how your job is running. For example: 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w CondorScratchDir = \"/pilot/osgvo-pilot-2q71K9/execute/dir_9316\" Cpus = 1 Disk = 174321444 GLIDEIN_ResourceName = \"WISC-PATH-IDPL-EP\" GPUs = 0 Memory = 8192 ... 
The SlotName is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in GLIDEIN_ResourceName The CondorScratchDir is the name of the scratch directory that was created by HTCondor for your job to run inside The Cpu , GPUs , Disk , and Memory values provide the maximum amount of each resource your job has used while running Another example of is the periodic update: 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 1 - MemoryUsage of job (MB) 72 - ResidentSetSize of job (KB) ... These updates record the amount of memory that the job is using on the execute machine. This can be helpful information, so that in future runs of the job, you can tell HTCondor how much memory you will need. The job termination event includes a lot of very useful information: 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 27848 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 27848 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 40 30 4203309 Memory (MB) : 1 1 1 Job terminated of its own accord at 2024-07-10 10:46:11 with exit-code 0. ... Probably the most interesting information is: The return value or exit code ( 0 here, means the executable completed and didn't indicate any internal errors; non-zero usually means failure) The total number of bytes transferred each way, which could be useful if your network is slow The Partitionable Resources table, especially disk and memory usage, which will inform larger submissions. There are many other kinds of events, but the ones above will occur in almost every job log.","title":"Reading a Log File"},{"location":"materials/htcondor/part1-ex4-logs/#understanding-when-job-log-events-are-written","text":"When are events written to the job log file? Let\u2019s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive. Change the sleep job submit file, so that the job sleeps for 2 minutes (= 120 seconds) Submit the updated sleep job As soon as the condor_submit command finishes, hit the return key a few times, to create some blank lines Right away, run a command to show the log file and keep showing updates as they occur: username@ap1 $ tail -f sleep.log Watch the output carefully. When do events appear in the log file? After the termination event appears, press Control-C to end the tail command and return to the shell prompt.","title":"Understanding When Job Log Events Are Written"},{"location":"materials/htcondor/part1-ex4-logs/#understanding-how-htcondor-writes-files","text":"When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let\u2019s find out! For this exercise, we can use the hostname job from earlier. Edit the hostname submit file so that it uses new and unique filenames for output, error, and log files. Alternatively, delete any existing output, error, and log files from previous runs of the hostname job. 
Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture) Wait for all three jobs to finish Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines? Examine the log file\u2026 carefully: What happened there? Pay close attention to the times and job IDs of the events. For further clarification about how HTCondor handles these files, reach out to your mentor or one of the other school staff.","title":"Understanding How HTCondor Writes Files"},{"location":"materials/htcondor/part1-ex5-request/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.5: Declare Resource Needs \u00b6 The goal of this exercise is to demonstrate how to test and tune the request_X statements in a submit file for when you don't know what resources your job needs. There are three special resource request statements that you can use (optionally) in an HTCondor submit file: request_cpus for the number of CPUs your job will use. A value of \"1\" is always a great starting point, but some software can use more than \"1\" (however, most softwares will use an argument to control this number). request_memory for the maximum amount of run-time memory your job may use. request_disk for the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job). HTCondor defaults to certain reasonable values for these request settings, so you do not need to use them to get small jobs to run. However, it is in YOUR best interest to always estimate resource requests before submitting any job, and to definitely tune your requests before submitting multiple jobs. In many HTCondor pools: If your job goes over the request values, it may be removed from the execute machine and held (status 'H' in the condor_q output, awaiting action on your part) without saving any partial job output files. So it is a disadvantage to not declare your resource needs or if you underestimate them. Conversely, if you overestimate them by too much, your jobs will match to fewer slots and take longer to match to a slot to begin running. Additionally, by hogging up resources that you don't need, other users may be deprived of the resources they require. In the long run, it works better for all users of the pool if you declare what you really need. But how do you know what to request? In particular, we are concerned with memory and disk here; requesting multiple CPUs and using them is covered a bit in later school materials, but true HTC splits work up into jobs that each use as few CPU cores as possible (one CPU core is always best to have the most jobs running). Determining Resource Needs Before Running Any Jobs \u00b6 Note If you are running short on time, you can skip to \"Determining Resource Needs By Running Test Jobs\", below, but try to come back and read over this part at some point. It can be very difficult to predict the memory needs of your running program without running tests. Typically, the memory size of a job changes over time, making the task even trickier. If you have knowledge ahead of time about your job\u2019s maximum memory needs, use that, or maybe a number that's just a bit higher, to ensure your job has enough memory to complete. 
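For example, if you believe your program peaks at about 3 GB of memory, you might round up and put request_memory = 4GB in the submit file (an illustrative value, not a rule), then adjust it after you see what the job log reports. 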
If this is your first time running your job, you can request a fairly large amount of memory (as high as what's on your laptop or other server, if you know your program can run without crashing) for a first test job, OR you can run the program locally and \"watch\" it: Examining a Running Program on a Local Computer \u00b6 When working on a shared access point, you should not run computationally-intensive work because it can use resources needed by HTCondor to manage the queue for all uses. However, you may have access to other computers (your laptop, for example, or another server) where you can observe the memory usage of a program. The downside is that you'll have to watch a program run for essentially the entire time, to make sure you catch the maximum memory usage. For Memory: \u00b6 On Mac and Windows, for example, the \"Activity Monitor\" and \"Task Manager\" applications may be useful. On a Mac or Linux system, you can use the ps command or the top command in the Terminal to watch a running program and see (roughly) how much memory it is using. Full coverage of these tools is beyond the scope of this exercise, but here are two quick examples: Using ps : username@ap1 $ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash alice 25864 0.0 0.0 65624 996 pts/0 R+ 13:52 0:00 ps ux alice 30052 0.0 0.0 90720 2456 ? S Jun22 0:00 sshd: alice@pts/2 alice 30053 0.0 0.0 66096 1624 pts/2 Ss+ Jun22 0:00 -bash The Resident Set Size ( RSS ) column, highlighted above, gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value. Using top : username@ap1 $ top -u top - 13:55:31 up 11 days, 20:59, 5 users, load average: 0.12, 0.12, 0.09 Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 98.5%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 4001440k total, 3558028k used, 443412k free, 258568k buffers Swap: 4194296k total, 148k used, 4194148k free, 2960760k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24342 alice 15 0 90224 1864 1096 S 0.0 0.0 0:00.26 sshd 24343 alice 15 0 66096 1580 1232 S 0.0 0.0 0:00.07 bash 25927 alice 15 0 12760 1196 836 R 0.0 0.0 0:00.01 top 30052 alice 16 0 90720 2456 1112 S 0.0 0.1 0:00.69 sshd 30053 alice 18 0 66096 1624 1236 S 0.0 0.0 0:00.37 bash The top command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter q to quit the interactive display. Again, the highlighted RES column shows an approximation of memory usage. For Disk: \u00b6 Determining disk needs may be a bit easier, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts everything in your job sandbox toward your job\u2019s disk usage: The executable itself All \"input\" files (anything else that gets transferred TO the job, even if you don't think of it as \"input\") All files created during the job (broadly defined as \"output\"), including the captured standard output and error files that you list in the submit file. All temporary files created in the sandbox, even if they get deleted by the executable before it's done. 
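As a rough, made-up illustration: an executable of 1 MB, an input file of 50 MB, a 200 MB results file, and a 100 MB temporary file add up to about 351 MB, so a request_disk of around 400 MB would be a sensible starting point for that job. 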
If you can run your program within a single directory on a local computer (not on the access point), you should be able to view files and their sizes with the ls and du commands. Determining Resource Needs By Running Test Jobs (BEST) \u00b6 Despite the techniques mentioned above, by far the easiest approach to measuring your job\u2019s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs. For example, here is a strange Python script that does not do anything useful, but consumes some real resources while running: #!/usr/bin/env python3 import time import os size = 1000000 numbers = [] for i in range ( size ): numbers . append ( str ( i )) with open ( 'numbers.txt' , 'w' ) as tempfile : tempfile . write ( ' ' . join ( numbers )) time . sleep ( 60 ) Without trying to figure out what this code does or how many resources it uses, create a submit file for it, and run it once with HTCondor, starting with somewhat high memory requests (\"1GB\" for memory and disk is a good starting point, unless you think the job will use far more). When it is done, examine the log file. In particular, we care about these lines: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 6739 1048576 8022934 Memory (MB) : 3 1024 1024 So, now we know that HTCondor saw that the job used 6,739 KB of disk (= about 6.5 MB) and 3 MB of memory! This is a great technique for determining the real resource needs of your job. If you think resource needs vary from run to run, submit a few sample jobs and look at all the results. You should round up your resource requests a little, just in case your job occasionally uses more resources. Setting Resource Requirements \u00b6 Once you know your job\u2019s resource requirements, it is easy to declare them in your submit file. For example, taking our results above as an example, we might slightly increase our requests above what was used, just to be safe: # rounded up from 3 MB request_memory = 4MB # rounded up from 6.5 MB request_disk = 7MB Pay close attention to units: Without explicit units, request_memory is in MB (megabytes) Without explicit units, request_disk is in KB (kilobytes) Allowable units are KB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes) HTCondor translates these requirements into attributes that become part of the job's requirements expression. However, do not put your CPU, memory, and disk requirements directly into the requirements expression; use the request_XXX statements instead. If you still have time in this working session, Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used. After changing the requirements in your submit file, did your job run successfully? If not, why? (Hint: HTCondor polls a job's resource use on a timer. How long are these jobs running for?)","title":"1.5 - Determining resource needs"},{"location":"materials/htcondor/part1-ex5-request/#htc-exercise-15-declare-resource-needs","text":"The goal of this exercise is to demonstrate how to test and tune the request_X statements in a submit file for when you don't know what resources your job needs. There are three special resource request statements that you can use (optionally) in an HTCondor submit file: request_cpus for the number of CPUs your job will use. 
A value of \"1\" is always a great starting point, but some software can use more than \"1\" (however, most softwares will use an argument to control this number). request_memory for the maximum amount of run-time memory your job may use. request_disk for the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job). HTCondor defaults to certain reasonable values for these request settings, so you do not need to use them to get small jobs to run. However, it is in YOUR best interest to always estimate resource requests before submitting any job, and to definitely tune your requests before submitting multiple jobs. In many HTCondor pools: If your job goes over the request values, it may be removed from the execute machine and held (status 'H' in the condor_q output, awaiting action on your part) without saving any partial job output files. So it is a disadvantage to not declare your resource needs or if you underestimate them. Conversely, if you overestimate them by too much, your jobs will match to fewer slots and take longer to match to a slot to begin running. Additionally, by hogging up resources that you don't need, other users may be deprived of the resources they require. In the long run, it works better for all users of the pool if you declare what you really need. But how do you know what to request? In particular, we are concerned with memory and disk here; requesting multiple CPUs and using them is covered a bit in later school materials, but true HTC splits work up into jobs that each use as few CPU cores as possible (one CPU core is always best to have the most jobs running).","title":"HTC Exercise 1.5: Declare Resource Needs"},{"location":"materials/htcondor/part1-ex5-request/#determining-resource-needs-before-running-any-jobs","text":"Note If you are running short on time, you can skip to \"Determining Resource Needs By Running Test Jobs\", below, but try to come back and read over this part at some point. It can be very difficult to predict the memory needs of your running program without running tests. Typically, the memory size of a job changes over time, making the task even trickier. If you have knowledge ahead of time about your job\u2019s maximum memory needs, use that, or maybe a number that's just a bit higher, to ensure your job has enough memory to complete. If this is your first time running your job, you can request a fairly large amount of memory (as high as what's on your laptop or other server, if you know your program can run without crashing) for a first test job, OR you can run the program locally and \"watch\" it:","title":"Determining Resource Needs Before Running Any Jobs"},{"location":"materials/htcondor/part1-ex5-request/#examining-a-running-program-on-a-local-computer","text":"When working on a shared access point, you should not run computationally-intensive work because it can use resources needed by HTCondor to manage the queue for all uses. However, you may have access to other computers (your laptop, for example, or another server) where you can observe the memory usage of a program. The downside is that you'll have to watch a program run for essentially the entire time, to make sure you catch the maximum memory usage.","title":"Examining a Running Program on a Local Computer"},{"location":"materials/htcondor/part1-ex5-request/#for-memory","text":"On Mac and Windows, for example, the \"Activity Monitor\" and \"Task Manager\" applications may be useful. 
On a Mac or Linux system, you can use the ps command or the top command in the Terminal to watch a running program and see (roughly) how much memory it is using. Full coverage of these tools is beyond the scope of this exercise, but here are two quick examples: Using ps : username@ap1 $ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash alice 25864 0.0 0.0 65624 996 pts/0 R+ 13:52 0:00 ps ux alice 30052 0.0 0.0 90720 2456 ? S Jun22 0:00 sshd: alice@pts/2 alice 30053 0.0 0.0 66096 1624 pts/2 Ss+ Jun22 0:00 -bash The Resident Set Size ( RSS ) column, highlighted above, gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value. Using top : username@ap1 $ top -u top - 13:55:31 up 11 days, 20:59, 5 users, load average: 0.12, 0.12, 0.09 Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 98.5%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 4001440k total, 3558028k used, 443412k free, 258568k buffers Swap: 4194296k total, 148k used, 4194148k free, 2960760k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24342 alice 15 0 90224 1864 1096 S 0.0 0.0 0:00.26 sshd 24343 alice 15 0 66096 1580 1232 S 0.0 0.0 0:00.07 bash 25927 alice 15 0 12760 1196 836 R 0.0 0.0 0:00.01 top 30052 alice 16 0 90720 2456 1112 S 0.0 0.1 0:00.69 sshd 30053 alice 18 0 66096 1624 1236 S 0.0 0.0 0:00.37 bash The top command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter q to quit the interactive display. Again, the highlighted RES column shows an approximation of memory usage.","title":"For Memory:"},{"location":"materials/htcondor/part1-ex5-request/#for-disk","text":"Determining disk needs may be a bit easier, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts everything in your job sandbox toward your job\u2019s disk usage: The executable itself All \"input\" files (anything else that gets transferred TO the job, even if you don't think of it as \"input\") All files created during the job (broadly defined as \"output\"), including the captured standard output and error files that you list in the submit file. All temporary files created in the sandbox, even if they get deleted by the executable before it's done. If you can run your program within a single directory on a local computer (not on the access point), you should be able to view files and their sizes with the ls and du commands.","title":"For Disk:"},{"location":"materials/htcondor/part1-ex5-request/#determining-resource-needs-by-running-test-jobs-best","text":"Despite the techniques mentioned above, by far the easiest approach to measuring your job\u2019s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs. For example, here is a strange Python script that does not do anything useful, but consumes some real resources while running: #!/usr/bin/env python3 import time import os size = 1000000 numbers = [] for i in range ( size ): numbers . append ( str ( i )) with open ( 'numbers.txt' , 'w' ) as tempfile : tempfile . write ( ' ' . 
join ( numbers )) time . sleep ( 60 ) Without trying to figure out what this code does or how many resources it uses, create a submit file for it, and run it once with HTCondor, starting with somewhat high memory requests (\"1GB\" for memory and disk is a good starting point, unless you think the job will use far more). When it is done, examine the log file. In particular, we care about these lines: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 6739 1048576 8022934 Memory (MB) : 3 1024 1024 So, now we know that HTCondor saw that the job used 6,739 KB of disk (= about 6.5 MB) and 3 MB of memory! This is a great technique for determining the real resource needs of your job. If you think resource needs vary from run to run, submit a few sample jobs and look at all the results. You should round up your resource requests a little, just in case your job occasionally uses more resources.","title":"Determining Resource Needs By Running Test Jobs (BEST)"},{"location":"materials/htcondor/part1-ex5-request/#setting-resource-requirements","text":"Once you know your job\u2019s resource requirements, it is easy to declare them in your submit file. For example, taking our results above as an example, we might slightly increase our requests above what was used, just to be safe: # rounded up from 3 MB request_memory = 4MB # rounded up from 6.5 MB request_disk = 7MB Pay close attention to units: Without explicit units, request_memory is in MB (megabytes) Without explicit units, request_disk is in KB (kilobytes) Allowable units are KB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes) HTCondor translates these requirements into attributes that become part of the job's requirements expression. However, do not put your CPU, memory, and disk requirements directly into the requirements expression; use the request_XXX statements instead. If you still have time in this working session, Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used. After changing the requirements in your submit file, did your job run successfully? If not, why? (Hint: HTCondor polls a job's resource use on a timer. How long are these jobs running for?)","title":"Setting Resource Requirements"},{"location":"materials/htcondor/part1-ex6-remove/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.6: Remove Jobs From the Queue \u00b6 Exercise Goal \u00b6 The goal of this exercise is to show you how to remove jobs from the queue. This is helpful if you make a mistake, do not want to wait for a job to complete, or otherwise need to fix things. For example, if some test jobs go on hold for using too much memory or disk, you may want to just remove them, edit the submit files, and then submit again. Skip this exercise and come back to it if you are short on time, or until you need to remove jobs for other exercises Note Please remember to remove any jobs from the queue that you are no longer interested in. Otherwise, the queue will start to get very long with jobs that will waste resources (and decrease your priority), or that may never run (if they're on hold, or have other issues keeping them from matching). This exercise is very short, but if you are out of time, you can come back to it later. Removing a Job or Cluster From the Queue \u00b6 To practice removing jobs from the queue, you need a job in the queue! 
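As a reminder of where the job ID comes from, condor_submit prints the new cluster number when it accepts a job; the output looks roughly like this (sleep.sub and the cluster number are placeholders):

username@ap1 $ condor_submit sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 5759.
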
Submit a job from an earlier exercise Determine the job ID ( cluster.process ) from the condor_submit output or from condor_q Remove the job: username@ap1 $ condor_rm Use the full job ID this time, e.g. 5759.0 . Did the job leave the queue immediately? If not, about how long did it take? So far, we have created job clusters that contain only one job process (the .0 part of the job ID). That will change soon, so it is good to know how to remove a specific job ID. However, it is possible to remove all jobs that are part of a cluster at once. Simply omit the job process (the .0 part of the job ID) in the condor_rm command: username@ap1 $ condor_rm Finally, you can include many job clusters and full job IDs in a single condor_rm command. For example: username@ap1 $ condor_rm 5768 5769 5770 .0 5771 .2 Removing All of Your Jobs \u00b6 If you really want to remove all of your jobs at once, you can do that with: username@ap1 $ condor_rm If you want to test it: (optional, though you'll likely need this in the future) Quickly submit several jobs from past exercises View the jobs in the queue with condor_q Remove them all with the above command Use condor_q to track progress In case you are wondering, you can remove only your own jobs. HTCondor administrators can remove anyone\u2019s jobs, so be nice to them.","title":"1.6 - Remove jobs from the queue"},{"location":"materials/htcondor/part1-ex6-remove/#htc-exercise-16-remove-jobs-from-the-queue","text":"","title":"HTC Exercise 1.6: Remove Jobs From the Queue"},{"location":"materials/htcondor/part1-ex6-remove/#exercise-goal","text":"The goal of this exercise is to show you how to remove jobs from the queue. This is helpful if you make a mistake, do not want to wait for a job to complete, or otherwise need to fix things. For example, if some test jobs go on hold for using too much memory or disk, you may want to just remove them, edit the submit files, and then submit again. Skip this exercise and come back to it if you are short on time, or until you need to remove jobs for other exercises Note Please remember to remove any jobs from the queue that you are no longer interested in. Otherwise, the queue will start to get very long with jobs that will waste resources (and decrease your priority), or that may never run (if they're on hold, or have other issues keeping them from matching). This exercise is very short, but if you are out of time, you can come back to it later.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex6-remove/#removing-a-job-or-cluster-from-the-queue","text":"To practice removing jobs from the queue, you need a job in the queue! Submit a job from an earlier exercise Determine the job ID ( cluster.process ) from the condor_submit output or from condor_q Remove the job: username@ap1 $ condor_rm Use the full job ID this time, e.g. 5759.0 . Did the job leave the queue immediately? If not, about how long did it take? So far, we have created job clusters that contain only one job process (the .0 part of the job ID). That will change soon, so it is good to know how to remove a specific job ID. However, it is possible to remove all jobs that are part of a cluster at once. Simply omit the job process (the .0 part of the job ID) in the condor_rm command: username@ap1 $ condor_rm Finally, you can include many job clusters and full job IDs in a single condor_rm command. 
For example: username@ap1 $ condor_rm 5768 5769 5770 .0 5771 .2","title":"Removing a Job or Cluster From the Queue"},{"location":"materials/htcondor/part1-ex6-remove/#removing-all-of-your-jobs","text":"If you really want to remove all of your jobs at once, you can do that with: username@ap1 $ condor_rm If you want to test it: (optional, though you'll likely need this in the future) Quickly submit several jobs from past exercises View the jobs in the queue with condor_q Remove them all with the above command Use condor_q to track progress In case you are wondering, you can remove only your own jobs. HTCondor administrators can remove anyone\u2019s jobs, so be nice to them.","title":"Removing All of Your Jobs"},{"location":"materials/htcondor/part1-ex7-compile/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Bonus Exercise 1.7: Compile and Run Some C Code \u00b6 The goal of this exercise is to show that compiled code works just fine in HTCondor. It is mainly of interest to people who have their own C code to run (or C++, or really any compiled code, although Java would be handled a bit differently). Preparing a C Executable \u00b6 When preparing a C program for HTCondor, it is best to compile and link the executable statically, so that it does not depend on external libraries and their particular versions. Why is this important? When your compiled C program is sent to another machine for execution, that machine may not have the same libraries that you have on your submit machine (or wherever you compile the program). If the libraries are not available or are the wrong versions, your program may fail or, perhaps worse, silently produce the wrong results. Here is a simple C program to try using (thanks, Alain Roy): #include #include #include int main ( int argc , char ** argv ) { int sleep_time ; int input ; int failure ; if ( argc != 3 ) { printf ( \"Usage: simple \\n \" ); failure = 1 ; } else { sleep_time = atoi ( argv [ 1 ]); input = atoi ( argv [ 2 ]); printf ( \"Thinking really hard for %d seconds... \\n \" , sleep_time ); sleep ( sleep_time ); printf ( \"We calculated: %d \\n \" , input * 2 ); failure = 0 ; } return failure ; } Save that code to a file, for example, simple.c . Compile the program with static linking: username@ap1 $ gcc -static -o simple simple.c As always, test that you can run your command from the command line first. First, without arguments to make sure it fails correctly: username@ap1 $ ./simple and then with valid arguments: username@ap1 $ ./simple 5 21 Running a Compiled C Program \u00b6 Running the compiled program is no different than running any other program. Here is a submit file for the C program (call it simple.sub): executable = simple arguments = \"60 64\" output = c-program.out error = c-program.err log = c-program.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1MB queue Then submit the job as usual! In summary, it is easy to work with statically linked compiled code. It is possible to handle dynamically linked compiled code, but it is trickier. We will only mention this topic briefly during the lecture on Software.","title":"Bonus Exercise 1.7 - Compile and run some C code"},{"location":"materials/htcondor/part1-ex7-compile/#htc-bonus-exercise-17-compile-and-run-some-c-code","text":"The goal of this exercise is to show that compiled code works just fine in HTCondor. 
It is mainly of interest to people who have their own C code to run (or C++, or really any compiled code, although Java would be handled a bit differently).","title":"HTC Bonus Exercise 1.7: Compile and Run Some C Code"},{"location":"materials/htcondor/part1-ex7-compile/#preparing-a-c-executable","text":"When preparing a C program for HTCondor, it is best to compile and link the executable statically, so that it does not depend on external libraries and their particular versions. Why is this important? When your compiled C program is sent to another machine for execution, that machine may not have the same libraries that you have on your submit machine (or wherever you compile the program). If the libraries are not available or are the wrong versions, your program may fail or, perhaps worse, silently produce the wrong results. Here is a simple C program to try using (thanks, Alain Roy): #include #include #include int main ( int argc , char ** argv ) { int sleep_time ; int input ; int failure ; if ( argc != 3 ) { printf ( \"Usage: simple \\n \" ); failure = 1 ; } else { sleep_time = atoi ( argv [ 1 ]); input = atoi ( argv [ 2 ]); printf ( \"Thinking really hard for %d seconds... \\n \" , sleep_time ); sleep ( sleep_time ); printf ( \"We calculated: %d \\n \" , input * 2 ); failure = 0 ; } return failure ; } Save that code to a file, for example, simple.c . Compile the program with static linking: username@ap1 $ gcc -static -o simple simple.c As always, test that you can run your command from the command line first. First, without arguments to make sure it fails correctly: username@ap1 $ ./simple and then with valid arguments: username@ap1 $ ./simple 5 21","title":"Preparing a C Executable"},{"location":"materials/htcondor/part1-ex7-compile/#running-a-compiled-c-program","text":"Running the compiled program is no different than running any other program. Here is a submit file for the C program (call it simple.sub): executable = simple arguments = \"60 64\" output = c-program.out error = c-program.err log = c-program.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1MB queue Then submit the job as usual! In summary, it is easy to work with statically linked compiled code. It is possible to handle dynamically linked compiled code, but it is trickier. We will only mention this topic briefly during the lecture on Software.","title":"Running a Compiled C Program"},{"location":"materials/htcondor/part1-ex8-queue/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 1.8: Explore condor_q \u00b6 The goal of this exercise is try out some of the most common options to the condor_q command, so that you can view jobs effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_q expert! Selecting Jobs \u00b6 The condor_q program has many options for selecting which jobs are listed. 
You have already seen that the default mode is to show only your jobs in \"batch\" mode: username@ap1 $ condor_q You've seen that you can view all jobs (all users) in the submit node's queue by using the -all argument: username@ap1 $ condor_q -all And you've seen that you can view more details about queued jobs, with each separate job on a single line using the -nobatch option: username@ap1 $ condor_q -nobatch username@ap1 $ condor_q -all -nobatch Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? username@ap1 $ condor_q To list just the jobs associated with a single cluster number: username@ap1 $ condor_q For example, if you want to see the jobs in cluster 5678 (i.e., 5678.0 , 5678.1 , etc.), you use condor_q 5678 . To list a specific job (i.e., cluster.process, as in 5678.0): username@ap1 $ condor_q For example, to see job ID 5678.1, you use condor_q 5678.1 . Note You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for all of the named clusters and/or job IDs are listed. Let\u2019s get some practice using condor_q selections! Using a previous exercise, submit several sleep jobs. List all jobs in the queue \u2014 are there others besides your own? Practice using all forms of condor_q that you have learned: List just your jobs, with and without batching. List a specific cluster. List a specific job ID. Try listing several users at once. Try listing several clusters and job IDs at once. When there are a variety of jobs in the queue, try combining a username and a different user's cluster or job ID in the same command \u2014 what happens? Viewing a Job ClassAd \u00b6 You may have wondered why it is useful to be able to list a single job ID using condor_q . By itself, it may not be that useful. But, in combination with another option, it is very useful! If you add the -long option to condor_q (or its short form, -l ), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80\u201390 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: username@ap1 $ condor_q -long The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from condor_q output (except with some line breaks added to the Requirements attribute): MyType = \"Job\" Err = \"sleep.err\" UserLog = \"/home/cat/intro-2.1-queue/sleep.log\" Requirements = ( IsOSGSchoolSlot =?= true ) && ( TARGET.Arch == \"X86_64\" ) && ( TARGET.OpSys == \"LINUX\" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer ) ClusterId = 2420 WhenToTransferOutput = \"ON_EXIT\" Owner = \"cat\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Out = \"sleep.out\" Cmd = \"/bin/sleep\" Arguments = \"120\" Note Attributes are listed in no particular order and may change from time to time. Do not assume anything about the order of attributes in condor_q output. See what you can find in a job ClassAd from your own job. Using a previous exercise, submit a sleep job that sleeps for at least 3 minutes (180 seconds). 
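If you no longer have a sleep submit file handy, a minimal sketch along these lines should do (the file names and resource values are placeholders):

executable = /bin/sleep
arguments = 180
output = sleep.out
error = sleep.err
log = sleep.log
request_cpus = 1
request_memory = 100MB
request_disk = 100MB
queue
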
Before the job executes, capture its ClassAd and save to a file: condor_q -l > classad-1.txt After the job starts execution but before it finishes, capture its ClassAd again and save to a file condor_q -l > classad-2.txt Now examine each saved ClassAd file. Here are a few things to look for: Can you find attributes that came from your submit file? (E.g., Cmd, Arguments, Out, Err, UserLog, and so forth) Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements) How many of the following attributes can you guess the meaning of? DiskUsage ImageSize BytesSent JobStatus Why Is My Job Not Running? \u00b6 Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help. To ask HTCondor why your job is not running, add the -better-analyze option to condor_q for the specific job. For example, for job ID 2423.0, the command is: username@ap1 $ condor_q -better-analyze 2423 .0 Of course, replace the job ID with your own. Let\u2019s submit a job that will never run and see what happens. Here is the submit file to use: executable = /bin/hostname output = norun.out error = norun.err log = norun.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_disk = 10MB request_memory = 8TB queue (Do you see what I did?) Save and submit this file. Run condor_q -better-analyze on the job ID. There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with some lines omitted): -- Schedd: ap1.facility.path-cc.io : <128.105.68.66:9618?... ... Job 98096.000 defines the following attributes: RequestDisk = 10240 RequestMemory = 8388608 The Requirements expression for job 98096.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [1] 11227 Target.OpSysMajorVer == 7 [9] 13098 TARGET.Disk >= RequestDisk [11] 0 TARGET.Memory >= RequestMemory No successful match recorded. Last failed match: Fri Jul 12 15:36:30 2019 Reason for last match failure: no match found 98096.000: Run analysis summary ignoring user priority. Of 710 machines, 710 are rejected by your job's requirements 0 reject your job because of their own requirements 0 match and are already running your jobs 0 match but are serving other users 0 are able to run your job ... At the end of the summary, condor_q provides a breakdown of how machines and their own requirements match against my own job's requirements. 710 total machines were considered above, and all of them were rejected based on my job's requirements . In other words, I am asking for something that is not available. But what? Further up in the output, there is an analysis of the job's requirements, along with how many slots within the pool match each of those requirements. The example above reports that 13098 slots match our small disk request request, but none of the slots matched the TARGET.Memory >= RequestMemory condition. The output also reports the value used for the RequestMemory attribute: my job asked for 8 terabytes of memory (8,388,608 MB) -- of course no machines matched that part of the expression! That's a lot of memory on today's machines. The output from condor_q -analyze (and condor_q -better-analyze ) may be helpful or it may not be, depending on your exact case. The example above was constructed so that it would be obvious what the problem was. 
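In a constructed case like this one, the remedy is just as obvious: edit the submit file to ask for a realistic amount of memory and resubmit. For instance, assuming something on the order of gigabytes rather than terabytes was intended:

request_memory = 8GB
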
But in many cases, this is a good place to start looking if you are having problems matching. Bonus: Automatic Formatting Output \u00b6 Do this exercise only if you have time, though it's pretty awesome! There is a way to select the specific job attributes you want condor_q to tell you about with the -autoformat or -af option. In this case, HTCondor decides for you how to format the data you ask for from job ClassAd(s). (To tell HTCondor how to specially format this information, yourself, you could use the -format option, which we're not covering.) To use autoformatting, use the -af option followed by the attribute name, for each attribute that you want to output: username@ap1 $ condor_q -all -af Owner ClusterId Cmd moate 2418 /share/test.sh cat 2421 /bin/sleep cat 2422 /bin/sleep Bonus Question : If you wanted to print out the Requirements expression of a job, how would you do that with -af ? Is the output what you expected? (HINT: for ClassAd attributes like \"Requirements\" that are long expressions, instead of plain values, you can use -af:r to view the expressions, instead of what it's current evaluation.) References \u00b6 As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_q man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the HTCondor Manual","title":"Bonus Exercise 1.8 - Explore condor_q"},{"location":"materials/htcondor/part1-ex8-queue/#bonus-htc-exercise-18-explore-condor_q","text":"The goal of this exercise is try out some of the most common options to the condor_q command, so that you can view jobs effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_q expert!","title":"Bonus HTC Exercise 1.8: Explore condor_q"},{"location":"materials/htcondor/part1-ex8-queue/#selecting-jobs","text":"The condor_q program has many options for selecting which jobs are listed. You have already seen that the default mode is to show only your jobs in \"batch\" mode: username@ap1 $ condor_q You've seen that you can view all jobs (all users) in the submit node's queue by using the -all argument: username@ap1 $ condor_q -all And you've seen that you can view more details about queued jobs, with each separate job on a single line using the -nobatch option: username@ap1 $ condor_q -nobatch username@ap1 $ condor_q -all -nobatch Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? username@ap1 $ condor_q To list just the jobs associated with a single cluster number: username@ap1 $ condor_q For example, if you want to see the jobs in cluster 5678 (i.e., 5678.0 , 5678.1 , etc.), you use condor_q 5678 . To list a specific job (i.e., cluster.process, as in 5678.0): username@ap1 $ condor_q For example, to see job ID 5678.1, you use condor_q 5678.1 . Note You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for all of the named clusters and/or job IDs are listed. Let\u2019s get some practice using condor_q selections! Using a previous exercise, submit several sleep jobs. List all jobs in the queue \u2014 are there others besides your own? Practice using all forms of condor_q that you have learned: List just your jobs, with and without batching. List a specific cluster. List a specific job ID. Try listing several users at once. 
Try listing several clusters and job IDs at once. When there are a variety of jobs in the queue, try combining a username and a different user's cluster or job ID in the same command \u2014 what happens?","title":"Selecting Jobs"},{"location":"materials/htcondor/part1-ex8-queue/#viewing-a-job-classad","text":"You may have wondered why it is useful to be able to list a single job ID using condor_q . By itself, it may not be that useful. But, in combination with another option, it is very useful! If you add the -long option to condor_q (or its short form, -l ), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80\u201390 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: username@ap1 $ condor_q -long The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from condor_q output (except with some line breaks added to the Requirements attribute): MyType = \"Job\" Err = \"sleep.err\" UserLog = \"/home/cat/intro-2.1-queue/sleep.log\" Requirements = ( IsOSGSchoolSlot =?= true ) && ( TARGET.Arch == \"X86_64\" ) && ( TARGET.OpSys == \"LINUX\" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer ) ClusterId = 2420 WhenToTransferOutput = \"ON_EXIT\" Owner = \"cat\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Out = \"sleep.out\" Cmd = \"/bin/sleep\" Arguments = \"120\" Note Attributes are listed in no particular order and may change from time to time. Do not assume anything about the order of attributes in condor_q output. See what you can find in a job ClassAd from your own job. Using a previous exercise, submit a sleep job that sleeps for at least 3 minutes (180 seconds). Before the job executes, capture its ClassAd and save to a file: condor_q -l > classad-1.txt After the job starts execution but before it finishes, capture its ClassAd again and save to a file condor_q -l > classad-2.txt Now examine each saved ClassAd file. Here are a few things to look for: Can you find attributes that came from your submit file? (E.g., Cmd, Arguments, Out, Err, UserLog, and so forth) Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements) How many of the following attributes can you guess the meaning of? DiskUsage ImageSize BytesSent JobStatus","title":"Viewing a Job ClassAd"},{"location":"materials/htcondor/part1-ex8-queue/#why-is-my-job-not-running","text":"Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help. To ask HTCondor why your job is not running, add the -better-analyze option to condor_q for the specific job. For example, for job ID 2423.0, the command is: username@ap1 $ condor_q -better-analyze 2423 .0 Of course, replace the job ID with your own. Let\u2019s submit a job that will never run and see what happens. 
Here is the submit file to use: executable = /bin/hostname output = norun.out error = norun.err log = norun.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_disk = 10MB request_memory = 8TB queue (Do you see what I did?) Save and submit this file. Run condor_q -better-analyze on the job ID. There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with some lines omitted): -- Schedd: ap1.facility.path-cc.io : <128.105.68.66:9618?... ... Job 98096.000 defines the following attributes: RequestDisk = 10240 RequestMemory = 8388608 The Requirements expression for job 98096.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [1] 11227 Target.OpSysMajorVer == 7 [9] 13098 TARGET.Disk >= RequestDisk [11] 0 TARGET.Memory >= RequestMemory No successful match recorded. Last failed match: Fri Jul 12 15:36:30 2019 Reason for last match failure: no match found 98096.000: Run analysis summary ignoring user priority. Of 710 machines, 710 are rejected by your job's requirements 0 reject your job because of their own requirements 0 match and are already running your jobs 0 match but are serving other users 0 are able to run your job ... At the end of the summary, condor_q provides a breakdown of how machines and their own requirements match against my own job's requirements. 710 total machines were considered above, and all of them were rejected based on my job's requirements . In other words, I am asking for something that is not available. But what? Further up in the output, there is an analysis of the job's requirements, along with how many slots within the pool match each of those requirements. The example above reports that 13098 slots match our small disk request request, but none of the slots matched the TARGET.Memory >= RequestMemory condition. The output also reports the value used for the RequestMemory attribute: my job asked for 8 terabytes of memory (8,388,608 MB) -- of course no machines matched that part of the expression! That's a lot of memory on today's machines. The output from condor_q -analyze (and condor_q -better-analyze ) may be helpful or it may not be, depending on your exact case. The example above was constructed so that it would be obvious what the problem was. But in many cases, this is a good place to start looking if you are having problems matching.","title":"Why Is My Job Not Running?"},{"location":"materials/htcondor/part1-ex8-queue/#bonus-automatic-formatting-output","text":"Do this exercise only if you have time, though it's pretty awesome! There is a way to select the specific job attributes you want condor_q to tell you about with the -autoformat or -af option. In this case, HTCondor decides for you how to format the data you ask for from job ClassAd(s). (To tell HTCondor how to specially format this information, yourself, you could use the -format option, which we're not covering.) To use autoformatting, use the -af option followed by the attribute name, for each attribute that you want to output: username@ap1 $ condor_q -all -af Owner ClusterId Cmd moate 2418 /share/test.sh cat 2421 /bin/sleep cat 2422 /bin/sleep Bonus Question : If you wanted to print out the Requirements expression of a job, how would you do that with -af ? Is the output what you expected? 
(HINT: for ClassAd attributes like \"Requirements\" that are long expressions, instead of plain values, you can use -af:r to view the expressions, instead of what it's current evaluation.)","title":"Bonus: Automatic Formatting Output"},{"location":"materials/htcondor/part1-ex8-queue/#references","text":"As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_q man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the HTCondor Manual","title":"References"},{"location":"materials/htcondor/part1-ex9-status/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 1.9: Explore condor_status \u00b6 The goal of this exercise is try out some of the most common options to the condor_status command, so that you can view slots effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_status expert! Selecting Slots \u00b6 The condor_status program has many options for selecting which slots are listed. You've already learned the basic condor_status and the condor_status -compact variation (which you may wish to retry now, before proceeding). Another convenient option is to list only those slots that are available now: username@ap1 $ condor_status -avail Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all condor_status output, not just with the -avail option. Similar to condor_q , you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: username@ap1 $ condor_status For example, if you want to see the slots on e2337.chtc.wisc.edu (in the CHTC pool): username@ap1 $ condor_status e2337.chtc.wisc.edu To list a specific slot on a machine: username@ap1 $ condor_status @ For example, to see the \u201cfirst\u201d slot on the machine above: username@ap1 $ condor_status slot1@e2337.chtc.wisc.edu Note You can name more than one hostname, slot, or combination thereof on the command line, in which case slots for all of the named hostnames and/or slots are listed. Let\u2019s get some practice using condor_status selections! List all slots in the pool \u2014 how many are there total? Practice using all forms of condor_status that you have learned: List the available slots. List the slots on a specific machine (e.g., e2337.chtc.wisc.edu ). List a specific slot from that machine. Try listing the slots from a few (but not all) machines at once. Try using a mix of hostnames and slot IDs at once. Viewing a Slot ClassAd \u00b6 Just as with condor_q , you can use condor_status to view the complete ClassAd for a given slot (often confusingly called the \u201cmachine\u201d ad): username@ap1 $ condor_status -long @ Because slot ClassAds may have 150\u2013200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. 
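A concrete version of that command, reusing the example machine named earlier in this exercise, would be the following (whether that particular slot exists depends on the pool at the moment you try it):

username@ap1 $ condor_status -long slot1@e2337.chtc.wisc.edu
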
Here are some examples of common, interesting attributes taken directly from condor_status output: OpSys = \"LINUX\" DetectedCpus = 24 OpSysAndVer = \"SL6\" MyType = \"Machine\" LoadAvg = 0.99 TotalDisk = 798098404 OSIssue = \"Scientific Linux release 6.6 (Carbon)\" TotalMemory = 24016 Machine = \"e242.chtc.wisc.edu\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Memory = 1024 As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name \u201cmachine ad\u201d) and about the slot in particular. Go ahead and examine a machine ClassAd now. Viewing Slots by ClassAd Expression \u00b6 Often, it is helpful to view slots that meet some particular criteria. For example, if you know that your job needs a lot of memory to run, you may want to see how many high-memory slots there are and whether they are busy. You can filter the list of slots like this using the -constraint option and a ClassAd expression. For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: username@ap1 $ condor_status -constraint 'OpSysAndVer == \"CentOS7\" && Memory >= 16000' Note Be very careful with using quote characters appropriately in these commands. In the example above, the single quotes ( ' ) are for the shell, so that the entire expression is passed to condor_status untouched, and the double quotes ( \" ) surround a string value within the expression itself. Currently on PATh, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the condor_status -constraint command. Note The condor_q command accepts the -constraint option as well! As you might expect, the option allows you to limit the jobs that are listed based on a ClassAd expression. Bonus: Formatting Output \u00b6 The condor_status command accepts the same -autoformat ( -af ) options that condor_q accepts, and the options have the same meanings in both commands. Of course, the attributes available in machine ads may differ from the ones that are available in job ads. Use the HTCondor Manual or look at individual slot ClassAds to get a better idea of what attributes are available. For example, I was curious about the host name and operating system of the slots with more than 32GB of memory: username@ap1 $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000' If you like, spend a few minutes now or later experimenting with condor_status formatting. References \u00b6 As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_status man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the appendix of the HTCondor Manual Read about ClassAd expressions in section 4.1.4 of the HTCondor Manual","title":"Bonus Exercise 1.9- Explore condor_stataus"},{"location":"materials/htcondor/part1-ex9-status/#bonus-htc-exercise-19-explore-condor_status","text":"The goal of this exercise is try out some of the most common options to the condor_status command, so that you can view slots effectively. 
The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_status expert!","title":"Bonus HTC Exercise 1.9: Explore condor_status"},{"location":"materials/htcondor/part1-ex9-status/#selecting-slots","text":"The condor_status program has many options for selecting which slots are listed. You've already learned the basic condor_status and the condor_status -compact variation (which you may wish to retry now, before proceeding). Another convenient option is to list only those slots that are available now: username@ap1 $ condor_status -avail Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all condor_status output, not just with the -avail option. Similar to condor_q , you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: username@ap1 $ condor_status For example, if you want to see the slots on e2337.chtc.wisc.edu (in the CHTC pool): username@ap1 $ condor_status e2337.chtc.wisc.edu To list a specific slot on a machine: username@ap1 $ condor_status @ For example, to see the \u201cfirst\u201d slot on the machine above: username@ap1 $ condor_status slot1@e2337.chtc.wisc.edu Note You can name more than one hostname, slot, or combination thereof on the command line, in which case slots for all of the named hostnames and/or slots are listed. Let\u2019s get some practice using condor_status selections! List all slots in the pool \u2014 how many are there total? Practice using all forms of condor_status that you have learned: List the available slots. List the slots on a specific machine (e.g., e2337.chtc.wisc.edu ). List a specific slot from that machine. Try listing the slots from a few (but not all) machines at once. Try using a mix of hostnames and slot IDs at once.","title":"Selecting Slots"},{"location":"materials/htcondor/part1-ex9-status/#viewing-a-slot-classad","text":"Just as with condor_q , you can use condor_status to view the complete ClassAd for a given slot (often confusingly called the \u201cmachine\u201d ad): username@ap1 $ condor_status -long @ Because slot ClassAds may have 150\u2013200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. Here are some examples of common, interesting attributes taken directly from condor_status output: OpSys = \"LINUX\" DetectedCpus = 24 OpSysAndVer = \"SL6\" MyType = \"Machine\" LoadAvg = 0.99 TotalDisk = 798098404 OSIssue = \"Scientific Linux release 6.6 (Carbon)\" TotalMemory = 24016 Machine = \"e242.chtc.wisc.edu\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Memory = 1024 As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name \u201cmachine ad\u201d) and about the slot in particular. Go ahead and examine a machine ClassAd now.","title":"Viewing a Slot ClassAd"},{"location":"materials/htcondor/part1-ex9-status/#viewing-slots-by-classad-expression","text":"Often, it is helpful to view slots that meet some particular criteria. For example, if you know that your job needs a lot of memory to run, you may want to see how many high-memory slots there are and whether they are busy. You can filter the list of slots like this using the -constraint option and a ClassAd expression. 
For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: username@ap1 $ condor_status -constraint 'OpSysAndVer == \"CentOS7\" && Memory >= 16000' Note Be very careful with using quote characters appropriately in these commands. In the example above, the single quotes ( ' ) are for the shell, so that the entire expression is passed to condor_status untouched, and the double quotes ( \" ) surround a string value within the expression itself. Currently on PATh, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the condor_status -constraint command. Note The condor_q command accepts the -constraint option as well! As you might expect, the option allows you to limit the jobs that are listed based on a ClassAd expression.","title":"Viewing Slots by ClassAd Expression"},{"location":"materials/htcondor/part1-ex9-status/#bonus-formatting-output","text":"The condor_status command accepts the same -autoformat ( -af ) options that condor_q accepts, and the options have the same meanings in both commands. Of course, the attributes available in machine ads may differ from the ones that are available in job ads. Use the HTCondor Manual or look at individual slot ClassAds to get a better idea of what attributes are available. For example, I was curious about the host name and operating system of the slots with more than 32GB of memory: username@ap1 $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000' If you like, spend a few minutes now or later experimenting with condor_status formatting.","title":"Bonus: Formatting Output"},{"location":"materials/htcondor/part1-ex9-status/#references","text":"As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_status man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the appendix of the HTCondor Manual Read about ClassAd expressions in section 4.1.4 of the HTCondor Manual","title":"References"},{"location":"materials/htcondor/part2-ex1-files/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.1: Work With Input and Output Files \u00b6 Exercise Goal \u00b6 The goal of this exercise is make input files available to your job on the execute machine and to return output files back created in your job back to you on the access point. This small change significantly adds to the kinds of jobs that you can run. Viewing a Job Sandbox \u00b6 Before you learn to transfer files to and from your job, it is good to understand a bit more about the environment in which your job runs. When the HTCondor starter process prepares to run your job, it creates a new directory for your job and all of its files. We call this directory the job sandbox , because it is your job\u2019s private space to play. Let\u2019s see what is in the job sandbox for a minimal job with no special input or output files. 
Save the script below in a file named sandbox.sh : #!/bin/sh echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'Sandbox: ' ` pwd ` ls -alF # END Create a submit file for this script and submit it. When the job finishes, look at the contents of the output file. In the output file, note the Sandbox: line: That is the full path to your job sandbox for the run. It was created just for your job, and it was removed as soon as your job finished. Next, look at the output that appears after the Sandbox: line; it is the output from the ls command in the script. It shows all of the files in your job sandbox, as they existed at the end of the execution of sandbox.sh . The number of files that you see can change depending on the HTC system you are using, but some of the files you should always see are: .chirp.config Configuration for an advanced feature sandbox.sh Your executable .job.ad The job ClassAd .machine.ad The machine ClassAd _condor_stderr Saved standard error from the job _condor_stdout Saved standard output from the job tmp/ , var/tmp/ Directories in which to put temporary files So, HTCondor wrote copies of the job and machine ads (for use by the job, if desired), transferred your executable ( sandbox.sh ), ran it, and saved its standard output and standard error into files. Notice that your submit file, which was in the same directory on the access point machine as your executable, was not transferred, nor were any other files that happened to be in directory with the submit file. Now that we know something about the sandbox, we can transfer more files to and from it. Running a Job With Input Files \u00b6 Next, you will run a job that requires an input file. Remember, the initial job sandbox will contain only the job executable, unless you tell HTCondor explicitly about every other file that needs to be transferred to the job. Here is a Python script that takes the name of an input file (containing one word per line) from the command line, counts the number of times each (lowercased) word occurs in the text, and prints out the final list of words and their counts. #!/usr/bin/env python3 import os import sys if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' , encoding = 'iso-8859-1' ) as my_file : for line in my_file : word = line . strip () . lower () if word in words : words [ word ] += 1 else : words [ word ] = 1 for word in sorted ( words . keys ()): print ( f ' { words [ word ] : 8d } { word } ' ) Create and save the Python script in a file named freq.py . Download the input file for the script (263K lines, ~1.4 MB) and save it in your submit directory: username@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/intro-2.1-words.txt Create a submit file for the freq.py executable. Add a line called transfer_input_files = to tell HTCondor to transfer the input file to the job: transfer_input_files = intro-2.1-words.txt As with all submit file commands, it does not matter where this line goes, as long as it comes before the word queue . Since we want HTCondor to pass an argument to our Python executable, we need to remember to add an arguments = line in our submit file so that HTCondor knows to pass an argument to the job. Set this arguments = line equal to the argument to the Python script (i.e., the name the input file). Submit the job to HTCondor, wait for it to finish, and check the output! 
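For reference, a submit file along these lines should work for this job (the resource requests are just reasonable guesses, and the output, error, and log file names are up to you):

executable = freq.py
arguments = intro-2.1-words.txt
transfer_input_files = intro-2.1-words.txt
output = freq.out
error = freq.err
log = freq.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus = 1
request_memory = 500MB
request_disk = 20MB
queue
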
If things do not work the first time, keep trying! At this point in the exercises, we are telling you less and less explicitly how to do steps that you have done before. If you get stuck, ask for help in the Slack channel. Note If you want to transfer more than one input file, list all of them on a single transfer_input_files command, separated by commas. For example, if there are three input files: transfer_input_files = a.txt, b.txt, c.txt Transferring Output Files \u00b6 So far, we have relied on programs that send their output to the standard output and error streams, which HTCondor captures, saves, and returns back to the submit directory. But what if your program writes one or more files for its output? How do you tell HTCondor to bring them back? Let\u2019s start by exploring what happens to files that a job creates in the sandbox. We will use a very simple method for creating a new file: we will copy an input file to another name. Find or create a small input file (it is fine to use any small file from a previous exercise). Create a submit file that transfers the input file and copies it to another name (as if doing /bin/cp input.txt output.txt on the command line) Make the output filename different than any filenames that are in your submit directory What is the executable line? What is the arguments line? How do you tell HTCondor to transfer the input file? As always, use output , error , and log filenames that are different from previous exercises Submit the job and wait for it to finish. What happened? Can you tell what HTCondor did with the output file that was created (did it end up back on the access point?), after it was created in the job sandbox? Look carefully at the list of files in your submit directory now. Transferring Specific Output Files \u00b6 As you saw in the last exercise, by default HTCondor transfers files that are created in the job sandbox back to the submit directory when the job finishes. In fact, HTCondor will also transfer back changed input files, too. But, this only works for files that are in the top-level sandbox directory, and not for ones contained in subdirectories. What if you want to bring back only some output files, or output files contained in subdirectories? Here is a shell script that creates several files, including a copy of an input file in a new subdirectory: #!/bin/sh if [ $# -ne 1 ] ; then echo \"Usage: $0 INPUT\" ; exit 1 ; fi date > output-timestamp.txt cal > output-calendar.txt mkdir subdirectory cp $1 subdirectory/backup- $1 First, let\u2019s confirm that HTCondor does not bring back the output file (which starts with the prefix backup- ) in the subdirectory: Create a file called output.sh and save the above shell script in this file. Write a submit file that transfers any input file and runs output.sh on it (remember to include an arguments = line and pass the input filename as an argument). Submit the job, wait for it to finish, and examine the contents of your submit directory. Suppose you decide that you want only the timestamp output file and all files in the subdirectory, but not the calendar output file. You can tell HTCondor to only transfer these specific files back to the submission directory using transfer_output_files = : transfer_output_files = output-timestamp.txt, subdirectory/ When using transfer_output_files = , HTCondor will only transfer back the files you name - all other files will be ignored and deleted at the end of a job. Note See the trailing slash ( / ) on the subdirectory? 
That tells HTCondor to transfer back the files contained in the subdirectory, but not the directory itself ; the files will be written directly into the submit directory. If you want HTCondor to transfer back an entire directory, leave off the trailing slash. Remove all output files from the previous run, including output-timestamp.txt and output-calendar.txt . Copy the previous submit file that ran output.sh and add the transfer_output_files line from above. Submit the job, wait for it to finish, and examine the contents of your submit directory. Did it work as you expected? Thinking About Progress So Far \u00b6 At this point, you can do just about everything that you need in order to run jobs on a HTC pool. You can identify the executable, arguments, and input files, and you can get output back from the job. This is a big achievement! References \u00b6 There are many more details about HTCondor\u2019s file transfer mechanism not covered here. For more information, read \"Submitting Jobs Without a Shared Filesystem\" in the HTCondor Manual.","title":"2.1 - Work with input and output files"},{"location":"materials/htcondor/part2-ex1-files/#htc-exercise-21-work-with-input-and-output-files","text":"","title":"HTC Exercise 2.1: Work With Input and Output Files"},{"location":"materials/htcondor/part2-ex1-files/#exercise-goal","text":"The goal of this exercise is make input files available to your job on the execute machine and to return output files back created in your job back to you on the access point. This small change significantly adds to the kinds of jobs that you can run.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex1-files/#viewing-a-job-sandbox","text":"Before you learn to transfer files to and from your job, it is good to understand a bit more about the environment in which your job runs. When the HTCondor starter process prepares to run your job, it creates a new directory for your job and all of its files. We call this directory the job sandbox , because it is your job\u2019s private space to play. Let\u2019s see what is in the job sandbox for a minimal job with no special input or output files. Save the script below in a file named sandbox.sh : #!/bin/sh echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'Sandbox: ' ` pwd ` ls -alF # END Create a submit file for this script and submit it. When the job finishes, look at the contents of the output file. In the output file, note the Sandbox: line: That is the full path to your job sandbox for the run. It was created just for your job, and it was removed as soon as your job finished. Next, look at the output that appears after the Sandbox: line; it is the output from the ls command in the script. It shows all of the files in your job sandbox, as they existed at the end of the execution of sandbox.sh . The number of files that you see can change depending on the HTC system you are using, but some of the files you should always see are: .chirp.config Configuration for an advanced feature sandbox.sh Your executable .job.ad The job ClassAd .machine.ad The machine ClassAd _condor_stderr Saved standard error from the job _condor_stdout Saved standard output from the job tmp/ , var/tmp/ Directories in which to put temporary files So, HTCondor wrote copies of the job and machine ads (for use by the job, if desired), transferred your executable ( sandbox.sh ), ran it, and saved its standard output and standard error into files. 
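As a reminder, the submit file for this sandbox job can be quite minimal; one possible sketch (the log, output, and error filenames and the request_* values are just examples) is:

    # sketch of a minimal submit file for sandbox.sh
    executable = sandbox.sh
    log = sandbox.log
    output = sandbox.out
    error = sandbox.err
    request_cpus = 1
    request_memory = 1GB
    request_disk = 1GB
    queue
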
Notice that your submit file, which was in the same directory on the access point machine as your executable, was not transferred, nor were any other files that happened to be in directory with the submit file. Now that we know something about the sandbox, we can transfer more files to and from it.","title":"Viewing a Job Sandbox"},{"location":"materials/htcondor/part2-ex1-files/#running-a-job-with-input-files","text":"Next, you will run a job that requires an input file. Remember, the initial job sandbox will contain only the job executable, unless you tell HTCondor explicitly about every other file that needs to be transferred to the job. Here is a Python script that takes the name of an input file (containing one word per line) from the command line, counts the number of times each (lowercased) word occurs in the text, and prints out the final list of words and their counts. #!/usr/bin/env python3 import os import sys if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' , encoding = 'iso-8859-1' ) as my_file : for line in my_file : word = line . strip () . lower () if word in words : words [ word ] += 1 else : words [ word ] = 1 for word in sorted ( words . keys ()): print ( f ' { words [ word ] : 8d } { word } ' ) Create and save the Python script in a file named freq.py . Download the input file for the script (263K lines, ~1.4 MB) and save it in your submit directory: username@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/intro-2.1-words.txt Create a submit file for the freq.py executable. Add a line called transfer_input_files = to tell HTCondor to transfer the input file to the job: transfer_input_files = intro-2.1-words.txt As with all submit file commands, it does not matter where this line goes, as long as it comes before the word queue . Since we want HTCondor to pass an argument to our Python executable, we need to remember to add an arguments = line in our submit file so that HTCondor knows to pass an argument to the job. Set this arguments = line equal to the argument to the Python script (i.e., the name the input file). Submit the job to HTCondor, wait for it to finish, and check the output! If things do not work the first time, keep trying! At this point in the exercises, we are telling you less and less explicitly how to do steps that you have done before. If you get stuck, ask for help in the Slack channel. Note If you want to transfer more than one input file, list all of them on a single transfer_input_files command, separated by commas. For example, if there are three input files: transfer_input_files = a.txt, b.txt, c.txt","title":"Running a Job With Input Files"},{"location":"materials/htcondor/part2-ex1-files/#transferring-output-files","text":"So far, we have relied on programs that send their output to the standard output and error streams, which HTCondor captures, saves, and returns back to the submit directory. But what if your program writes one or more files for its output? How do you tell HTCondor to bring them back? Let\u2019s start by exploring what happens to files that a job creates in the sandbox. We will use a very simple method for creating a new file: we will copy an input file to another name. Find or create a small input file (it is fine to use any small file from a previous exercise). 
Create a submit file that transfers the input file and copies it to another name (as if doing /bin/cp input.txt output.txt on the command line) Make the output filename different than any filenames that are in your submit directory What is the executable line? What is the arguments line? How do you tell HTCondor to transfer the input file? As always, use output , error , and log filenames that are different from previous exercises Submit the job and wait for it to finish. What happened? Can you tell what HTCondor did with the output file that was created (did it end up back on the access point?), after it was created in the job sandbox? Look carefully at the list of files in your submit directory now.","title":"Transferring Output Files"},{"location":"materials/htcondor/part2-ex1-files/#transferring-specific-output-files","text":"As you saw in the last exercise, by default HTCondor transfers files that are created in the job sandbox back to the submit directory when the job finishes. In fact, HTCondor will also transfer back changed input files, too. But, this only works for files that are in the top-level sandbox directory, and not for ones contained in subdirectories. What if you want to bring back only some output files, or output files contained in subdirectories? Here is a shell script that creates several files, including a copy of an input file in a new subdirectory: #!/bin/sh if [ $# -ne 1 ] ; then echo \"Usage: $0 INPUT\" ; exit 1 ; fi date > output-timestamp.txt cal > output-calendar.txt mkdir subdirectory cp $1 subdirectory/backup- $1 First, let\u2019s confirm that HTCondor does not bring back the output file (which starts with the prefix backup- ) in the subdirectory: Create a file called output.sh and save the above shell script in this file. Write a submit file that transfers any input file and runs output.sh on it (remember to include an arguments = line and pass the input filename as an argument). Submit the job, wait for it to finish, and examine the contents of your submit directory. Suppose you decide that you want only the timestamp output file and all files in the subdirectory, but not the calendar output file. You can tell HTCondor to only transfer these specific files back to the submission directory using transfer_output_files = : transfer_output_files = output-timestamp.txt, subdirectory/ When using transfer_output_files = , HTCondor will only transfer back the files you name - all other files will be ignored and deleted at the end of a job. Note See the trailing slash ( / ) on the subdirectory? That tells HTCondor to transfer back the files contained in the subdirectory, but not the directory itself ; the files will be written directly into the submit directory. If you want HTCondor to transfer back an entire directory, leave off the trailing slash. Remove all output files from the previous run, including output-timestamp.txt and output-calendar.txt . Copy the previous submit file that ran output.sh and add the transfer_output_files line from above. Submit the job, wait for it to finish, and examine the contents of your submit directory. Did it work as you expected?","title":"Transferring Specific Output Files"},{"location":"materials/htcondor/part2-ex1-files/#thinking-about-progress-so-far","text":"At this point, you can do just about everything that you need in order to run jobs on a HTC pool. You can identify the executable, arguments, and input files, and you can get output back from the job. 
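For reference, a complete submit file that pulls these pieces together for the output.sh exercise might look like this sketch (the input filename and the log, output, and error names are assumptions):

    # sketch of a submit file for output.sh with selected output transfer
    executable = output.sh
    arguments = input.txt
    transfer_input_files = input.txt
    transfer_output_files = output-timestamp.txt, subdirectory/
    log = output-exercise.log
    output = output-exercise.out
    error = output-exercise.err
    request_cpus = 1
    request_memory = 1GB
    request_disk = 1GB
    queue
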
This is a big achievement!","title":"Thinking About Progress So Far"},{"location":"materials/htcondor/part2-ex1-files/#references","text":"There are many more details about HTCondor\u2019s file transfer mechanism not covered here. For more information, read \"Submitting Jobs Without a Shared Filesystem\" in the HTCondor Manual.","title":"References"},{"location":"materials/htcondor/part2-ex2-queue-n/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.2: Use queue N , $(Cluster), and $(Process) \u00b6 Background \u00b6 Suppose you have a program that you want to run many times with different arguments each time. With what you know so far, you have a couple of choices: Write one submit file; submit one job, change the argument in the submit file, submit another job, change the submit file, \u2026 Write many submit files that are nearly identical except for the program argument Neither of these options seems very satisfying. Fortunately, HTCondor's queue statement is here to help! Exercise Goal \u00b6 The goal of the next several exercises is to learn to submit many jobs from a single HTCondor queue statement, and to control things like filenames and arguments on a per-job basis when doing so. Running Many Jobs With One queue Statement \u00b6 Example Here is a C program that uses a stochastic (random) method to estimate the value of \u03c0. The single argument to the program is the number of samples to take. More samples should result in better estimates! #include #include #include int main ( int argc , char * argv []) { struct timeval my_timeval ; int iterations = 0 ; int inside_circle = 0 ; int i ; double x , y , pi_estimate ; gettimeofday ( & my_timeval , NULL ); srand48 ( my_timeval . tv_sec ^ my_timeval . tv_usec ); if ( argc == 2 ) { iterations = atoi ( argv [ 1 ]); } else { printf ( \"usage: circlepi ITERATIONS \\n \" ); exit ( 1 ); } for ( i = 0 ; i < iterations ; i ++ ) { x = ( drand48 () - 0.5 ) * 2.0 ; y = ( drand48 () - 0.5 ) * 2.0 ; if ((( x * x ) + ( y * y )) <= 1.0 ) { inside_circle ++ ; } } pi_estimate = 4.0 * (( double ) inside_circle / ( double ) iterations ); printf ( \"%d iterations, %d inside; pi = %f \\n \" , iterations , inside_circle , pi_estimate ); return 0 ; } In a new directory for this exercise, create and save the code to a file named circlepi.c Compile the code (we will cover this in more detail during the Software lecture): username@ap1 $ gcc -o circlepi circlepi.c Test the program with just 1000 samples: username@ap1 $ ./circlepi 1000 Now suppose that you want to run the program many times, to produce many estimates. To do so, we can tell HTCondor how many jobs to \"queue up\" via the queue statement we've been putting at the end of each of our submit files. Let\u2019s see how it works: Write a normal submit file for this program Pass 1 million ( 1000000 ) as the command line argument to circlepi Make sure to include log , output , and error (with filenames like circlepi.log ), and request_* lines At the end of the file, write queue 3 instead of just queue (\"queue 3 jobs\" vs. \"queue a job\"). Submit the file. Note the slightly different message from condor_submit : 3 job(s) submitted to cluster *NNNN*. 
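If you want to check your submit file against an example, here is one possible sketch for the circlepi cluster (the request_* values and filenames are assumptions):

    # sketch of a submit file that queues three circlepi jobs
    executable = circlepi
    arguments = 1000000
    log = circlepi.log
    output = circlepi.out
    error = circlepi.err
    request_cpus = 1
    request_memory = 1GB
    request_disk = 1GB
    queue 3
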
Before the jobs execute, look at the job queue to see the multiple jobs Here is some sample condor_q -nobatch output: ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 10228.0 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.1 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.2 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 In this sample, all three jobs are part of cluster 10228 , but the first job was assigned process 0 , the second job was assigned process 1 , and the third one was assigned process 2 . (Programmers like to start counting from 0.) Now we can understand what the first column in the output, the job ID , represents. It is a job\u2019s cluster number , a dot ( . ), and the job\u2019s process number . So in the example above, the job ID of the second job is 10228.1 . Pop Quiz: Do you remember how to ask HTCondor's queue to list the status of all of the jobs from one cluster? How about one specific job ID? Using queue N With Output \u00b6 When all three jobs in your single cluster are finished, examine the resulting files. What is in the output file? What is in the error file? (hopefully it is empty!) What is in the log file? Look carefully at the job IDs in each event. Is this what you expected? Is it what you wanted? If the output is not what you expected, what do you think happened? Using $(Process) to Distinguish Jobs \u00b6 As you saw with the experiment above, each job ended up overwriting the same output and error filenames in the submission directory. After all, we didn't tell it to behave any differently when it ran three jobs. We need a way to separate output (and error) files per job that is queued , not just for the whole cluster of jobs. Fortunately, HTCondor has a way to separate the files easily. When processing a submit file, HTCondor will replace any instance of $(Process) with the process number of the job, for each job that is queued. For example, you can use the $(Process) variable to define a separate output file name for each job: output = my-output-file-$(Process).out queue 10 Even though the output filename is defined only once, HTCondor will create separate output filenames for each job: First job my-output-file-0.out Second job my-output-file-1.out Third job my-output-file-2.out ... ... Last (tenth) job my-output-file-9.out Let\u2019s see how this works for our program that estimates \u03c0. In your submit file, change the definitions of output and error to use $(Process) in the filename, similar to the example above. Delete any standard output, standard error, and log files from previous runs. Submit the updated file. When all three jobs are finished, examine the resulting files again. How many files are there of each type? What are their names? Is this what you expected? Is it what you wanted from the \u03c0 estimation process? Using $(Cluster) to Separate Files Across Runs \u00b6 With $(Process) , you can get separate output (and error) filenames for each job within a run. However, the next time you submit the same file, all of the output and error files are overwritten by new ones created by the new jobs. Maybe this is the behavior that you want. But sometimes, you may want to separate files by run, as well. In addition to $(Process) , there is also a $(Cluster) variable that you can use in your submit files. It works just like $(Process) , except it is replaced with the cluster number of the entire submission. 
Because the cluster number is the same for all jobs within a single submission, it does not separate files by job within a submission. But when used with $(Process) , it can be used to separate files by run. For example, consider this output statement: output = my-output-file-$(Cluster)-$(Process).out For one particular run, it might result in output filenames like my-output-file-2444-0.out , myoutput-file-2444-1.out , myoutput-file-2444-2.out , etc. However, the next run would have different filenames, replacing 2444 with the new Cluster number of that run. Using $(Process) and $(Cluster) in Other Statements \u00b6 The $(Cluster) and $(Process) variables can be used in any submit file statement, although they are useful in some kinds of submit file statements and not really for others. For example, consider using $(Cluster) or $(Process) in each of the below: log transfer_input_files transfer_output_files arguments Unfortunately, HTCondor does not easily let you perform math on the $(Process) number when using it. So, for example, if you use $(Process) as a numeric argument to a command, it will always result in jobs getting the arguments 0, 1, 2, and so on. If you have control over your program and the way in which it uses command-line arguments, then you are fine. Otherwise, you might need a solution like those in the next exercises. (Optional) Defining JobBatchName for Tracking \u00b6 It is possible to define arbitrary attributes in your submit file, and that one purpose of such attributes is to track or report on different jobs separately. In this optional exercise, you will see how this technique can be used. Once again, we will use sleep jobs, so that your jobs remain in the queue long enough to experiment on. Create a submit file that runs sleep 120 . Instead of a single queue statement, write this: jobbatchname = 1 queue 5 Submit the submit file to HTCondor. Now, quickly edit the submit file to instead say: jobbatchname = 2 Submit the file again. Check on the submissions using a normal condor_q and condor_q -nobatch . Of course, your special attribute does not appear in the condor_q -nobatch output, but it is present in the condor_q output and in each job\u2019s ClassAd. You can see the effect of the attribute by limiting your condor_q output to one type of job or another. First, run this command: username@ap1 $ condor_q -constraint 'JobBatchName == \"1\"' Do you get the output that you expected? Using the example command above, how would you list your other five jobs? (There will be more on how to use HTCondor constraints in later exercises.)","title":"2.2 - Use queue N, $(Cluster), and $(Process)"},{"location":"materials/htcondor/part2-ex2-queue-n/#htc-exercise-22-use-queue-n-cluster-and-process","text":"","title":"HTC Exercise 2.2: Use queue N, $(Cluster), and $(Process)"},{"location":"materials/htcondor/part2-ex2-queue-n/#background","text":"Suppose you have a program that you want to run many times with different arguments each time. With what you know so far, you have a couple of choices: Write one submit file; submit one job, change the argument in the submit file, submit another job, change the submit file, \u2026 Write many submit files that are nearly identical except for the program argument Neither of these options seems very satisfying. 
Fortunately, HTCondor's queue statement is here to help!","title":"Background"},{"location":"materials/htcondor/part2-ex2-queue-n/#exercise-goal","text":"The goal of the next several exercises is to learn to submit many jobs from a single HTCondor queue statement, and to control things like filenames and arguments on a per-job basis when doing so.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex2-queue-n/#running-many-jobs-with-one-queue-statement","text":"Example Here is a C program that uses a stochastic (random) method to estimate the value of \u03c0. The single argument to the program is the number of samples to take. More samples should result in better estimates! #include #include #include int main ( int argc , char * argv []) { struct timeval my_timeval ; int iterations = 0 ; int inside_circle = 0 ; int i ; double x , y , pi_estimate ; gettimeofday ( & my_timeval , NULL ); srand48 ( my_timeval . tv_sec ^ my_timeval . tv_usec ); if ( argc == 2 ) { iterations = atoi ( argv [ 1 ]); } else { printf ( \"usage: circlepi ITERATIONS \\n \" ); exit ( 1 ); } for ( i = 0 ; i < iterations ; i ++ ) { x = ( drand48 () - 0.5 ) * 2.0 ; y = ( drand48 () - 0.5 ) * 2.0 ; if ((( x * x ) + ( y * y )) <= 1.0 ) { inside_circle ++ ; } } pi_estimate = 4.0 * (( double ) inside_circle / ( double ) iterations ); printf ( \"%d iterations, %d inside; pi = %f \\n \" , iterations , inside_circle , pi_estimate ); return 0 ; } In a new directory for this exercise, create and save the code to a file named circlepi.c Compile the code (we will cover this in more detail during the Software lecture): username@ap1 $ gcc -o circlepi circlepi.c Test the program with just 1000 samples: username@ap1 $ ./circlepi 1000 Now suppose that you want to run the program many times, to produce many estimates. To do so, we can tell HTCondor how many jobs to \"queue up\" via the queue statement we've been putting at the end of each of our submit files. Let\u2019s see how it works: Write a normal submit file for this program Pass 1 million ( 1000000 ) as the command line argument to circlepi Make sure to include log , output , and error (with filenames like circlepi.log ), and request_* lines At the end of the file, write queue 3 instead of just queue (\"queue 3 jobs\" vs. \"queue a job\"). Submit the file. Note the slightly different message from condor_submit : 3 job(s) submitted to cluster *NNNN*. Before the jobs execute, look at the job queue to see the multiple jobs Here is some sample condor_q -nobatch output: ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 10228.0 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.1 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.2 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 In this sample, all three jobs are part of cluster 10228 , but the first job was assigned process 0 , the second job was assigned process 1 , and the third one was assigned process 2 . (Programmers like to start counting from 0.) Now we can understand what the first column in the output, the job ID , represents. It is a job\u2019s cluster number , a dot ( . ), and the job\u2019s process number . So in the example above, the job ID of the second job is 10228.1 . Pop Quiz: Do you remember how to ask HTCondor's queue to list the status of all of the jobs from one cluster? 
How about one specific job ID?","title":"Running Many Jobs With One queue Statement"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-queue-n-with-output","text":"When all three jobs in your single cluster are finished, examine the resulting files. What is in the output file? What is in the error file? (hopefully it is empty!) What is in the log file? Look carefully at the job IDs in each event. Is this what you expected? Is it what you wanted? If the output is not what you expected, what do you think happened?","title":"Using queue N With Output"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-process-to-distinguish-jobs","text":"As you saw with the experiment above, each job ended up overwriting the same output and error filenames in the submission directory. After all, we didn't tell it to behave any differently when it ran three jobs. We need a way to separate output (and error) files per job that is queued , not just for the whole cluster of jobs. Fortunately, HTCondor has a way to separate the files easily. When processing a submit file, HTCondor will replace any instance of $(Process) with the process number of the job, for each job that is queued. For example, you can use the $(Process) variable to define a separate output file name for each job: output = my-output-file-$(Process).out queue 10 Even though the output filename is defined only once, HTCondor will create separate output filenames for each job: First job my-output-file-0.out Second job my-output-file-1.out Third job my-output-file-2.out ... ... Last (tenth) job my-output-file-9.out Let\u2019s see how this works for our program that estimates \u03c0. In your submit file, change the definitions of output and error to use $(Process) in the filename, similar to the example above. Delete any standard output, standard error, and log files from previous runs. Submit the updated file. When all three jobs are finished, examine the resulting files again. How many files are there of each type? What are their names? Is this what you expected? Is it what you wanted from the \u03c0 estimation process?","title":"Using $(Process) to Distinguish Jobs"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-cluster-to-separate-files-across-runs","text":"With $(Process) , you can get separate output (and error) filenames for each job within a run. However, the next time you submit the same file, all of the output and error files are overwritten by new ones created by the new jobs. Maybe this is the behavior that you want. But sometimes, you may want to separate files by run, as well. In addition to $(Process) , there is also a $(Cluster) variable that you can use in your submit files. It works just like $(Process) , except it is replaced with the cluster number of the entire submission. Because the cluster number is the same for all jobs within a single submission, it does not separate files by job within a submission. But when used with $(Process) , it can be used to separate files by run. For example, consider this output statement: output = my-output-file-$(Cluster)-$(Process).out For one particular run, it might result in output filenames like my-output-file-2444-0.out , myoutput-file-2444-1.out , myoutput-file-2444-2.out , etc. 
However, the next run would have different filenames, replacing 2444 with the new Cluster number of that run.","title":"Using $(Cluster) to Separate Files Across Runs"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-process-and-cluster-in-other-statements","text":"The $(Cluster) and $(Process) variables can be used in any submit file statement, although they are useful in some kinds of submit file statements and not really for others. For example, consider using $(Cluster) or $(Process) in each of the below: log transfer_input_files transfer_output_files arguments Unfortunately, HTCondor does not easily let you perform math on the $(Process) number when using it. So, for example, if you use $(Process) as a numeric argument to a command, it will always result in jobs getting the arguments 0, 1, 2, and so on. If you have control over your program and the way in which it uses command-line arguments, then you are fine. Otherwise, you might need a solution like those in the next exercises.","title":"Using $(Process) and $(Cluster) in Other Statements"},{"location":"materials/htcondor/part2-ex2-queue-n/#optional-defining-jobbatchname-for-tracking","text":"It is possible to define arbitrary attributes in your submit file, and that one purpose of such attributes is to track or report on different jobs separately. In this optional exercise, you will see how this technique can be used. Once again, we will use sleep jobs, so that your jobs remain in the queue long enough to experiment on. Create a submit file that runs sleep 120 . Instead of a single queue statement, write this: jobbatchname = 1 queue 5 Submit the submit file to HTCondor. Now, quickly edit the submit file to instead say: jobbatchname = 2 Submit the file again. Check on the submissions using a normal condor_q and condor_q -nobatch . Of course, your special attribute does not appear in the condor_q -nobatch output, but it is present in the condor_q output and in each job\u2019s ClassAd. You can see the effect of the attribute by limiting your condor_q output to one type of job or another. First, run this command: username@ap1 $ condor_q -constraint 'JobBatchName == \"1\"' Do you get the output that you expected? Using the example command above, how would you list your other five jobs? (There will be more on how to use HTCondor constraints in later exercises.)","title":"(Optional) Defining JobBatchName for Tracking"},{"location":"materials/htcondor/part2-ex3-queue-from/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.3: Submit with \u201cqueue from\u201d \u00b6 Exercise Goals \u00b6 In this exercise and the next one, you will explore more ways to use a single submit file to submit many jobs . The goal of this exercise is to submit many jobs from a single submit file by using the queue ... from syntax to read variable values from a file. Background \u00b6 In all cases of submitting many jobs from a single submit file, the key questions are: What makes each job unique? In other words, there is one job per _____? So, how should you tell HTCondor to distinguish each job? For queue *N* , jobs are distinguished simply by the built-in \"process\" variable. But with the remaining queue forms, you help HTCondor distinguish jobs by other, more meaningful custom variables. Counting Words in Files \u00b6 Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. 
As mentioned in the lecture, HTCondor provides many ways to submit jobs for this task. You could create a separate submit file for each book, and submit all of the files manually, but you'd have a lot of file lines to modify each time (in particular, all five of the last lines before queue below): executable = freq.py request_memory = 1GB request_disk = 20MB should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = AAiW.txt arguments = AAiW.txt output = AAiW.out error = AAiW.err log = AAiW.log queue This would be overly verbose and tedious. Let's do better. Queue Jobs From a List of Values \u00b6 Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the most common N words of a document. However, we want to experiment with different values of N . For this analysis, we will have a new version of the word-frequency counting script. First, we need a new version of the word counting program so that it accepts an extra number as a command line argument and outputs only that many of the most common words. Here is the new code (it's still not important that you understand this code): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 3 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA NUM_WORDS' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] num_words = int ( sys . argv [ 2 ]) words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words [ - num_words :]: print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename: Save the script as wordcount-top-n.py . Download and unpack some books from Project Gutenberg: user@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/books.zip user@ap1 $ unzip books.zip Create a new submit file (or base it off a previous one!) named wordcount-top.sub , including memory and disk requests of 20 MB. All of the jobs will use the same executable and log statements. Update other statements to work with two variables, book and n : output = $(book)_top_$(n).out error = $(book)_top_$(n).err transfer_input_files = $(book) arguments = \"$(book) $(n)\" queue book, n from books_n.txt Note especially the changes to the queue statement; it now tells HTCondor to read a separate text file of pairs of values, which will be assigned to book and n respectively. Create the separate text file of job variable values and save it as books_n.txt : AAiW.txt, 10 AAiW.txt, 25 AAiW.txt, 50 PandP.txt, 10 PandP.txt, 25 PandP.txt, 50 TAoSH.txt, 10 TAoSH.txt, 25 TAoSH.txt, 50 Note that we used 3 different values for n for each book. Submit the file Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created? Extra Challenge 1 \u00b6 You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., AAiW.txt ), the output filenames contain with multiple extensions (e.g., AAiW.txt.err ). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. 
Change your submit file and variable file for this exercise so that the output filenames do not include the .txt extension.","title":"2.3 - Use queue from with custom variables"},{"location":"materials/htcondor/part2-ex3-queue-from/#htc-exercise-23-submit-with-queue-from","text":"","title":"HTC Exercise 2.3: Submit with \u201cqueue from\u201d"},{"location":"materials/htcondor/part2-ex3-queue-from/#exercise-goals","text":"In this exercise and the next one, you will explore more ways to use a single submit file to submit many jobs . The goal of this exercise is to submit many jobs from a single submit file by using the queue ... from syntax to read variable values from a file.","title":"Exercise Goals"},{"location":"materials/htcondor/part2-ex3-queue-from/#background","text":"In all cases of submitting many jobs from a single submit file, the key questions are: What makes each job unique? In other words, there is one job per _____? So, how should you tell HTCondor to distinguish each job? For queue *N* , jobs are distinguished simply by the built-in \"process\" variable. But with the remaining queue forms, you help HTCondor distinguish jobs by other, more meaningful custom variables.","title":"Background"},{"location":"materials/htcondor/part2-ex3-queue-from/#counting-words-in-files","text":"Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. As mentioned in the lecture, HTCondor provides many ways to submit jobs for this task. You could create a separate submit file for each book, and submit all of the files manually, but you'd have a lot of file lines to modify each time (in particular, all five of the last lines before queue below): executable = freq.py request_memory = 1GB request_disk = 20MB should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = AAiW.txt arguments = AAiW.txt output = AAiW.out error = AAiW.err log = AAiW.log queue This would be overly verbose and tedious. Let's do better.","title":"Counting Words in Files"},{"location":"materials/htcondor/part2-ex3-queue-from/#queue-jobs-from-a-list-of-values","text":"Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the most common N words of a document. However, we want to experiment with different values of N . For this analysis, we will have a new version of the word-frequency counting script. First, we need a new version of the word counting program so that it accepts an extra number as a command line argument and outputs only that many of the most common words. Here is the new code (it's still not important that you understand this code): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 3 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA NUM_WORDS' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] num_words = int ( sys . argv [ 2 ]) words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words [ - num_words :]: print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename: Save the script as wordcount-top-n.py . 
Download and unpack some books from Project Gutenberg: user@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/books.zip user@ap1 $ unzip books.zip Create a new submit file (or base it off a previous one!) named wordcount-top.sub , including memory and disk requests of 20 MB. All of the jobs will use the same executable and log statements. Update other statements to work with two variables, book and n : output = $(book)_top_$(n).out error = $(book)_top_$(n).err transfer_input_files = $(book) arguments = \"$(book) $(n)\" queue book, n from books_n.txt Note especially the changes to the queue statement; it now tells HTCondor to read a separate text file of pairs of values, which will be assigned to book and n respectively. Create the separate text file of job variable values and save it as books_n.txt : AAiW.txt, 10 AAiW.txt, 25 AAiW.txt, 50 PandP.txt, 10 PandP.txt, 25 PandP.txt, 50 TAoSH.txt, 10 TAoSH.txt, 25 TAoSH.txt, 50 Note that we used 3 different values for n for each book. Submit the file Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created?","title":"Queue Jobs From a List of Values"},{"location":"materials/htcondor/part2-ex3-queue-from/#extra-challenge-1","text":"You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., AAiW.txt ), the output filenames contain with multiple extensions (e.g., AAiW.txt.err ). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. Change your submit file and variable file for this exercise so that the output filenames do not include the .txt extension.","title":"Extra Challenge 1"},{"location":"materials/htcondor/part2-ex4-queue-matching/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 2.4: Submit With \u201cqueue matching\u201d \u00b6 Exercise Goal \u00b6 The goal of this exercise is to submit many jobs from a single submit file by using the queue ... matching syntax to submit jobs with variable values derived from files in the current directory which match a specified pattern. Counting Words in Files \u00b6 Returning to our book word-counting example, let's pretend that instead of three books, we have an entire library. While we could list all of the text files in a books.txt file and use queue book from books.txt , it could be a tedious process, especially for tens of thousands of files. Luckily HTCondor provides a mechanism for submitting jobs based on pattern-matched files. Queue Jobs By Matching Filenames \u00b6 This is an example of a common scenario: We want to run one job per file, where the filenames match a certain consistent pattern. The queue ... matching statement is made for this scenario. Let\u2019s see this in action. First, here is a new version of the script (note, we removed the 'top n words' restriction): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . 
itemgetter ( 1 )) for word in sorted_words : print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To use the script: Create and save this script as wordcount.py . Verify the script by running it on one book manually. Create a new submit file to submit one job (pick a book file and model your submit file off of the one above) Modify the following submit file statements to work for all books: transfer_input_files = $(book) arguments = $(book) output = $(book).out error = $(book).err queue book matching *.txt Note As always, the order of statements in a submit file does not matter, except that the queue statement should be last. Also note that any submit file variable name (here, book , but true for process and all others) may be used in any mixture of upper- and lowercase letters. Submit the jobs. HTCondor uses the queue ... matching statement to look for files in the submit directory that match the given pattern, then queues one job per match. For each job, the given variable (e.g., book here) is assigned the name of the matching file, so that it can be used in output , error , and other statements. The result is the same as if we had written out a much longer submit file: ... transfer_input_files = AAiW.txt arguments = \"AAiW.txt\" output = AAiW.txt.out error = AAiW.txt.err queue transfer_input_files = PandP.txt arguments = \"PandP.txt\" output = PandP.txt.out error = PandP.txt.err queue transfer_input_files = TAoSH.txt arguments = \"TAoSH.txt\" output = TAoSH.txt.out error = TAoSH.txt.err queue ... How many jobs were created? Is this what you expected? If you ran this in the same directory as Exercise 2.3, you may have noticed that a job was submitted for the books_n.txt file that holds the variable values in the queue from statement. Beware the dangers of matching more files than intended! One solution may be to put all of the books into an books directory and queue matching books/*.txt . Can you think of other solutions? If you have time, try one! Extra Challenge 1 \u00b6 In the example above, you used a single log file for all three jobs. HTCondor handles this situation with no problem; each job writes its events into the log file without getting in the way of other events and other jobs. But as you may have seen, it may be difficult for a person to understand the events for any particular job in the combined log file. Create a new submit file that works just like the one above, except that each job writes its own log file. Extra Challenge 2 \u00b6 Between this exercise and the previous one, you have explored two of the three primary queue statements. How would you use the queue in ... list statement to accomplish the same thing(s) as one or both of the exercises?","title":"Bonus Exercise 2.4 - Use queue matching with a custom variable"},{"location":"materials/htcondor/part2-ex4-queue-matching/#bonus-htc-exercise-24-submit-with-queue-matching","text":"","title":"Bonus HTC Exercise 2.4: Submit With \u201cqueue matching\u201d"},{"location":"materials/htcondor/part2-ex4-queue-matching/#exercise-goal","text":"The goal of this exercise is to submit many jobs from a single submit file by using the queue ... matching syntax to submit jobs with variable values derived from files in the current directory which match a specified pattern.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex4-queue-matching/#counting-words-in-files","text":"Returning to our book word-counting example, let's pretend that instead of three books, we have an entire library. 
While we could list all of the text files in a books.txt file and use queue book from books.txt , it could be a tedious process, especially for tens of thousands of files. Luckily HTCondor provides a mechanism for submitting jobs based on pattern-matched files.","title":"Counting Words in Files"},{"location":"materials/htcondor/part2-ex4-queue-matching/#queue-jobs-by-matching-filenames","text":"This is an example of a common scenario: We want to run one job per file, where the filenames match a certain consistent pattern. The queue ... matching statement is made for this scenario. Let\u2019s see this in action. First, here is a new version of the script (note, we removed the 'top n words' restriction): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words : print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To use the script: Create and save this script as wordcount.py . Verify the script by running it on one book manually. Create a new submit file to submit one job (pick a book file and model your submit file off of the one above) Modify the following submit file statements to work for all books: transfer_input_files = $(book) arguments = $(book) output = $(book).out error = $(book).err queue book matching *.txt Note As always, the order of statements in a submit file does not matter, except that the queue statement should be last. Also note that any submit file variable name (here, book , but true for process and all others) may be used in any mixture of upper- and lowercase letters. Submit the jobs. HTCondor uses the queue ... matching statement to look for files in the submit directory that match the given pattern, then queues one job per match. For each job, the given variable (e.g., book here) is assigned the name of the matching file, so that it can be used in output , error , and other statements. The result is the same as if we had written out a much longer submit file: ... transfer_input_files = AAiW.txt arguments = \"AAiW.txt\" output = AAiW.txt.out error = AAiW.txt.err queue transfer_input_files = PandP.txt arguments = \"PandP.txt\" output = PandP.txt.out error = PandP.txt.err queue transfer_input_files = TAoSH.txt arguments = \"TAoSH.txt\" output = TAoSH.txt.out error = TAoSH.txt.err queue ... How many jobs were created? Is this what you expected? If you ran this in the same directory as Exercise 2.3, you may have noticed that a job was submitted for the books_n.txt file that holds the variable values in the queue from statement. Beware the dangers of matching more files than intended! One solution may be to put all of the books into an books directory and queue matching books/*.txt . Can you think of other solutions? If you have time, try one!","title":"Queue Jobs By Matching Filenames"},{"location":"materials/htcondor/part2-ex4-queue-matching/#extra-challenge-1","text":"In the example above, you used a single log file for all three jobs. HTCondor handles this situation with no problem; each job writes its events into the log file without getting in the way of other events and other jobs. 
But as you may have seen, it may be difficult for a person to understand the events for any particular job in the combined log file. Create a new submit file that works just like the one above, except that each job writes its own log file.","title":"Extra Challenge 1"},{"location":"materials/htcondor/part2-ex4-queue-matching/#extra-challenge-2","text":"Between this exercise and the previous one, you have explored two of the three primary queue statements. How would you use the queue in ... list statement to accomplish the same thing(s) as one or both of the exercises?","title":"Extra Challenge 2"},{"location":"materials/osg/part1-ex1-login-scp/","text":"OSG Exercise 1.1: Log In to the OSPool Access Point \u00b6 The main goal of this exercise is to log in to an Open Science Pool Access Point so that you can start submitting jobs into the OSPool. But before doing that, you will first prepare a file on Monday\u2018s Access Point to copy to the OSPool Access Point. Then you will learn how to efficiently copy files between the Access Points. If you have trouble getting ssh access to the OSPool Access Point, ask the instructors right away! Gaining access is critical for all remaining exercises. Part 1: On the PATh Access Point \u00b6 The first few sections below are to be completed on ap1.facility.path-cc.io , the PATh Access Point. This is still the same Access Point you have been using since yesterday. Preparing files for transfer \u00b6 When transferring files between computers, it\u2019s best to limit the number of files as well as their size. Smaller files transfer more quickly and, if your network connection fails, restarting the transfer is less painful than it would be if you were transferring large files. Archiving tools (WinZip, 7zip, Archive Utility, etc.) can compress the size of your files and place them into a single, smaller archive file. The Unix tar command is a one-stop shop for creating, extracting, and viewing the contents of tar archives (called tarballs ). Its usage is as follows: To create a tarball named containing , use the following command: $ tar -czvf Where should end in .tar.gz and can be a list of any number of files and/or folders, separated by spaces. To extract the files from a tarball into the current directory: $ tar -xzvf To list the files within a tarball: $ tar -tzvf Comparing compressed sizes \u00b6 You can adjust the level of compression of tar by prepending your command with GZIP=-- , where can be either fast for the least compression, or best for the most compression (the default compression is between best and fast ). While still logged in to ap1.facility.path-cc.io : Create and change into a new folder for this exercise, for example osg-ex11 Use wget to download the following files from our web server: Text file: http://proxy.chtc.wisc.edu/SQUID/osgschool21/random_text Archive: http://proxy.chtc.wisc.edu/SQUID/osgschool21/pdbaa.tar.gz Image: http://proxy.chtc.wisc.edu/SQUID/osgschool21/obligatory_cat.jpg Use tar on each file and use ls -l to compare the sizes of the original file and the compressed version. Which files were compressed the least? Why? Part 2: On the Open Science Pool Access Point \u00b6 For many of the remaining exercises, you will be using an OSPool Access Point, ap40.uw.osg-htc.org , which submits jobs into the OSPool. To log in to the OSPool Access Point, use the same username (and SSH key, if you did that) as on ap1 . If you have any issues logging in to ap40.uw.osg-htc.org , please ask for help right away! 
So please ssh in to the server and take a look around: Log in using ssh USERNAME@ap40.uw.osg-htc.org (substitute your own username) Try some Linux and HTCondor commands; for example: Linux commands: hostname , pwd , ls , and so on What is the operating system? uname and (in this case) cat /etc/redhat-release HTCondor commands: condor_version , condor_q , condor_status -total Transferring files \u00b6 In the next exercise, you will submit the same kind of job as in the previous exercise. Wouldn\u2019t it be nice to copy the files instead of starting from scratch? And in general, being able to copy files between servers is helpful, so let\u2019s explore a way to do that. Using secure copy \u00b6 Secure copy ( scp ) is a command based on SSH that lets you securely copy files between two different servers. It takes similar arguments to the Unix cp command but also takes additional information about servers. Its general form is like this: scp ... [username@]: may be omitted if you want to copy your sources to your remote home directory and [username@] may be omitted if your usernames are the same across both servers. For example, if you are logged in to ap40.uw.osg-htc.org and wanted to copy the file foo from your current directory to your home directory on ap1.facility.path-cc.io , and if your usernames are the same on both servers, the command would look like this: $ scp foo ap1.facility.path-cc.io: Additionally, you could pull files from ap1.facility.path-cc.io to ap40.uw.osg-htc.org . The following command copies bar from your home directory on ap1.facility.path-cc.io to your current directory on ap40.uw.osg-htc.org ; and in this case, the username for ap1 is specified: $ scp USERNAME@ap1.facility.path-cc.io:bar . Also, you can copy folders between servers using the -r option. If you kept all your files from the HTCondor exercise 1.3 in a folder named htc-1.3 on ap1.facility.path-cc.io , you could use the following command to copy them to your home directory on ap40.uw.osg-htc.org : $ scp -r USERNAME@ap1.facility.path-cc.io:htc-1.3 . Secure copy to your laptop \u00b6 During your research, you may need to transfer output files from your submit server to inspect them on your personal computer, which can also be done with scp ! To use scp on your laptop, follow the instructions relevant to your computer\u2018s operating system: Mac and Linux users \u00b6 scp should be included by default and available via the terminal on both Mac and Linux operating systems. Windows users \u00b6 WinSCP is an scp client for Windows operating systems. Install WinSCP from https://winscp.net/eng/index.php Next exercise \u00b6 Once completed, move onto the next exercise: Running jobs in the OSG","title":"1.1 - Log in to the OSPool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#osg-exercise-11-log-in-to-the-ospool-access-point","text":"The main goal of this exercise is to log in to an Open Science Pool Access Point so that you can start submitting jobs into the OSPool. But before doing that, you will first prepare a file on Monday\u2018s Access Point to copy to the OSPool Access Point. Then you will learn how to efficiently copy files between the Access Points. If you have trouble getting ssh access to the OSPool Access Point, ask the instructors right away! 
Gaining access is critical for all remaining exercises.","title":"OSG Exercise 1.1: Log In to the OSPool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#part-1-on-the-path-access-point","text":"The first few sections below are to be completed on ap1.facility.path-cc.io , the PATh Access Point. This is still the same Access Point you have been using since yesterday.","title":"Part 1: On the PATh Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#preparing-files-for-transfer","text":"When transferring files between computers, it\u2019s best to limit the number of files as well as their size. Smaller files transfer more quickly and, if your network connection fails, restarting the transfer is less painful than it would be if you were transferring large files. Archiving tools (WinZip, 7zip, Archive Utility, etc.) can compress the size of your files and place them into a single, smaller archive file. The Unix tar command is a one-stop shop for creating, extracting, and viewing the contents of tar archives (called tarballs ). Its usage is as follows: To create a tarball named containing , use the following command: $ tar -czvf Where should end in .tar.gz and can be a list of any number of files and/or folders, separated by spaces. To extract the files from a tarball into the current directory: $ tar -xzvf To list the files within a tarball: $ tar -tzvf ","title":"Preparing files for transfer"},{"location":"materials/osg/part1-ex1-login-scp/#comparing-compressed-sizes","text":"You can adjust the level of compression of tar by prepending your command with GZIP=-- , where can be either fast for the least compression, or best for the most compression (the default compression is between best and fast ). While still logged in to ap1.facility.path-cc.io : Create and change into a new folder for this exercise, for example osg-ex11 Use wget to download the following files from our web server: Text file: http://proxy.chtc.wisc.edu/SQUID/osgschool21/random_text Archive: http://proxy.chtc.wisc.edu/SQUID/osgschool21/pdbaa.tar.gz Image: http://proxy.chtc.wisc.edu/SQUID/osgschool21/obligatory_cat.jpg Use tar on each file and use ls -l to compare the sizes of the original file and the compressed version. Which files were compressed the least? Why?","title":"Comparing compressed sizes"},{"location":"materials/osg/part1-ex1-login-scp/#part-2-on-the-open-science-pool-access-point","text":"For many of the remaining exercises, you will be using an OSPool Access Point, ap40.uw.osg-htc.org , which submits jobs into the OSPool. To log in to the OSPool Access Point, use the same username (and SSH key, if you did that) as on ap1 . If you have any issues logging in to ap40.uw.osg-htc.org , please ask for help right away! So please ssh in to the server and take a look around: Log in using ssh USERNAME@ap40.uw.osg-htc.org (substitute your own username) Try some Linux and HTCondor commands; for example: Linux commands: hostname , pwd , ls , and so on What is the operating system? uname and (in this case) cat /etc/redhat-release HTCondor commands: condor_version , condor_q , condor_status -total","title":"Part 2: On the Open Science Pool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#transferring-files","text":"In the next exercise, you will submit the same kind of job as in the previous exercise. Wouldn\u2019t it be nice to copy the files instead of starting from scratch? 
And in general, being able to copy files between servers is helpful, so let\u2019s explore a way to do that.","title":"Transferring files"},{"location":"materials/osg/part1-ex1-login-scp/#using-secure-copy","text":"Secure copy ( scp ) is a command based on SSH that lets you securely copy files between two different servers. It takes similar arguments to the Unix cp command but also takes additional information about servers. Its general form is like this: scp ... [username@]: may be omitted if you want to copy your sources to your remote home directory and [username@] may be omitted if your usernames are the same across both servers. For example, if you are logged in to ap40.uw.osg-htc.org and wanted to copy the file foo from your current directory to your home directory on ap1.facility.path-cc.io , and if your usernames are the same on both servers, the command would look like this: $ scp foo ap1.facility.path-cc.io: Additionally, you could pull files from ap1.facility.path-cc.io to ap40.uw.osg-htc.org . The following command copies bar from your home directory on ap1.facility.path-cc.io to your current directory on ap40.uw.osg-htc.org ; and in this case, the username for ap1 is specified: $ scp USERNAME@ap1.facility.path-cc.io:bar . Also, you can copy folders between servers using the -r option. If you kept all your files from the HTCondor exercise 1.3 in a folder named htc-1.3 on ap1.facility.path-cc.io , you could use the following command to copy them to your home directory on ap40.uw.osg-htc.org : $ scp -r USERNAME@ap1.facility.path-cc.io:htc-1.3 .","title":"Using secure copy"},{"location":"materials/osg/part1-ex1-login-scp/#secure-copy-to-your-laptop","text":"During your research, you may need to transfer output files from your submit server to inspect them on your personal computer, which can also be done with scp ! To use scp on your laptop, follow the instructions relevant to your computer\u2018s operating system:","title":"Secure copy to your laptop"},{"location":"materials/osg/part1-ex1-login-scp/#mac-and-linux-users","text":"scp should be included by default and available via the terminal on both Mac and Linux operating systems.","title":"Mac and Linux users"},{"location":"materials/osg/part1-ex1-login-scp/#windows-users","text":"WinSCP is an scp client for Windows operating systems. Install WinSCP from https://winscp.net/eng/index.php","title":"Windows users"},{"location":"materials/osg/part1-ex1-login-scp/#next-exercise","text":"Once completed, move onto the next exercise: Running jobs in the OSG","title":"Next exercise"},{"location":"materials/osg/part1-ex2-submit-osg/","text":"OSG Exercise 1.2: Running Jobs in OSPool \u00b6 The goal of this exercise is to map the physical locations of some Execution Points in the OSPool. We will provide the executable and associated data, so your job will be to write a submit file that queues multiple jobs. Once complete, you will manually collate the results. Where in the world are my jobs? \u00b6 To find the physical location of the computers your jobs our running on, you will use a method called geolocation . Geolocation uses a registry to match a computer\u2019s network address to an approximate latitude and longitude. Geolocating several Execution Points \u00b6 Now, let\u2019s try to remember some basic HTCondor ideas from the HTC exercises: Log in to ap40.uw.osg-htc.org if you have not yet. 
Create and change into a new folder for this exercise, for example osg-ex12 Download the geolocation code: $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool21/location-wrapper.sh \\ http://proxy.chtc.wisc.edu/SQUID/osgschool21/wn-geoip.tar.gz You will be using location-wrapper.sh as your executable and wn-geoip.tar.gz as an input file. Create a submit file that queues fifty jobs that run location-wrapper.sh , transfers wn-geoip.tar.gz as an input file, and uses the $(Process) macro to write different output and error files. Also, add the following requirement to the submit file (it\u2019s not important to know what it does): requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) Try to do this step without looking at materials from the earlier exercises. But if you are stuck, see HTC Exercise 2.2 . Submit your jobs and wait for the results Collating your results \u00b6 Now that you have your results, it\u2019s time to summarize them. Rather than inspecting each output file individually, you can use the cat command to print the results from all of your output files at once. If all of your output files have the format location-#.out (e.g., location-10.out ), your command will look something like this: $ cat location-*.out The * is a wildcard so the above cat command runs on all files that start with location- and end in .out . Additionally, you can use cat in combination with the sort and uniq commands using \"pipes\" ( | ) to print only the unique results: $ cat location-*.out | sort | uniq Mapping your results \u00b6 To visualize the locations of the Execution Points that your jobs ran on, you will be using http://www.mapcustomizer.com/ . Copy and paste the collated results into the text box that pops up when clicking on the 'Bulk Entry' button on the right-hand side. Where did your jobs run? Next exercise \u00b6 Once completed, move onto the next exercise: Hardware Differences in the OSG Extra Challenge: Cleaning up your submit directory \u00b6 If you run ls in the directory from which you submitted your job, you may see that you now have thousands of files! Proper data management starts to become a requirement as you start to develop true HTC workflows; it may be helpful to separate your submit files, code, and input data from your output data. Try editing your submit file so that all your output and error files are saved to separate directories within your submit directory. Tip Experiment with fewer job submissions until you\u2019re confident you have it right, then go back to submitting 500 jobs. Remember: Test small and scale up! Submit your file and track the status of your jobs. Did your jobs complete successfully with output and error files saved in separate directories? If not, can you find any useful information in the job logs or hold messages? If you get stuck, review the slides from Tuesday .","title":"1.2 - Running jobs in the OSPool"},{"location":"materials/osg/part1-ex2-submit-osg/#osg-exercise-12-running-jobs-in-ospool","text":"The goal of this exercise is to map the physical locations of some Execution Points in the OSPool. We will provide the executable and associated data, so your job will be to write a submit file that queues multiple jobs. 
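If you get stuck writing it, a minimal sketch of such a submit file might look like the following (the output, error, and log file names are arbitrary choices, and the resource requests are reasonable guesses rather than required values): executable = location-wrapper.sh transfer_input_files = wn-geoip.tar.gz requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) output = location-$(Process).out error = location-$(Process).err log = location.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue 50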
Once complete, you will manually collate the results.\",\"title\":\"OSG Exercise 1.2: Running Jobs in OSPool\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#where-in-the-world-are-my-jobs\",\"text\":\"To find the physical location of the computers your jobs are running on, you will use a method called geolocation . Geolocation uses a registry to match a computer\u2019s network address to an approximate latitude and longitude.\",\"title\":\"Where in the world are my jobs?\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#geolocating-several-execution-points\",\"text\":\"Now, let\u2019s try to remember some basic HTCondor ideas from the HTC exercises: Log in to ap40.uw.osg-htc.org if you have not yet. Create and change into a new folder for this exercise, for example osg-ex12 Download the geolocation code: $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool21/location-wrapper.sh \\ http://proxy.chtc.wisc.edu/SQUID/osgschool21/wn-geoip.tar.gz You will be using location-wrapper.sh as your executable and wn-geoip.tar.gz as an input file. Create a submit file that queues fifty jobs that run location-wrapper.sh , transfers wn-geoip.tar.gz as an input file, and uses the $(Process) macro to write different output and error files. Also, add the following requirement to the submit file (it\u2019s not important to know what it does): requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) Try to do this step without looking at materials from the earlier exercises. But if you are stuck, see HTC Exercise 2.2 . Submit your jobs and wait for the results\",\"title\":\"Geolocating several Execution Points\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#collating-your-results\",\"text\":\"Now that you have your results, it\u2019s time to summarize them. Rather than inspecting each output file individually, you can use the cat command to print the results from all of your output files at once. If all of your output files have the format location-#.out (e.g., location-10.out ), your command will look something like this: $ cat location-*.out The * is a wildcard so the above cat command runs on all files that start with location- and end in .out . Additionally, you can use cat in combination with the sort and uniq commands using \"pipes\" ( | ) to print only the unique results: $ cat location-*.out | sort | uniq\",\"title\":\"Collating your results\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#mapping-your-results\",\"text\":\"To visualize the locations of the Execution Points that your jobs ran on, you will be using http://www.mapcustomizer.com/ . Copy and paste the collated results into the text box that pops up when clicking on the 'Bulk Entry' button on the right-hand side. Where did your jobs run?\",\"title\":\"Mapping your results\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#next-exercise\",\"text\":\"Once completed, move onto the next exercise: Hardware Differences in the OSG\",\"title\":\"Next exercise\"},{\"location\":\"materials/osg/part1-ex2-submit-osg/#extra-challenge-cleaning-up-your-submit-directory\",\"text\":\"If you run ls in the directory from which you submitted your job, you may see that you now have thousands of files! Proper data management starts to become a requirement as you start to develop true HTC workflows; it may be helpful to separate your submit files, code, and input data from your output data. Try editing your submit file so that all your output and error files are saved to separate directories within your submit directory.
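For example, the relevant submit file lines might look something like this (a sketch; the directory names output-files and error-files are arbitrary, and the directories must already exist before you submit, for example via mkdir output-files error-files ): output = output-files/location-$(Process).out error = error-files/location-$(Process).err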
Tip Experiment with fewer job submissions until you\u2019re confident you have it right, then go back to submitting 500 jobs. Remember: Test small and scale up! Submit your file and track the status of your jobs. Did your jobs complete successfully with output and error files saved in separate directories? If not, can you find any useful information in the job logs or hold messages? If you get stuck, review the slides from Tuesday .","title":"Extra Challenge: Cleaning up your submit directory"},{"location":"materials/osg/part1-ex3-hardware-diffs/","text":"OSG Exercise 1.3: Hardware Differences Between PATh and OSG \u00b6 The goal of this exercise is to compare hardware differences between the Monday cluster (the PATh Facility) and the Open Science Pool. Specifically, we will look at how easy it is to get access to resources in terms of the amount of memory that is requested. This will not be a very careful study, but should give you some idea of one way in which the pools are different. In the first two parts of the exercise, you will submit batches of jobs that differ only in how much memory each one requests. This is called this a parameter sweep , in that we are testing many possible values of a parameter. We will request memory from 8\u201364 GB, doubling the memory each time. One set of jobs will be submitted to the PATh Facility, and the other, identical set of jobs will be submitted to the OSPool. You will check the queue periodically to see how many jobs have completed and how many are still waiting to run. Checking PATh memory availability \u00b6 In this first part, you will create the submit file that will be used for both the PATh and OSPool jobs, then submit the PATh set. Yet another queue syntax \u00b6 Earlier, you learned about the queue statement and some of the different ways it can be invoked to submit multiple jobs. Similar to the queue from statement to submit jobs based on lines from a specific file, you can use queue in to submit jobs based on a list that is written directly in your submit file: queue <# of jobs> in ( ... ) For example, to submit 6 total jobs that sleep for 5 , 5 , 10 , 10 , 15 , and 15 seconds, you could write the following submit file: executable = /bin/sleep request_cpus = 1 request_memory = 1MB request_disk = 1MB queue 2 arguments in ( 5 10 15 ) Try submitting this yourself and verify that all six jobs are in the queue, using the condor_q -nobatch command. Create the submit file \u00b6 To create our parameter sweep, we will create a new submit file with the queue\u2026in syntax and change the value of our parameter ( request_memory ) for each batch of jobs. Log in or switch back to ap1.facility.path-cc.io (yes, back to PATh!) Create and change into a new subdirectory called osg-ex13 Create a submit file named sleep.sub that executes the command /bin/sleep 300 . Note If you do not remember all of the submit statements to write this file, or just to go faster, find a similar submit file from a previous exercise. Copy the file and rename it here, and make sure the argument to sleep is 300 . Use the queue\u2026in syntax to submit 10 jobs each for the following memory requests: 8, 16, 32, and 64 GB. There will be 40 jobs total: 10 jobs requesting 8 GB, 10 requesting 16 GB, etc. Submit your jobs Monitoring the local jobs \u00b6 Every few minutes, run condor_q and see how your sleep jobs are doing. 
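If it is installed on the Access Point (an assumption you can check with which condor_watch_q ), the condor_watch_q tool can refresh this view for you automatically instead of re-running condor_q by hand: $ condor_watch_q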
To display the number of jobs remaining for each request_memory parameter specified, run the following command: $ condor_q -af RequestMemory | sort -n | uniq -c The numbers in the left column are the number of jobs left of that type and the number on the right is the amount of memory you requested, in MB. Consider making a little table like the one below to track progress. Memory Remaining #1 Remaining #2 Remaining #3 8 GB 10 6 16 GB 10 7 32 GB 10 8 64 GB 10 9 In the meantime, between checking on your local jobs, start the next section \u2013 but take a break every few minutes to switch back to ap1 and record progress on your PATh jobs. Checking OSPool memory availability \u00b6 Now you will do essentially the same thing on the OSPool. Log in or switch to ap40.uw.osg-htc.org Copy the osg-ex13 directory from the section above from ap1.facility.path-cc.io to ap40.uw.osg-htc.org If you get stuck during the copying process, refer to OSG exercise 1.1 . Submit the jobs to the OSPool Monitoring the remote jobs \u00b6 As you did in the first part, use condor_q to track how your sleep jobs are doing. It is fine to move on to the next exercise, but keep tracking the status of both sets of these jobs. After you are done with the next exercise , come back to this exercise and analyze the results. Analyzing the results \u00b6 Have all of your jobs from this exercise completed on both PATh and the OSPool? How many jobs have completed thus far on PATh? How many have completed thus far on the OSPool? Due to the dynamic nature of the OSPool, the demand for higher memory jobs there may have resulted in a temporary increase in high-memory slots there. That being said, high-memory are a high-demand, low-availability resource in the OSPool so your 64 GB jobs may have taken longer to run or complete. On the other hand, PATh has a fair number of 64 GB (and greater) slots so all your jobs have a high chance of running.","title":"1.3 - Hardware differences between PATh and OSG"},{"location":"materials/osg/part1-ex3-hardware-diffs/#osg-exercise-13-hardware-differences-between-path-and-osg","text":"The goal of this exercise is to compare hardware differences between the Monday cluster (the PATh Facility) and the Open Science Pool. Specifically, we will look at how easy it is to get access to resources in terms of the amount of memory that is requested. This will not be a very careful study, but should give you some idea of one way in which the pools are different. In the first two parts of the exercise, you will submit batches of jobs that differ only in how much memory each one requests. This is called this a parameter sweep , in that we are testing many possible values of a parameter. We will request memory from 8\u201364 GB, doubling the memory each time. One set of jobs will be submitted to the PATh Facility, and the other, identical set of jobs will be submitted to the OSPool. 
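Concretely, the sweep itself can be written with the queue ... in syntax from this exercise. A sketch of the final line of such a submit file (assuming any fixed request_memory line earlier in the file is removed): queue 10 request_memory in ( 8GB 16GB 32GB 64GB ) This queues ten jobs for each of the four memory requests, forty jobs in total.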
You will check the queue periodically to see how many jobs have completed and how many are still waiting to run.","title":"OSG Exercise 1.3: Hardware Differences Between PATh and OSG"},{"location":"materials/osg/part1-ex3-hardware-diffs/#checking-path-memory-availability","text":"In this first part, you will create the submit file that will be used for both the PATh and OSPool jobs, then submit the PATh set.","title":"Checking PATh memory availability"},{"location":"materials/osg/part1-ex3-hardware-diffs/#yet-another-queue-syntax","text":"Earlier, you learned about the queue statement and some of the different ways it can be invoked to submit multiple jobs. Similar to the queue from statement to submit jobs based on lines from a specific file, you can use queue in to submit jobs based on a list that is written directly in your submit file: queue <# of jobs> in ( ... ) For example, to submit 6 total jobs that sleep for 5 , 5 , 10 , 10 , 15 , and 15 seconds, you could write the following submit file: executable = /bin/sleep request_cpus = 1 request_memory = 1MB request_disk = 1MB queue 2 arguments in ( 5 10 15 ) Try submitting this yourself and verify that all six jobs are in the queue, using the condor_q -nobatch command.","title":"Yet another queue syntax"},{"location":"materials/osg/part1-ex3-hardware-diffs/#create-the-submit-file","text":"To create our parameter sweep, we will create a new submit file with the queue\u2026in syntax and change the value of our parameter ( request_memory ) for each batch of jobs. Log in or switch back to ap1.facility.path-cc.io (yes, back to PATh!) Create and change into a new subdirectory called osg-ex13 Create a submit file named sleep.sub that executes the command /bin/sleep 300 . Note If you do not remember all of the submit statements to write this file, or just to go faster, find a similar submit file from a previous exercise. Copy the file and rename it here, and make sure the argument to sleep is 300 . Use the queue\u2026in syntax to submit 10 jobs each for the following memory requests: 8, 16, 32, and 64 GB. There will be 40 jobs total: 10 jobs requesting 8 GB, 10 requesting 16 GB, etc. Submit your jobs","title":"Create the submit file"},{"location":"materials/osg/part1-ex3-hardware-diffs/#monitoring-the-local-jobs","text":"Every few minutes, run condor_q and see how your sleep jobs are doing. To display the number of jobs remaining for each request_memory parameter specified, run the following command: $ condor_q -af RequestMemory | sort -n | uniq -c The numbers in the left column are the number of jobs left of that type and the number on the right is the amount of memory you requested, in MB. Consider making a little table like the one below to track progress. Memory Remaining #1 Remaining #2 Remaining #3 8 GB 10 6 16 GB 10 7 32 GB 10 8 64 GB 10 9 In the meantime, between checking on your local jobs, start the next section \u2013 but take a break every few minutes to switch back to ap1 and record progress on your PATh jobs.","title":"Monitoring the local jobs"},{"location":"materials/osg/part1-ex3-hardware-diffs/#checking-ospool-memory-availability","text":"Now you will do essentially the same thing on the OSPool. Log in or switch to ap40.uw.osg-htc.org Copy the osg-ex13 directory from the section above from ap1.facility.path-cc.io to ap40.uw.osg-htc.org If you get stuck during the copying process, refer to OSG exercise 1.1 . 
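As a reminder of the syntax, the copy can be run from ap40.uw.osg-htc.org with a command like this (substituting your own username, and assuming you created osg-ex13 directly in your home directory on ap1): $ scp -r USERNAME@ap1.facility.path-cc.io:osg-ex13 .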
Submit the jobs to the OSPool\",\"title\":\"Checking OSPool memory availability\"},{\"location\":\"materials/osg/part1-ex3-hardware-diffs/#monitoring-the-remote-jobs\",\"text\":\"As you did in the first part, use condor_q to track how your sleep jobs are doing. It is fine to move on to the next exercise, but keep tracking the status of both sets of these jobs. After you are done with the next exercise , come back to this exercise and analyze the results.\",\"title\":\"Monitoring the remote jobs\"},{\"location\":\"materials/osg/part1-ex3-hardware-diffs/#analyzing-the-results\",\"text\":\"Have all of your jobs from this exercise completed on both PATh and the OSPool? How many jobs have completed thus far on PATh? How many have completed thus far on the OSPool? Due to the dynamic nature of the OSPool, the demand for higher memory jobs there may have resulted in a temporary increase in high-memory slots there. That being said, high-memory slots are a high-demand, low-availability resource in the OSPool so your 64 GB jobs may have taken longer to run or complete. On the other hand, PATh has a fair number of 64 GB (and greater) slots so all your jobs have a high chance of running.\",\"title\":\"Analyzing the results\"},{\"location\":\"materials/osg/part1-ex4-software-diffs/\",\"text\":\"OSG Exercise 1.4: Software Differences in OSPool \u00b6 The goal of this exercise is to see some differences in the availability of software in the OSPool. At your local cluster, you may be used to having certain versions of software. But in the OSPool, it is likely that the software you need will not be available at all. Comparing operating systems \u00b6 To really see differences between Execution Points in the PATh Facility versus the OSPool, you will want to compare the \u201cmachine\u201d ClassAds between the two pools. Rather than inspecting the very long ClassAd for each Execution Point, you will look at a specific attribute called OpSysAndVer , which tells us the operating system version of the Execution Point. An easy way to show this attribute for all Execution Points is by using condor_status in conjunction with the -autoformat (or -af , for short) option. The -autoformat option is like the -format option you learned about earlier, and outputs the attributes you choose for each slot; but as you may have guessed, it does some automatic formatting for you. So, let\u2019s examine the operating system and (major) version of slots on the PATh Facility and the OSPool. Log in or switch to ap1.facility.path-cc.io and run the following command: $ condor_status -autoformat OpSysAndVer Log in or switch to ap40.uw.osg-htc.org (parallel windows are handy!) and run the same command You will see many values for the operating system and major version. Some are abbreviated \u2014 for example, RedHat stands for \u201cRed Hat Enterprise Linux\u201d and SL stands for \u201cScientific Linux\u201d (a Red Hat variant). The only problem is that with hundreds or thousands of slots, it's difficult to get a feel for the composition of each pool from this output. You can use the sort and uniq commands, in sequence, on the condor_status output to get counts of each unique operating system and version string. Your command line should look something like this: $ condor_status -autoformat OpSysAndVer | sort | uniq -c How would you describe the difference between the PATh Facility and OSPool? Submitting probe jobs \u00b6 Now you have some idea of the diversity of operating systems on the OSPool.
This is a step in the right direction to knowing what software is available in general. But what you really want to know is whether your specific software tool (and version) is available. Software probe code \u00b6 The following shell script probes for software and returns the version if it is installed: #!/bin/sh get_version (){ program = $1 $program --version > /dev/null 2 > & 1 double_dash_rc = $? $program -version > /dev/null 2 > & 1 single_dash_rc = $? which $program > /dev/null 2 > & 1 which_rc = $? if [ $double_dash_rc -eq 0 ] ; then $program --version 2 > & 1 elif [ $single_dash_rc -eq 0 ] ; then $program -version 2 > & 1 elif [ $which_rc -eq 0 ] ; then echo \" $program installed but could not find version information\" else echo \" $program not installed\" fi } get_version 'R' get_version 'cmake' get_version 'python' get_version 'python3' If there's a specific command line program that your research requires, feel free to add it to the script! For example, if you wanted to test for the existence and version of nslookup , you would add the following to the end of the script: get_version 'nslookup' Probing several servers \u00b6 For this part of the exercise, try creating a submit file without referring to previous exercises! Log in or switch to ap40.uw.osg-htc.org Create and change into a new folder for this exercise, e.g. osg-ex14 Save the above script as a file named sw_probe.sh Make sure the script can be run: chmod a+x sw_probe.sh Try running the script in place to make sure it works: ./sw_probe.sh Create a submit file that runs sw_probe.sh 100 times and uses macros to write different output , error , and log files Submit your job and wait for the results Will you be able to do your research on the OSG with what's available? Do not worry if it does not seem like you can: Later today, you will learn how to make your jobs portable enough so that they can run anywhere!","title":"1.4 - Software differences in OSPool"},{"location":"materials/osg/part1-ex4-software-diffs/#osg-exercise-14-software-differences-in-ospool","text":"The goal of this exercise is to see some differences in the availability of software in the OSPool. At your local cluster, you may be used to having certain versions of software. But in the OSPool, it is likely that the software you need will not be available at all.","title":"OSG Exercise 1.4: Software Differences in OSPool"},{"location":"materials/osg/part1-ex4-software-diffs/#comparing-operating-systems","text":"To really see differences between Execution Points in the PATh Facility versus the OSPool, you will want to compare the \u201cmachine\u201d ClassAds between the two pools. Rather than inspecting the very long ClassAd for each Execution Point, you will look at a specific attribute called OpSysAndVer , which tells us the operating system version of the Execution Point. An easy way to show this attribute for all Execution Points is by using condor_status in conjunction with the -autoformat (or -af , for short) option. The -autoformat option is like the -format option you learned about earlier, and outputs the attributes you choose for each slot; but as you may have guessed, it does some automatic formatting for you. So, let\u2019s examine the operating system and (major) version of slots on the PATh Facility and the OSPool. Log in or switch to ap1.facility.path-cc.io and run the following command: $ condor_status -autoformat OpSysAndVer Log in or switch to ap40.uw.osg-htc.org (parallel windows are handy!) 
and run the same command You will see many values for the operating system and major version. Some are abbreviated \u2014 for example, RedHat stands for \u201cRed Hat Enterprise Linux\u201d and SL stands for \u201cScientific Linux\u201d (a Red Hat variant). The only problem is that with hundreds or thousands of slots, it's difficult to get a feel for the composition of each pool from this output. You can use the sort and uniq commands, in sequence, on the condor_status output to get counts of each unique operating system and version string. Your command line should look something like this: $ condor_status -autoformat OpSysAndVer | sort | uniq -c How would you describe the difference between the PATh Facility and OSPool?","title":"Comparing operating systems"},{"location":"materials/osg/part1-ex4-software-diffs/#submitting-probe-jobs","text":"Now you have some idea of the diversity of operating systems on the OSPool. This is a step in the right direction to knowing what software is available in general. But what you really want to know is whether your specific software tool (and version) is available.","title":"Submitting probe jobs"},{"location":"materials/osg/part1-ex4-software-diffs/#software-probe-code","text":"The following shell script probes for software and returns the version if it is installed: #!/bin/sh get_version (){ program = $1 $program --version > /dev/null 2 > & 1 double_dash_rc = $? $program -version > /dev/null 2 > & 1 single_dash_rc = $? which $program > /dev/null 2 > & 1 which_rc = $? if [ $double_dash_rc -eq 0 ] ; then $program --version 2 > & 1 elif [ $single_dash_rc -eq 0 ] ; then $program -version 2 > & 1 elif [ $which_rc -eq 0 ] ; then echo \" $program installed but could not find version information\" else echo \" $program not installed\" fi } get_version 'R' get_version 'cmake' get_version 'python' get_version 'python3' If there's a specific command line program that your research requires, feel free to add it to the script! For example, if you wanted to test for the existence and version of nslookup , you would add the following to the end of the script: get_version 'nslookup'","title":"Software probe code"},{"location":"materials/osg/part1-ex4-software-diffs/#probing-several-servers","text":"For this part of the exercise, try creating a submit file without referring to previous exercises! Log in or switch to ap40.uw.osg-htc.org Create and change into a new folder for this exercise, e.g. osg-ex14 Save the above script as a file named sw_probe.sh Make sure the script can be run: chmod a+x sw_probe.sh Try running the script in place to make sure it works: ./sw_probe.sh Create a submit file that runs sw_probe.sh 100 times and uses macros to write different output , error , and log files Submit your job and wait for the results Will you be able to do your research on the OSG with what's available? Do not worry if it does not seem like you can: Later today, you will learn how to make your jobs portable enough so that they can run anywhere!","title":"Probing several servers"},{"location":"materials/scaling/part1-ex1-organization/","text":"Organizing HTC Workloads \u00b6 Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. This exercise is similar to HTCondor exercise 2.4, in that it is about counting word frequencies in multiple files. But the focus here is on organizing the files more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future. 
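By the end of this exercise, the directory for this workload will contain roughly the following pieces (these are the names used in the steps below): an input directory and an output directory for the data files, logs and errout directories for the HTCondor log, standard output, and standard error files, plus the wordcount.py script, the books.submit submit file, and a booklist.txt list of inputs.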
Log into an OSPool Access Point \u00b6 Make sure you are logged into ap40.uw.osg-htc.org . Get Files \u00b6 To get the files for this exercise: Type wget https://github.com/osg-htc/school-2024/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz to download the tarball. As you learned earlier, expand this tarball file; it will create a organizing-files directory. Change to that directory, or create a separate one for this exercise and copy the files in. Our Workload \u00b6 We can analyze one book by running the wordcount.py script, with the name of the book we want to analyze: $ ./wordcount.py Alice_in_Wonderland.txt Try running the command to see what the output is for the script. Once you have done that delete the output file created ( rm counts.Alice_in_Wonderland.txt ). We want to run this script on all the books we have copies of. What is the input set for this HTC workload? What is the output set? Make an Organization Plan \u00b6 Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point? There will also be system and HTCondor files produced when we submit a job \u2014 how would you organize the log, standard output, and standard error files? Try making those changes before moving on to the next section of the tutorial. Organize Files \u00b6 There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files. Set up this structure on the command line by running: $ mkdir input $ mv *.txt input/ $ mkdir output View the current directory and its subdirectories by using the ls command with the recursive ( -R ) flag: $ ls -R README.md books.submit input output wordcount.py ./input: Alice_in_Wonderland.txt Huckleberry_Finn.txt Dracula.txt Pride_and_Prejudice.txt ./output: Next, create directories for the HTCondor log, standard output, and standard output files (in one directory): $ mkdir logs $ mkdir errout Submit One Job \u00b6 Now we want to submit a test job that uses this organizing scheme, using just one item in our input set \u2014 in this example, we will use the Alice_in_Wonderland.txt file from our input directory. Fill in the incomplete lines of the submit file, as shown below: executable = wordcount.py arguments = Alice_in_Wonderland.txt transfer_input_files = input/Alice_in_Wonderland.txt transfer_output_files = counts.Alice_in_Wonderland.txt transfer_output_remaps = \"counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt\" To tell HTCondor the location of the input file, we need to include the input directory. Also, this submit file uses the transfer_output_remaps feature that you learned about; it will move the output file to the output directory by renaming or remapping it. Next, edit the submit file lines that tell the log, output, and error files where to go: output = logs/job.$(ClusterID).$(ProcID).out error = errout/job.$(ClusterID).$(ProcID).err log = errout/job.$(ClusterID).$(ProcID).log Submit your job and monitor its progress. Submit Multiple Jobs \u00b6 Now, you are ready to submit the whole workload. Create a file with the list of input files (the input set); here, this is the list of the book files to analyze. 
Do this by using the shell ls command and redirecting its output to a file: $ ls input > booklist.txt $ cat booklist.txt Modify the submit file to reference the file of inputs and replace the fixed value ( Alice_in_Wonderland.txt ) with a variable ( $(book) ): executable = wordcount.py arguments = $(book) transfer_input_files = input/$(book) transfer_output_files = counts.$(book) transfer_output_remaps = \"counts.$(book)=output/counts.$(book)\" queue book from booklist.txt Submit the jobs When complete, look at the complete set of input and (now) output files to see how they are organized.","title":"1.1 - Organizing HTC workloads"},{"location":"materials/scaling/part1-ex1-organization/#organizing-htc-workloads","text":"Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. This exercise is similar to HTCondor exercise 2.4, in that it is about counting word frequencies in multiple files. But the focus here is on organizing the files more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future.","title":"Organizing HTC Workloads"},{"location":"materials/scaling/part1-ex1-organization/#log-into-an-ospool-access-point","text":"Make sure you are logged into ap40.uw.osg-htc.org .","title":"Log into an OSPool Access Point"},{"location":"materials/scaling/part1-ex1-organization/#get-files","text":"To get the files for this exercise: Type wget https://github.com/osg-htc/school-2024/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz to download the tarball. As you learned earlier, expand this tarball file; it will create a organizing-files directory. Change to that directory, or create a separate one for this exercise and copy the files in.","title":"Get Files"},{"location":"materials/scaling/part1-ex1-organization/#our-workload","text":"We can analyze one book by running the wordcount.py script, with the name of the book we want to analyze: $ ./wordcount.py Alice_in_Wonderland.txt Try running the command to see what the output is for the script. Once you have done that delete the output file created ( rm counts.Alice_in_Wonderland.txt ). We want to run this script on all the books we have copies of. What is the input set for this HTC workload? What is the output set?","title":"Our Workload"},{"location":"materials/scaling/part1-ex1-organization/#make-an-organization-plan","text":"Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point? There will also be system and HTCondor files produced when we submit a job \u2014 how would you organize the log, standard output, and standard error files? Try making those changes before moving on to the next section of the tutorial.","title":"Make an Organization Plan"},{"location":"materials/scaling/part1-ex1-organization/#organize-files","text":"There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files. 
Set up this structure on the command line by running: $ mkdir input $ mv *.txt input/ $ mkdir output View the current directory and its subdirectories by using the ls command with the recursive ( -R ) flag: $ ls -R README.md books.submit input output wordcount.py ./input: Alice_in_Wonderland.txt Huckleberry_Finn.txt Dracula.txt Pride_and_Prejudice.txt ./output: Next, create directories for the HTCondor log, standard output, and standard output files (in one directory): $ mkdir logs $ mkdir errout","title":"Organize Files"},{"location":"materials/scaling/part1-ex1-organization/#submit-one-job","text":"Now we want to submit a test job that uses this organizing scheme, using just one item in our input set \u2014 in this example, we will use the Alice_in_Wonderland.txt file from our input directory. Fill in the incomplete lines of the submit file, as shown below: executable = wordcount.py arguments = Alice_in_Wonderland.txt transfer_input_files = input/Alice_in_Wonderland.txt transfer_output_files = counts.Alice_in_Wonderland.txt transfer_output_remaps = \"counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt\" To tell HTCondor the location of the input file, we need to include the input directory. Also, this submit file uses the transfer_output_remaps feature that you learned about; it will move the output file to the output directory by renaming or remapping it. Next, edit the submit file lines that tell the log, output, and error files where to go: output = logs/job.$(ClusterID).$(ProcID).out error = errout/job.$(ClusterID).$(ProcID).err log = errout/job.$(ClusterID).$(ProcID).log Submit your job and monitor its progress.","title":"Submit One Job"},{"location":"materials/scaling/part1-ex1-organization/#submit-multiple-jobs","text":"Now, you are ready to submit the whole workload. Create a file with the list of input files (the input set); here, this is the list of the book files to analyze. Do this by using the shell ls command and redirecting its output to a file: $ ls input > booklist.txt $ cat booklist.txt Modify the submit file to reference the file of inputs and replace the fixed value ( Alice_in_Wonderland.txt ) with a variable ( $(book) ): executable = wordcount.py arguments = $(book) transfer_input_files = input/$(book) transfer_output_files = counts.$(book) transfer_output_remaps = \"counts.$(book)=output/counts.$(book)\" queue book from booklist.txt Submit the jobs When complete, look at the complete set of input and (now) output files to see how they are organized.","title":"Submit Multiple Jobs"},{"location":"materials/scaling/part1-ex2-job-attributes/","text":"Exercise 1.2: Investigating Job Attributes \u00b6 The objective of this exercise is to your awareness of job \"class ad attributes\", especially ones that may help you look for issues with your jobs in the OSPool. Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. OSPool jobs contain extra attributes that are specific to that pool. Thus, an OSPool job class ad may have well over 150 attributes. Some OSPool job attributes are especially helpful when you are scaling up jobs and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention. Preparing exercise files \u00b6 Because this exercise focuses on OSPool job attributes, please use your OSPool account on ap40.uw.osg-htc.org . 
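One small setup note: the submit file below writes its log and error files into a logs subdirectory, so it may help to create that directory before submitting, for example with $ mkdir -p logs (the -p flag simply avoids an error if the directory already exists).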
Create a shell script for testing called simple.sh : #!/bin/bash SLEEPTIME=$1 hostname pwd whoami for i in {1..5} do echo \"performing iteration $i\" sleep $SLEEPTIME done Create an HTCondor submit file that queues three jobs: universe = vanilla log = logs/$(Cluster)_$(Process).log error = logs/$(Cluster)_$(Process).err output = $(Cluster)_$(Process).out executable = simple.sh should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1GB # set arguments, queue a normal job arguments = 600 queue 1 # queue a job that will go on hold transfer_input_files = test.txt queue 1 # queue a job that will never start request_memory = 40TB queue 1 Exploring OSPool job class ad attributes \u00b6 For this exercise, you will submit the three jobs defined in the submit file above, then examine their job class ad attributes. Here are some attributes that may be interesting: CpusProvisioned is the number of CPUs given to your job for the current or most recent run ResidentSetSize_RAW is the maximum amount of memory that HTCondor has noticed your job using (in KB) DiskUsage_RAW is the maximum amount of disk that HTCondor has noticed your job using (in MB) NumJobStarts is the number of times HTCondor has started your job; 1 is typical for a running job, and higher counts may indicate issues running the job LastRemoteHost identifies the name for the slot where your job is running or most recently ran MachineAttrGLIDEIN_ResourceName*N* is a set of numbered attributes that identify the most recent sites where your job ran; N is 0 for the most recent (or current) run, 1 for the previous run, and so on up to 9 ExitCode exists only if your job exited (completed) at least once; a value of 0 typically means success HoldReasonCode exists only if your job went on hold; if so, it is a number corresponding to the main hold reason (see here for details) NumHoldsByReason is a list of all of the main reasons your job has gone on hold so far with counts of each hold type Let\u2019s explore these attributes on real jobs. Submit the jobs (above) and note the cluster ID When one job from the cluster is running, view all of its job class ad attributes: $ condor_q -l where is your job's ID, and -l stands for -long This command lists all of the job\u2019s class ad attributes. Details of some of the attributes are in the HTcondor Manual . Others are defined (and not well documented) only for the OSPool. Can you find any of the attributes listed above? Next, use condor_q -af to examine one attribute at a time for several jobs: $ condor_q -af NumJobStarts where is the HTCondor cluster ID noted above, and -af stands for -autoformat . What does the output tell you? Finally, display several attributes at once for the jobs: $ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode Why do some values appear as undefined ?","title":"1.2 - Investigating Job Attributes"},{"location":"materials/scaling/part1-ex2-job-attributes/#exercise-12-investigating-job-attributes","text":"The objective of this exercise is to your awareness of job \"class ad attributes\", especially ones that may help you look for issues with your jobs in the OSPool. Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. OSPool jobs contain extra attributes that are specific to that pool. Thus, an OSPool job class ad may have well over 150 attributes. 
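If you are curious, you can get a rough count for one of your own jobs by piping the full ClassAd through wc (a sketch, where JOB_ID stands in for a real ID from condor_q ): $ condor_q -long JOB_ID | wc -l Each attribute is printed on its own line, so the line count is approximately the number of attributes.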
Some OSPool job attributes are especially helpful when you are scaling up jobs and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention.","title":"Exercise 1.2: Investigating Job Attributes"},{"location":"materials/scaling/part1-ex2-job-attributes/#preparing-exercise-files","text":"Because this exercise focuses on OSPool job attributes, please use your OSPool account on ap40.uw.osg-htc.org . Create a shell script for testing called simple.sh : #!/bin/bash SLEEPTIME=$1 hostname pwd whoami for i in {1..5} do echo \"performing iteration $i\" sleep $SLEEPTIME done Create an HTCondor submit file that queues three jobs: universe = vanilla log = logs/$(Cluster)_$(Process).log error = logs/$(Cluster)_$(Process).err output = $(Cluster)_$(Process).out executable = simple.sh should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1GB # set arguments, queue a normal job arguments = 600 queue 1 # queue a job that will go on hold transfer_input_files = test.txt queue 1 # queue a job that will never start request_memory = 40TB queue 1","title":"Preparing exercise files"},{"location":"materials/scaling/part1-ex2-job-attributes/#exploring-ospool-job-class-ad-attributes","text":"For this exercise, you will submit the three jobs defined in the submit file above, then examine their job class ad attributes. Here are some attributes that may be interesting: CpusProvisioned is the number of CPUs given to your job for the current or most recent run ResidentSetSize_RAW is the maximum amount of memory that HTCondor has noticed your job using (in KB) DiskUsage_RAW is the maximum amount of disk that HTCondor has noticed your job using (in MB) NumJobStarts is the number of times HTCondor has started your job; 1 is typical for a running job, and higher counts may indicate issues running the job LastRemoteHost identifies the name for the slot where your job is running or most recently ran MachineAttrGLIDEIN_ResourceName*N* is a set of numbered attributes that identify the most recent sites where your job ran; N is 0 for the most recent (or current) run, 1 for the previous run, and so on up to 9 ExitCode exists only if your job exited (completed) at least once; a value of 0 typically means success HoldReasonCode exists only if your job went on hold; if so, it is a number corresponding to the main hold reason (see here for details) NumHoldsByReason is a list of all of the main reasons your job has gone on hold so far with counts of each hold type Let\u2019s explore these attributes on real jobs. Submit the jobs (above) and note the cluster ID When one job from the cluster is running, view all of its job class ad attributes: $ condor_q -l where is your job's ID, and -l stands for -long This command lists all of the job\u2019s class ad attributes. Details of some of the attributes are in the HTcondor Manual . Others are defined (and not well documented) only for the OSPool. Can you find any of the attributes listed above? Next, use condor_q -af to examine one attribute at a time for several jobs: $ condor_q -af NumJobStarts where is the HTCondor cluster ID noted above, and -af stands for -autoformat . What does the output tell you? 
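If the jobs have already left the queue by the time you check, a similar query against the history should work (a sketch, with CLUSTER_ID standing in for the cluster ID you noted): $ condor_history CLUSTER_ID -af NumJobStarts ExitCode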
Finally, display several attributes at once for the jobs: $ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode Why do some values appear as undefined ?","title":"Exploring OSPool job class ad attributes"},{"location":"materials/scaling/part1-ex3-log-files/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Getting Job Information from Log Files \u00b6 HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful to you . Here, we have a few examples of how to use some powerful Unix commands ( grep , sort , uniq ) to pull information out of these job log files. It is now time for you to try these on your own jobs! Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) in to a new directory for this exercise. Use these log files in place of my-job.log in the examples below. The grep command displays lines from a file matching a given pattern, where the pattern is the first argument provided to grep . For example grep 'alice' address_book.txt would print out all lines containing the characters alice in the file named address_book.txt . While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise. Job terminations \u00b6 Lines for job termination events in the job log always start with 005 and contain the timestamp of when the job(s) ended. Use the following grep command to get a list of when jobs ended in your log files: $ grep '^005' my-job.log Optional challenge : What is the importance of ^ in the pattern ( ^005 ) provided above? Recall that executables typically exit with code 0 when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word termination . Use grep to get a list of jobs' exit codes: $ grep termination my-job.log By \"piping\" the output of the previous command through the sort and then uniq commands, we can get a count of each exit code: $ grep termination my-job.log | sort | uniq -c Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code 0 , while two exited 1 : [username@ap40]$ grep '^005' my-job.log 005 (236881.000.000) 2022-07-27 15:07:38 Job terminated. 005 (236883.000.000) 2022-07-27 15:07:42 Job terminated. 005 (236882.000.000) 2022-07-27 15:08:01 Job terminated. 005 (236880.000.000) 2022-07-27 15:08:07 Job terminated. 005 (236891.000.000) 2022-07-27 15:13:31 Job terminated. 005 (236893.000.000) 2022-07-27 15:13:32 Job terminated. 005 (236892.000.000) 2022-07-27 15:13:58 Job terminated. 005 (236890.000.000) 2022-07-27 15:13:59 Job terminated. 
[username@ap40]$ grep 'termination' my-job.log (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 0) [username@ap40]$ grep 'termination' my-job.log | sort | uniq -c 6 (1) Normal termination (return value 0) 2 (1) Normal termination (return value 1) Job resource usage \u00b6 Jobs' resource usages (and requests and allocations) are logged in the following format: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 10382 1048576 1468671 Memory (MB) : 692 1024 1024 Run the following grep command to pull out the memory information from your job logs: $ grep 'Memory (MB) *:' my-job.log Look back at the format in the example above. Columns after the : will first show memory usage, then memory requested, and then the memory allocated to your job. Similarly, use the following command to get the disk information from your job logs: $ grep 'Disk (KB) *:' my-job.log Here's some example output from running the memory grep command on the same eight-job log file: [username@ap40]$ grep 'Memory (MB) *:' my-job.log Memory (MB) : 692 1024 1024 Memory (MB) : 714 1024 1024 Memory (MB) : 703 1024 1024 Memory (MB) : 699 1024 1024 Memory (MB) : 705 1024 1024 Memory (MB) : 704 1024 1024 Memory (MB) : 711 1024 1024 Memory (MB) : 697 1024 1024 In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory. Other job information \u00b6 See if you can come up with grep commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison: [username@ap40]$ grep '' my-job.log 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job [username@ap40]$ grep '' my-job.log 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using grep to pull out specific lines and using sort and uniq to reduce the output can help you make sense of the information contained in the logs.","title":"1.3 - Getting Job Information from Log Files"},{"location":"materials/scaling/part1-ex3-log-files/#getting-job-information-from-log-files","text":"HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful to you . Here, we have a few examples of how to use some powerful Unix commands ( grep , sort , uniq ) to pull information out of these job log files. 
It is now time for you to try these on your own jobs! Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) in to a new directory for this exercise. Use these log files in place of my-job.log in the examples below. The grep command displays lines from a file matching a given pattern, where the pattern is the first argument provided to grep . For example grep 'alice' address_book.txt would print out all lines containing the characters alice in the file named address_book.txt . While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise.","title":"Getting Job Information from Log Files"},{"location":"materials/scaling/part1-ex3-log-files/#job-terminations","text":"Lines for job termination events in the job log always start with 005 and contain the timestamp of when the job(s) ended. Use the following grep command to get a list of when jobs ended in your log files: $ grep '^005' my-job.log Optional challenge : What is the importance of ^ in the pattern ( ^005 ) provided above? Recall that executables typically exit with code 0 when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word termination . Use grep to get a list of jobs' exit codes: $ grep termination my-job.log By \"piping\" the output of the previous command through the sort and then uniq commands, we can get a count of each exit code: $ grep termination my-job.log | sort | uniq -c Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code 0 , while two exited 1 : [username@ap40]$ grep '^005' my-job.log 005 (236881.000.000) 2022-07-27 15:07:38 Job terminated. 005 (236883.000.000) 2022-07-27 15:07:42 Job terminated. 005 (236882.000.000) 2022-07-27 15:08:01 Job terminated. 005 (236880.000.000) 2022-07-27 15:08:07 Job terminated. 005 (236891.000.000) 2022-07-27 15:13:31 Job terminated. 005 (236893.000.000) 2022-07-27 15:13:32 Job terminated. 005 (236892.000.000) 2022-07-27 15:13:58 Job terminated. 005 (236890.000.000) 2022-07-27 15:13:59 Job terminated. [username@ap40]$ grep 'termination' my-job.log (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 0) [username@ap40]$ grep 'termination' my-job.log | sort | uniq -c 6 (1) Normal termination (return value 0) 2 (1) Normal termination (return value 1)","title":"Job terminations"},{"location":"materials/scaling/part1-ex3-log-files/#job-resource-usage","text":"Jobs' resource usages (and requests and allocations) are logged in the following format: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 10382 1048576 1468671 Memory (MB) : 692 1024 1024 Run the following grep command to pull out the memory information from your job logs: $ grep 'Memory (MB) *:' my-job.log Look back at the format in the example above. Columns after the : will first show memory usage, then memory requested, and then the memory allocated to your job. 
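If one log file records many jobs, scanning every Memory line by eye gets tedious. One way to pull out just the usage column and find the largest value (a suggested extension of the command above, not part of the original exercise) is to pipe the grep output through awk and sort:
$ grep 'Memory (MB) *:' my-job.log | awk '{print $4}' | sort -n | tail -n 1
Here $4 is the fourth whitespace-separated field on each matching line, which is the usage column; swapping in $5 or $6 would show the requested or allocated values instead.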
Similarly, use the following command to get the disk information from your job logs: $ grep 'Disk (KB) *:' my-job.log Here's some example output from running the memory grep command on the same eight-job log file: [username@ap40]$ grep 'Memory (MB) *:' my-job.log Memory (MB) : 692 1024 1024 Memory (MB) : 714 1024 1024 Memory (MB) : 703 1024 1024 Memory (MB) : 699 1024 1024 Memory (MB) : 705 1024 1024 Memory (MB) : 704 1024 1024 Memory (MB) : 711 1024 1024 Memory (MB) : 697 1024 1024 In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory.","title":"Job resource usage"},{"location":"materials/scaling/part1-ex3-log-files/#other-job-information","text":"See if you can come up with grep commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison: [username@ap40]$ grep '' my-job.log 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job [username@ap40]$ grep '' my-job.log 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using grep to pull out specific lines and using sort and uniq to reduce the output can help you make sense of the information contained in the logs.","title":"Other job information"},{"location":"materials/software/part1-ex1-run-apptainer/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.1: Run and Explore Containers \u00b6 Objective : Run a container interactively Why learn this? : Being able to run a container directly allows you to confirm what is installed and whether any additional scripts or code will work in the context of the container. Setup \u00b6 Make sure you are logged into ap40.uw.osg-htc.org . For this exercise we will be using Apptainer containers maintained by OSG staff or existing containers on Docker Hub. We will set two environment variables that will help lighten the load on the Access Point as we work with containers: $ mkdir ~/apptainer_cache $ export APPTAINER_CACHEDIR = $HOME /apptainer_cache $ export TMPDIR = $HOME /apptainer_cache Exploring Apptainer Containers \u00b6 First, let's try to run a container from the OSG-Supported List . Find the full path for the ubuntu 22.04 container image. To run it, use this command: $ apptainer shell /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 It may take a few minutes to start - don't worry if this happens. Once the container starts, the prompt will change to either Singularity> or Apptainer> . Run ls and pwd . Where are you? Do you see your files? 
The apptainer shell command will automatically connect your home directory to the running container so you can use your files. How do we know we're in a different Linux environment? Try printing out the Linux version, or checking the version of common tools like gcc or Python: $ cat /etc/os-release $ gcc --version $ python3 --version Exit out of the container by typing exit . Type the same commands back on the normal Access Point. Should they give the same results as when typed in the container, or different? $ cat /etc/os-release $ gcc --version $ python3 --version Exploring Docker Containers \u00b6 The process for interactively running a Docker container will be very similar to an apptainer container. The main difference is a docker:// prefix before the container's identifying name. We are going to be using a Python image from Docker Hub . Click on the \"Tags\" tab to see all the different versions of this container that exists. Let's use version 3.10 . To run it interactively, use this command: $ apptainer shell docker://python:3.10 Once the container starts and the prompt changes, try running similar commands as above. What version of Linux is used in this container? Does the version of Python match what you expect, based on the name of the container? Once done, type exit to leave the container.","title":"1.1 - Run and Explore Apptainer Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#software-exercise-11-run-and-explore-containers","text":"Objective : Run a container interactively Why learn this? : Being able to run a container directly allows you to confirm what is installed and whether any additional scripts or code will work in the context of the container.","title":"Software Exercise 1.1: Run and Explore Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#setup","text":"Make sure you are logged into ap40.uw.osg-htc.org . For this exercise we will be using Apptainer containers maintained by OSG staff or existing containers on Docker Hub. We will set two environment variables that will help lighten the load on the Access Point as we work with containers: $ mkdir ~/apptainer_cache $ export APPTAINER_CACHEDIR = $HOME /apptainer_cache $ export TMPDIR = $HOME /apptainer_cache","title":"Setup"},{"location":"materials/software/part1-ex1-run-apptainer/#exploring-apptainer-containers","text":"First, let's try to run a container from the OSG-Supported List . Find the full path for the ubuntu 22.04 container image. To run it, use this command: $ apptainer shell /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 It may take a few minutes to start - don't worry if this happens. Once the container starts, the prompt will change to either Singularity> or Apptainer> . Run ls and pwd . Where are you? Do you see your files? The apptainer shell command will automatically connect your home directory to the running container so you can use your files. How do we know we're in a different Linux environment? Try printing out the Linux version, or checking the version of common tools like gcc or Python: $ cat /etc/os-release $ gcc --version $ python3 --version Exit out of the container by typing exit . Type the same commands back on the normal Access Point. Should they give the same results as when typed in the container, or different? 
$ cat /etc/os-release $ gcc --version $ python3 --version","title":"Exploring Apptainer Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#exploring-docker-containers","text":"The process for interactively running a Docker container will be very similar to an apptainer container. The main difference is a docker:// prefix before the container's identifying name. We are going to be using a Python image from Docker Hub . Click on the \"Tags\" tab to see all the different versions of this container that exists. Let's use version 3.10 . To run it interactively, use this command: $ apptainer shell docker://python:3.10 Once the container starts and the prompt changes, try running similar commands as above. What version of Linux is used in this container? Does the version of Python match what you expect, based on the name of the container? Once done, type exit to leave the container.","title":"Exploring Docker Containers"},{"location":"materials/software/part1-ex2-apptainer-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.2: Use Apptainer Containers in OSPool Jobs \u00b6 Objective : Submit a job that uses an existing apptainer container; compare default job environment with a specific container job environment. Why learn this? : By comparing a non-container and container job, you'll better understand what a container can do on the OSPool. This may also be how you end up submitting your jobs if you can find an existing apptainer container with your software. Default Environment \u00b6 First, let's run a job without a container to see what the typical job environment is. Create a bash script with the following lines: #!/bin/bash hostname cat /etc/os-release gcc --version python3 --version This will print out the version of Linux on the computer, the version of gcc , a common software compiler, and the version of Python 3. Make the script executable: $ chmod +x script.sh Run the script on the Access Point. $ ./script.sh What results did you get? Copy a submit file from a previous OSPool job and edit it so that the script you just wrote is the executable. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc or Python? Container Environment \u00b6 Now, let's try running that same script inside a container. For this job, we will use the OSG-provided Ubuntu \"Focal\" image, as we did in the previous exercise. The container_image submit file option will tell HTCondor to use this container for the job: universe = container container_image = /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 If the submit file you copied has something like requirements = (OSGVO_OS_STRING == \"RHEL 9\") , remove that. When you use containers, you should not specify an OS in the requirements as that will unnecessarily limit the number of resources you can run on. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc ? or Python? Experimenting With Other Containers \u00b6 Look at the list of OSG-Supported containers: OSG Supported Containers Try submitting a job that uses one of these containers. 
Change the executable script to explore different aspects of that container.","title":"1.2 - Use Apptainer Containers in OSPool Jobs"},{"location":"materials/software/part1-ex2-apptainer-jobs/#software-exercise-12-use-apptainer-containers-in-ospool-jobs","text":"Objective : Submit a job that uses an existing apptainer container; compare default job environment with a specific container job environment. Why learn this? : By comparing a non-container and container job, you'll better understand what a container can do on the OSPool. This may also be how you end up submitting your jobs if you can find an existing apptainer container with your software.","title":"Software Exercise 1.2: Use Apptainer Containers in OSPool Jobs"},{"location":"materials/software/part1-ex2-apptainer-jobs/#default-environment","text":"First, let's run a job without a container to see what the typical job environment is. Create a bash script with the following lines: #!/bin/bash hostname cat /etc/os-release gcc --version python3 --version This will print out the version of Linux on the computer, the version of gcc , a common software compiler, and the version of Python 3. Make the script executable: $ chmod +x script.sh Run the script on the Access Point. $ ./script.sh What results did you get? Copy a submit file from a previous OSPool job and edit it so that the script you just wrote is the executable. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc or Python?","title":"Default Environment"},{"location":"materials/software/part1-ex2-apptainer-jobs/#container-environment","text":"Now, let's try running that same script inside a container. For this job, we will use the OSG-provided Ubuntu \"Focal\" image, as we did in the previous exercise. The container_image submit file option will tell HTCondor to use this container for the job: universe = container container_image = /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 If the submit file you copied has something like requirements = (OSGVO_OS_STRING == \"RHEL 9\") , remove that. When you use containers, you should not specify an OS in the requirements as that will unnecessarily limit the number of resources you can run on. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc ? or Python?","title":"Container Environment"},{"location":"materials/software/part1-ex2-apptainer-jobs/#experimenting-with-other-containers","text":"Look at the list of OSG-Supported containers: OSG Supported Containers Try submitting a job that uses one of these containers. Change the executable script to explore different aspects of that container.","title":"Experimenting With Other Containers"},{"location":"materials/software/part1-ex3-docker-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 1.3: Use Docker Containers in OSPool Jobs \u00b6 Objective : Create a local copy of a Docker container, use it to submit a job. Why learn this? : Same as the previous exercise; this may also be how you end up submitting your jobs if you can find an existing Docker container with your software. Create Local Copy of Docker Container \u00b6 While it is technically possible to use a Docker container directly in a job, there are some good reasons for converting it to a local Apptainer container first. 
We'll do this with the same python:3.10 Docker container we used in the first exercise . To convert the Docker container to a local Apptainer container, run: $ apptainer build local-py310.sif docker://python:3.10 The first argument after build is the name of the new Apptainer container file, the second argument is what we're building from (in this case, Docker). Submit File and Executable \u00b6 Make a copy of your submit file from the previous container exercise or build from an existing submit file. Add the following lines to the submit file or modify existing lines to match the lines below: universe = container container_image = local-py310.sif Use the same executable as the previous exercise . Once these steps are done, submit the job. You might get a warning about using OSDF for container transfers - ignore this warning for now. Finding Docker Containers \u00b6 There are a lot of Docker containers on Docker Hub, but they are not all created equal. Anyone can create an account on Docker Hub and share container images there, so it\u2019s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure: The container image is updated regularly. The container image is associated with a well established company, community, or other group that is well-known. There is a Dockerfile or other listing of what has been installed to the container image. The container image page has documentation on how to use the container image. [^1] Given these indicators: Can you find a container on Docker Hub that would be useful for running Jupyter notebooks that use tensorflow? Does your chosen image meet at least 2 of the criteria above? [^1]: This list and previous text taken from Introduction to Docker","title":"1.3 - Use Docker Containers in OSPool Jobs"},{"location":"materials/software/part1-ex3-docker-jobs/#software-exercise-13-use-docker-containers-in-ospool-jobs","text":"Objective : Create a local copy of a Docker container, use it to submit a job. Why learn this? : Same as the previous exercise; this may also be how you end up submitting your jobs if you can find an existing Docker container with your software.","title":"Software Exercise 1.3: Use Docker Containers in OSPool Jobs"},{"location":"materials/software/part1-ex3-docker-jobs/#create-local-copy-of-docker-container","text":"While it is technically possible to use a Docker container directly in a job, there are some good reasons for converting it to a local Apptainer container first. We'll do this with the same python:3.10 Docker container we used in the first exercise . To convert the Docker container to a local Apptainer container, run: $ apptainer build local-py310.sif docker://python:3.10 The first argument after build is the name of the new Apptainer container file, the second argument is what we're building from (in this case, Docker).","title":"Create Local Copy of Docker Container"},{"location":"materials/software/part1-ex3-docker-jobs/#submit-file-and-executable","text":"Make a copy of your submit file from the previous container exercise or build from an existing submit file. Add the following lines to the submit file or modify existing lines to match the lines below: universe = container container_image = local-py310.sif Use the same executable as the previous exercise . Once these steps are done, submit the job. 
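If you would like to double-check the converted image first, one quick test (a suggestion, not part of the original exercise) is to run Python non-interactively from the new .sif file:
$ apptainer exec local-py310.sif python3 --version
If this prints a Python 3.10.x version string, the conversion worked as expected.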
You might get a warning about using OSDF for container transfers - ignore this warning for now.","title":"Submit File and Executable"},{"location":"materials/software/part1-ex3-docker-jobs/#finding-docker-containers","text":"There are a lot of Docker containers on Docker Hub, but they are not all created equal. Anyone can create an account on Docker Hub and share container images there, so it\u2019s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure: The container image is updated regularly. The container image is associated with a well established company, community, or other group that is well-known. There is a Dockerfile or other listing of what has been installed to the container image. The container image page has documentation on how to use the container image. [^1] Given these indicators: Can you find a container on Docker Hub that would be useful for running Jupyter notebooks that use tensorflow? Does your chosen image meet at least 2 of the criteria above? [^1]: This list and previous text taken from Introduction to Docker","title":"Finding Docker Containers"},{"location":"materials/software/part1-ex4-apptainer-build/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.4: Build, Test, and Deploy an Apptainer Container \u00b6 Objective : to practice building and using a custom apptainer container Why learn this? : You may need to go through this process if you want to use a container for your jobs and can't find one that has what you need. Motivating Script \u00b6 Create a script called hello-cow.py : #!/usr/bin/env python3 import cowsay cowsay.cow('Hello OSG User School') Give it executable permissions: $ chmod +x hello-cow.py Try running the script: $ ./hello-cow.py It will likely fail, because the cowsay library isn't installed. This is a scenario where we will want to build our own container that includes a base Python installation and the cowsay Python library. Preparing a Definition File \u00b6 We can describe our desired Apptainer image in a special format called a definition file . This has special keywords that will direct Apptainer when it builds the container image. Create a file called py-cowsay.def with these contents: Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 %post apt-get update -y apt-get install -y \\ python3-pip \\ python3-numpy python3 -m pip install cowsay Note that we are starting with the same ubuntu base we used in previous exercises. The %post statement includes our installation commands, including updating the pip and numpy packages, and then using pip to install cowsay . To learn more about definition files, see Exercise 3.1 Build the Container \u00b6 Once the definition file is complete, we can build the container. Run the following command to build the container: $ apptainer build py-cowsay.sif py-cowsay.def As with the Docker image in the previous exercise , the first argument is the name to give to the newly create image file and the second argument is how to build the container image - in this case, the definition file. Testing the Image Locally \u00b6 Do you remember how to interactively test an image? Look back at Exercise 1.1 and guess what command would allow us to test our new container. 
Try running: $ apptainer shell py-cowsay.sif Then try running the hello-cow.py script: apptainer> ./hello-cow.py If it produces an output, our container works! We can now exit (by typing exit ) and submit a job. Submit a Job \u00b6 Make a copy of a submit file from a previous exercise in this section. Can you guess what options need to be used or modified? Make sure you have the following (in addition to log , error , output and CPU and memory requests): universe = container container_image = py-cowsay.sif executable = hello-cow.py Submit the job and verify the output when it completes. ______________________ | Hello OSG User School! | ====================== \\ \\ ^__^ (oo)\\_______ (__)\\ )\\/\\ ||----w | || ||","title":"1.4 - Build, Test, and Deploy an Apptainer Container"},{"location":"materials/software/part1-ex4-apptainer-build/#software-exercise-14-build-test-and-deploy-an-apptainer-container","text":"Objective : to practice building and using a custom apptainer container Why learn this? : You may need to go through this process if you want to use a container for your jobs and can't find one that has what you need.","title":"Software Exercise 1.4: Build, Test, and Deploy an Apptainer Container"},{"location":"materials/software/part1-ex4-apptainer-build/#motivating-script","text":"Create a script called hello-cow.py : #!/usr/bin/env python3 import cowsay cowsay.cow('Hello OSG User School') Give it executable permissions: $ chmod +x hello-cow.py Try running the script: $ ./hello-cow.py It will likely fail, because the cowsay library isn't installed. This is a scenario where we will want to build our own container that includes a base Python installation and the cowsay Python library.","title":"Motivating Script"},{"location":"materials/software/part1-ex4-apptainer-build/#preparing-a-definition-file","text":"We can describe our desired Apptainer image in a special format called a definition file . This has special keywords that will direct Apptainer when it builds the container image. Create a file called py-cowsay.def with these contents: Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 %post apt-get update -y apt-get install -y \\ python3-pip \\ python3-numpy python3 -m pip install cowsay Note that we are starting with the same ubuntu base we used in previous exercises. The %post statement includes our installation commands, including updating the pip and numpy packages, and then using pip to install cowsay . To learn more about definition files, see Exercise 3.1","title":"Preparing a Definition File"},{"location":"materials/software/part1-ex4-apptainer-build/#build-the-container","text":"Once the definition file is complete, we can build the container. Run the following command to build the container: $ apptainer build py-cowsay.sif py-cowsay.def As with the Docker image in the previous exercise , the first argument is the name to give to the newly create image file and the second argument is how to build the container image - in this case, the definition file.","title":"Build the Container"},{"location":"materials/software/part1-ex4-apptainer-build/#testing-the-image-locally","text":"Do you remember how to interactively test an image? Look back at Exercise 1.1 and guess what command would allow us to test our new container. Try running: $ apptainer shell py-cowsay.sif Then try running the hello-cow.py script: apptainer> ./hello-cow.py If it produces an output, our container works! 
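As an alternative to the interactive shell, you can also run the script through the container in a single step with apptainer exec (a suggested variation, not part of the original exercise):
$ apptainer exec py-cowsay.sif ./hello-cow.py
This assumes you run the command from the directory containing hello-cow.py, which Apptainer normally binds into the container along with your home directory.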
We can now exit (by typing exit ) and submit a job.","title":"Testing the Image Locally"},{"location":"materials/software/part1-ex4-apptainer-build/#submit-a-job","text":"Make a copy of a submit file from a previous exercise in this section. Can you guess what options need to be used or modified? Make sure you have the following (in addition to log , error , output and CPU and memory requests): universe = container container_image = py-cowsay.sif executable = hello-cow.py Submit the job and verify the output when it completes. ______________________ | Hello OSG User School! | ====================== \\ \\ ^__^ (oo)\\_______ (__)\\ )\\/\\ ||----w | || ||","title":"Submit a Job"},{"location":"materials/software/part1-ex5-pick-an-option/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.5 - Choose Software Options \u00b6 Objective : Decide how you want to make your software portable Why learn this? : This is the next step to getting your own research jobs running on the OSPool! Know Your Software \u00b6 Pick at least one software you want to use on the OSPool as a test subject. Then: Find the download and/or installation page and read through the instructions and options there. Is the software available as a binary download, or will you need to run some kind of command to install it or compile it from source? If there are multiple download/installation options, which is which? What pre-requisites does this software need to be installed? Example 1: an R package will require a base R installation Example 2: some codes require that a library called the \"Gnu Scientific Library (GSL) be already installed on your computer\" Choose a Strategy \u00b6 Are there any existing containers that contain this software already? Explore OSG-Supported Containers Explore DockerHub , for example: miniconda rocker jupyter nvidia (and many more!) If yes, try using this container first, as shown in Exercise 1.2 and Exercise 1.3 Is there a simple download or easy compilation process? If so, can you download the software and use it via a wrapper script? See the exercises from Part 4 ( Download Software Files , Use a Wrapper Script , Wrapper Script Arguments ). To learn more about using this approach for specific softwares, see the examples in Part 5 . Are you using conda? See the specific example in Exercise 5.3 If neither of the above options works (which may be true for more software!), you may want to build your own container. If you want to just use this container on the OSPool, build an Apptainer container as described in Exercise 1.4 and with more information in Exercise 3.1 If you want to use the container on your own computer or share with others who would use it on a laptop or desktop, look at the Docker container example in Exercise 3.2 . Don't do ALL of the software exercises in parts 3 - 5! Instead, choose the section(s) that makes sense based on how you want to manage your software. Talk to the School instructors to help make this decision if you are unsure. Create an Executable \u00b6 Regardless of which approach you use, check out the Build an HTC-Friendly Executable exercise for some tips on how to make your script more robust and easy to use with multiple jobs.","title":"1.5 - Choose Software Options"},{"location":"materials/software/part1-ex5-pick-an-option/#software-exercise-15-choose-software-options","text":"Objective : Decide how you want to make your software portable Why learn this? 
: This is the next step to getting your own research jobs running on the OSPool!","title":"Software Exercise 1.5 - Choose Software Options"},{"location":"materials/software/part1-ex5-pick-an-option/#know-your-software","text":"Pick at least one software you want to use on the OSPool as a test subject. Then: Find the download and/or installation page and read through the instructions and options there. Is the software available as a binary download, or will you need to run some kind of command to install it or compile it from source? If there are multiple download/installation options, which is which? What pre-requisites does this software need to be installed? Example 1: an R package will require a base R installation Example 2: some codes require that a library called the \"Gnu Scientific Library (GSL) be already installed on your computer\"","title":"Know Your Software"},{"location":"materials/software/part1-ex5-pick-an-option/#choose-a-strategy","text":"Are there any existing containers that contain this software already? Explore OSG-Supported Containers Explore DockerHub , for example: miniconda rocker jupyter nvidia (and many more!) If yes, try using this container first, as shown in Exercise 1.2 and Exercise 1.3 Is there a simple download or easy compilation process? If so, can you download the software and use it via a wrapper script? See the exercises from Part 4 ( Download Software Files , Use a Wrapper Script , Wrapper Script Arguments ). To learn more about using this approach for specific softwares, see the examples in Part 5 . Are you using conda? See the specific example in Exercise 5.3 If neither of the above options works (which may be true for more software!), you may want to build your own container. If you want to just use this container on the OSPool, build an Apptainer container as described in Exercise 1.4 and with more information in Exercise 3.1 If you want to use the container on your own computer or share with others who would use it on a laptop or desktop, look at the Docker container example in Exercise 3.2 . Don't do ALL of the software exercises in parts 3 - 5! Instead, choose the section(s) that makes sense based on how you want to manage your software. Talk to the School instructors to help make this decision if you are unsure.","title":"Choose a Strategy"},{"location":"materials/software/part1-ex5-pick-an-option/#create-an-executable","text":"Regardless of which approach you use, check out the Build an HTC-Friendly Executable exercise for some tips on how to make your script more robust and easy to use with multiple jobs.","title":"Create an Executable"},{"location":"materials/software/part2-ex1-build-executable/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 2.1 Build an HTC-Friendly Executable \u00b6 Objective : Modify an existing script to include arguments and headers. Why learn this? : A little bit of preparation can make it easier to reuse the same script over and over to run many jobs. Setup \u00b6 Download and unzip a set of Protein Data Bank (PDB) files: $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/alkanes.tar.gz $ tar -xzf alkanes.tar.gz For these exercises, we are going to run a command that counts the number of atoms in the PDB file . Run it now as an example: $ grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb Add a Header \u00b6 To create a basic script, you can put the command above into a file called get_atoms.sh . 
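Any text editor (such as nano or vim) works for creating the file; if you prefer to write it directly from the command line, one option (a suggestion, not part of the original exercise) is a shell here-document:
$ cat > get_atoms.sh << 'EOF'
grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb
EOF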
To make it clear what language we expect to use to run the script, we will add the following header on the first line: `#!/bin/bash #!/bin/bash grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb The \"header\" of #!/bin/bash will tell the computer that this is a bash shell script and can be run in the same way that you would run individual commands on the command line. We use /bin/bash instead of just bash because that is the full path to the bash software file. Other languages We can use the same principle for any scripting language. For example, the header for a Python script could be either #!/usr/bin/python3 or #!/usr/bin/env python3 . Similar logic works for perl, R, julia and other scripting languages. Can you now run the script? $ ./get_atoms.sh This gives \"permission denied.\" Let's add executable permissions to the script and try again: $ chmod +x get_atoms.sh $ ./get_atoms.sh Incorporate Arguments \u00b6 Can you imagine trying to run this script on all of our pdb files? It would be tedious to edit it for each one, even for only six inputs. Instead, we should add arguments to the script to make it easy to reuse the script. Any information in a script or executable that is going to change or vary across jobs or analyses should likely be turned into an argument that is specified on the command line. In our example above, which pieces of the script are likely to change or vary? The name of the input file ( cubane.pdb ) and output file ( atoms_cubane.pdb ) should be turned into arguments. Can you envision what our script should look like if we ran it with input arguments? Let's say we want to be able to run the following command: $ ./get_atoms.sh cubane.pdb atoms_cubane.pdb In order to get arguments from the command line into the script, you have to use special variables in the script. In bash, these are $1 (for the first argument), $2 (for the second argument) and so on. Try to figure out where these should go in our get_atoms.sh script. Other Languages Each language is going to have its own syntax for reading command line arguments into the script. In Python, sys.argv is a basic method, and more advanced libraries like argparse can be used. In R, the commandArgs() function can do this. Google \"command line arguments in ______\" to find the right syntax for your language of choice! A first pass at adding arguments might look like this: #!/bin/bash grep ATOM $1 | wc -l > $2 Try running it as described above. Does it work? While we now have arguments, we have lost some of the readability of our script. The numbers $1 and $2 are not very meaningful in themselves! Let's rewrite the script to assign the arguments to meaningful variable names: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=$2 grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} Why curly brackets? You'll notice above that we started using curly brackets around our variables. While you technically don't need them ( $PDB_INPUT would also be fine), using them makes the name of the variable (compared to other text) completely clear. This is especially useful when combining variables with underscores. There is one final place where we could optimize this script. If we want our output files to always have the same naming convention, based on the input file name, then we shouldn't have a separate argument for that -- it's asking for typos. Instead, we should use variables inside the script to construct the output file name, based on the input file. 
That will look like this: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=atoms_${PDB_INPUT} grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} You may want to construct other variables, like paths and filenames in this way. But it depends on how you want to use the script! If we want the flexibility of specifying a custom output file name, then we should undo this last change so it can be treated as a separate argument. Your Work \u00b6 Are you using a scripting language where you could add a header to your main script? If so, what should it be? What items in your main code or commands are changing? Do you need to add arguments to your code?","title":"2.1 - Build an HTC-Friendly Executable"},{"location":"materials/software/part2-ex1-build-executable/#software-exercise-21-build-an-htc-friendly-executable","text":"Objective : Modify an existing script to include arguments and headers. Why learn this? : A little bit of preparation can make it easier to reuse the same script over and over to run many jobs.","title":"Software Exercise 2.1 Build an HTC-Friendly Executable"},{"location":"materials/software/part2-ex1-build-executable/#setup","text":"Download and unzip a set of Protein Data Bank (PDB) files: $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/alkanes.tar.gz $ tar -xzf alkanes.tar.gz For these exercises, we are going to run a command that counts the number of atoms in the PDB file . Run it now as an example: $ grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb","title":"Setup"},{"location":"materials/software/part2-ex1-build-executable/#add-a-header","text":"To create a basic script, you can put the command above into a file called get_atoms.sh . To make it clear what language we expect to use to run the script, we will add the following header on the first line: `#!/bin/bash #!/bin/bash grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb The \"header\" of #!/bin/bash will tell the computer that this is a bash shell script and can be run in the same way that you would run individual commands on the command line. We use /bin/bash instead of just bash because that is the full path to the bash software file. Other languages We can use the same principle for any scripting language. For example, the header for a Python script could be either #!/usr/bin/python3 or #!/usr/bin/env python3 . Similar logic works for perl, R, julia and other scripting languages. Can you now run the script? $ ./get_atoms.sh This gives \"permission denied.\" Let's add executable permissions to the script and try again: $ chmod +x get_atoms.sh $ ./get_atoms.sh","title":"Add a Header"},{"location":"materials/software/part2-ex1-build-executable/#incorporate-arguments","text":"Can you imagine trying to run this script on all of our pdb files? It would be tedious to edit it for each one, even for only six inputs. Instead, we should add arguments to the script to make it easy to reuse the script. Any information in a script or executable that is going to change or vary across jobs or analyses should likely be turned into an argument that is specified on the command line. In our example above, which pieces of the script are likely to change or vary? The name of the input file ( cubane.pdb ) and output file ( atoms_cubane.pdb ) should be turned into arguments. Can you envision what our script should look like if we ran it with input arguments? 
Let's say we want to be able to run the following command: $ ./get_atoms.sh cubane.pdb atoms_cubane.pdb In order to get arguments from the command line into the script, you have to use special variables in the script. In bash, these are $1 (for the first argument), $2 (for the second argument) and so on. Try to figure out where these should go in our get_atoms.sh script. Other Languages Each language is going to have its own syntax for reading command line arguments into the script. In Python, sys.argv is a basic method, and more advanced libraries like argparse can be used. In R, the commandArgs() function can do this. Google \"command line arguments in ______\" to find the right syntax for your language of choice! A first pass at adding arguments might look like this: #!/bin/bash grep ATOM $1 | wc -l > $2 Try running it as described above. Does it work? While we now have arguments, we have lost some of the readability of our script. The numbers $1 and $2 are not very meaningful in themselves! Let's rewrite the script to assign the arguments to meaningful variable names: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=$2 grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} Why curly brackets? You'll notice above that we started using curly brackets around our variables. While you technically don't need them ( $PDB_INPUT would also be fine), using them makes the name of the variable (compared to other text) completely clear. This is especially useful when combining variables with underscores. There is one final place where we could optimize this script. If we want our output files to always have the same naming convention, based on the input file name, then we shouldn't have a separate argument for that -- it's asking for typos. Instead, we should use variables inside the script to construct the output file name, based on the input file. That will look like this: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=atoms_${PDB_INPUT} grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} You may want to construct other variables, like paths and filenames in this way. But it depends on how you want to use the script! If we want the flexibility of specifying a custom output file name, then we should undo this last change so it can be treated as a separate argument.","title":"Incorporate Arguments"},{"location":"materials/software/part2-ex1-build-executable/#your-work","text":"Are you using a scripting language where you could add a header to your main script? If so, what should it be? What items in your main code or commands are changing? Do you need to add arguments to your code?","title":"Your Work"},{"location":"materials/software/part3-ex1-apptainer-recipes/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 3.1: Create an Apptainer Definition File \u00b6 Objective : Describe each major section of an Apptainer Definition file. Why learn this? : When building your own containers, it is helpful to understand the basic options and syntax of the \"build\" or definition file. Section Bootstrap/From %files %post %env Where to start \u00b6 Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 A custom container always is always built on an existing container. It is common to use a container on Docker Hub, or in this case, hub.opensciencegrid.org. These lines tell Apptainer to pull the pre-existing image from the hub, and to use it as the base for the container that will be built using this definition file. 
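The same two lines can point at any image in a supported registry. For instance, a build that needs R could start from an image that already ships it; here is a sketch assuming the rocker/r-ver image on Docker Hub (this image and tag are an illustration, not part of the original exercise):
Bootstrap: docker
From: rocker/r-ver:4.3.1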
When choosing a base container, try to find one that has most of what you need - for example, if you want to install R packages, try to find a container that already has R installed. Files needed for building or running \u00b6 %files source_code.tar.gz /opt install.R If you need specific files for the installation (like source code) or for the job to execute (like small data files or scripts), they can be copied into the container under the %files section. The first item on a line is what to copy (from your computer) and the optional second item is where it should be copied in the container. Normally the files being copied are in your local working directory where you run the build command. Commands to install \u00b6 %post apt-get update -y apt-get install -y \\ build-essential \\ cmake \\ g++ \\ r-base-dev install2.r tidyverse This is where most of the installation happens. You can use any shell command here that will work in the base container to install software. These commands might include: - Linux installation tools like apt or yum - Scripting specific installers like pip , conda or install.packages() - Shell commands like tar , configure , make Different distributions of Linux often have distinct sets of tools for installing software. The installers for various common Linux distributions are listed below: Ubuntu: apt or apt-get Debian: deb CentOS: yum A web search for \u201cinstall X on Y Linux\u201d is usually a good start for common software installation tasks. [^1] When installing to a custom location, do not install to a home directory. This is likely to get overwritten when the container is run. Instead, /opt is the best directory for custom installations. Environment \u00b6 %environment PATH=/opt/mycode/bin:$PATH JAVA_HOME=/opt/java-1.8 To set environment variables (especially useful for software in a custom location), use the %environment section of the definition file. [^1]: This text and previous list taken from Introduction to Docker","title":"3.1 - Create an Apptainer Definition Files"},{"location":"materials/software/part3-ex1-apptainer-recipes/#software-exercise-31-create-an-apptainer-definition-file","text":"Objective : Describe each major section of an Apptainer Definition file. Why learn this? : When building your own containers, it is helpful to understand the basic options and syntax of the \"build\" or definition file. Section Bootstrap/From %files %post %env","title":"Software Exercise 3.1: Create an Apptainer Definition File"},{"location":"materials/software/part3-ex1-apptainer-recipes/#where-to-start","text":"Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 A custom container always is always built on an existing container. It is common to use a container on Docker Hub, or in this case, hub.opensciencegrid.org. These lines tell Apptainer to pull the pre-existing image from the hub, and to use it as the base for the container that will be built using this definition file. When choosing a base container, try to find one that has most of what you need - for example, if you want to install R packages, try to find a container that already has R installed.","title":"Where to start"},{"location":"materials/software/part3-ex1-apptainer-recipes/#files-needed-for-building-or-running","text":"%files source_code.tar.gz /opt install.R If you need specific files for the installation (like source code) or for the job to execute (like small data files or scripts), they can be copied into the container under the %files section. 
The first item on a line is what to copy (from your computer) and the optional second item is where it should be copied in the container. Normally the files being copied are in your local working directory where you run the build command.","title":"Files needed for building or running"},{"location":"materials/software/part3-ex1-apptainer-recipes/#commands-to-install","text":"%post apt-get update -y apt-get install -y \\ build-essential \\ cmake \\ g++ \\ r-base-dev install2.r tidyverse This is where most of the installation happens. You can use any shell command here that will work in the base container to install software. These commands might include: - Linux installation tools like apt or yum - Scripting specific installers like pip , conda or install.packages() - Shell commands like tar , configure , make Different distributions of Linux often have distinct sets of tools for installing software. The installers for various common Linux distributions are listed below: Ubuntu: apt or apt-get Debian: deb CentOS: yum A web search for \u201cinstall X on Y Linux\u201d is usually a good start for common software installation tasks. [^1] When installing to a custom location, do not install to a home directory. This is likely to get overwritten when the container is run. Instead, /opt is the best directory for custom installations.","title":"Commands to install"},{"location":"materials/software/part3-ex1-apptainer-recipes/#environment","text":"%environment PATH=/opt/mycode/bin:$PATH JAVA_HOME=/opt/java-1.8 To set environment variables (especially useful for software in a custom location), use the %environment section of the definition file. [^1]: This text and previous list taken from Introduction to Docker","title":"Environment"},{"location":"materials/software/part3-ex2-docker-build/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 3.2: Build Your Own Docker Container (Optional) \u00b6 Objective : Build a custom Docker container with numpy and use it in a job Why learn this? : Docker containers can be run on both your laptop and OSPool. DockerHub also provides a convenient platform for sharing containers. If you want to use a custom container, run across platforms, and/or share a container amongst a group, building in Docker first is a good approach. Python Script \u00b6 For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. This exercise will walk you through the steps to build your own Docker container based on Python, with the numpy Python library added on. Getting Set Up \u00b6 Before building your own Docker container, you need to go through the following set up steps: Install Docker Dekstop on your computer. Docker Desktop page You may need to create a Docker Hub user name to download Docker Desktop; if not created at that step, create a user name for Docker Hub now. (Optional): Once Docker is up and running on your computer, you are welcome to take some time to explore the basics of downloading and running a container, as shown in the initial sections of this Docker lesson: Introduction to Docker However, this isn't strictly necessary for building your own container. 
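If you would like a quick confirmation that Docker Desktop is installed and running before you start building, a common sanity check (a suggestion, not part of the original exercise) is:
$ docker run hello-world
This pulls a tiny test image and prints a short greeting if everything is working.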
Building a Container \u00b6 In order to make our container reproducible, we will be using Docker's capability to build a container image from a specification file. First, create an empty build directory on your computer , not the Access Points. In the build directory, create a file called Dockerfile (no file extension!) with the following contents: # Start with this image as a \"base\". # It's as if all the commands that created that image were inserted here. # Always use a specific tag like \"4.10.3\", never \"latest\"! # The version referenced by \"latest\" can change, so the build will be # more stable when building from a specific version tag. FROM continuumio/miniconda3:4.10.3 # Use RUN to execute commands inside the image as it is being built up. RUN conda install --yes numpy # RUN multiple commands together. # Try to always \"clean up\" after yourself to reduce the final size of your image. RUN apt-get update \\ && apt-get --yes install --no-install-recommends graphviz \\ && apt-get --yes clean \\ && rm -rf /var/lib/apt/lists/* This is our specification file and provides Docker with the information it needs to build our new container. There are other options besides FROM and RUN ; see the Docker documentation for more information. Note that our container is starting from an existing container continuumio/miniconda3:4.10.3 . This container is produced by the continuumio organization; the number 4.10.3 indicates the container version. When we create our new container, we will want to use a similar naming scheme of: USERNAME/CONTAINER:VERSIONTAG In what follows, you will want to replace USERNAME with your DockerHub user name. The CONTAINER name and VERSIONTAG are your choice; in what follows, we will use py3-numpy as the container name and 2024-08 as the version tag. To build and name the new container, open a command line window on your computer where you can run Docker commands. Use the cd command to change your working directory to the build directory with the Dockerfile inside. $ docker build -t USERNAME/py3-numpy:2024-08 . Note the . at the end of the command! This indicates that we're using the current directory as our build environment, including the Dockerfile inside. Upload Container and Submit Job \u00b6 Right now the container image only exists on your computer. To use it in CHTC or elsewhere, it needs to be added to a public registry like Docker Hub. To put your container image in Docker Hub, use the docker push command on the command line: $ docker push USERNAME/py3-numpy:2024-08 If the push doesn't work, you may need to run docker login first, enter your Docker Hub username and password and then try the push again. Once your container image is in DockerHub, you can use it in jobs as described in Exercise 1.3 . Thanks to Josh Karpel for providing the original sample Dockerfile !","title":"3.2 - Build Your Own Docker Container"},{"location":"materials/software/part3-ex2-docker-build/#software-exercise-32-build-your-own-docker-container-optional","text":"Objective : Build a custom Docker container with numpy and use it in a job Why learn this? : Docker containers can be run on both your laptop and OSPool. DockerHub also provides a convenient platform for sharing containers. 
If you want to use a custom container, run across platforms, and/or share a container amongst a group, building in Docker first is a good approach.","title":"Software Exercise 3.2: Build Your Own Docker Container (Optional)"},{"location":"materials/software/part3-ex2-docker-build/#python-script","text":"For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. This exercise will walk you through the steps to build your own Docker container based on Python, with the numpy Python library added on.","title":"Python Script"},{"location":"materials/software/part3-ex2-docker-build/#getting-set-up","text":"Before building your own Docker container, you need to go through the following set up steps: Install Docker Dekstop on your computer. Docker Desktop page You may need to create a Docker Hub user name to download Docker Desktop; if not created at that step, create a user name for Docker Hub now. (Optional): Once Docker is up and running on your computer, you are welcome to take some time to explore the basics of downloading and running a container, as shown in the initial sections of this Docker lesson: Introduction to Docker However, this isn't strictly necessary for building your own container.","title":"Getting Set Up"},{"location":"materials/software/part3-ex2-docker-build/#building-a-container","text":"In order to make our container reproducible, we will be using Docker's capability to build a container image from a specification file. First, create an empty build directory on your computer , not the Access Points. In the build directory, create a file called Dockerfile (no file extension!) with the following contents: # Start with this image as a \"base\". # It's as if all the commands that created that image were inserted here. # Always use a specific tag like \"4.10.3\", never \"latest\"! # The version referenced by \"latest\" can change, so the build will be # more stable when building from a specific version tag. FROM continuumio/miniconda3:4.10.3 # Use RUN to execute commands inside the image as it is being built up. RUN conda install --yes numpy # RUN multiple commands together. # Try to always \"clean up\" after yourself to reduce the final size of your image. RUN apt-get update \\ && apt-get --yes install --no-install-recommends graphviz \\ && apt-get --yes clean \\ && rm -rf /var/lib/apt/lists/* This is our specification file and provides Docker with the information it needs to build our new container. There are other options besides FROM and RUN ; see the Docker documentation for more information. Note that our container is starting from an existing container continuumio/miniconda3:4.10.3 . This container is produced by the continuumio organization; the number 4.10.3 indicates the container version. When we create our new container, we will want to use a similar naming scheme of: USERNAME/CONTAINER:VERSIONTAG In what follows, you will want to replace USERNAME with your DockerHub user name. The CONTAINER name and VERSIONTAG are your choice; in what follows, we will use py3-numpy as the container name and 2024-08 as the version tag. To build and name the new container, open a command line window on your computer where you can run Docker commands. Use the cd command to change your working directory to the build directory with the Dockerfile inside. $ docker build -t USERNAME/py3-numpy:2024-08 . Note the . 
at the end of the command! This indicates that we're using the current directory as our build environment, including the Dockerfile inside.","title":"Building a Container"},{"location":"materials/software/part3-ex2-docker-build/#upload-container-and-submit-job","text":"Right now the container image only exists on your computer. To use it in CHTC or elsewhere, it needs to be added to a public registry like Docker Hub. To put your container image in Docker Hub, use the docker push command on the command line: $ docker push USERNAME/py3-numpy:2024-08 If the push doesn't work, you may need to run docker login first, enter your Docker Hub username and password and then try the push again. Once your container image is in DockerHub, you can use it in jobs as described in Exercise 1.3 . Thanks to Josh Karpel for providing the original sample Dockerfile !","title":"Upload Container and Submit Job"},{"location":"materials/software/part4-ex1-download/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 4.1: Using a Pre-compiled Binary \u00b6 Objective : Identify software that can be downloaded; download it and use it to run a job. Why learn this? : Some software doesn't require much \"installation\" - you can just download it and run. Recognizing when this is possible can save you time. Our Software Example \u00b6 The software we will be using for this example is a common tool for aligning genome and protein sequences against a reference database, the BLAST program. Search the internet for the BLAST software. Searches might include \"blast executable or \"download blast software\". Hopefully these searches will lead you to a BLAST website page that looks like this: Click on the title that says \"Download BLAST\" and then look for the link that has the latest installation and source code . This will either open a page in a web browser that looks like this: Or you will be asked to open the link in your file browser (choose the Connect as Guest option): In either case, you should end up on a page with a list of each version of BLAST that is available for different operating systems. We could download the source and compile it ourselves, but instead, we're going to use one of the pre-built binaries. Before proceeding, look at the list of downloads and try to determine which one you want. Based on our operating system, we want to use the Linux binary, which is labelled with the x64-linux suffix. All the other links are either for source code or other operating systems. On the Access Point, create a directory for this exercise. Then download the appropriate tar.gz file and un-tar/decompress it it. If you want to do this all from the command line, the sequence will look like this (using wget as the download command.) user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/ncbi-blast-2.15.0+-x64-linux.tar.gz user@login $ tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz We're going to be using the blastx binary in our job. Where is it in the directory you just decompressed? Copy the Input Files \u00b6 To run BLAST, we need an input file and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. 
Download these files to your current directory: username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: username@login $ tar -xzf pdbaa.tar.gz Submitting the Job \u00b6 We now have our program (the pre-compiled blastx binary) and our input files, so all that remains is to create the submit file. The form of a typical blastx command looks something like this: blastx -db -query -out Copy a submit file from one of the Day 1 exercises or previous software exercises to use for this exercise. Think about which lines you will need to change or add to your submit file in order to submit the job successfully. In particular: What is the executable? How can you indicate the entire command line sequence above? Which files need to be transferred in addition to the executable? Does this job require a certain type of operating system? Do you have any idea how much memory or disk to request? Try to answer these questions and modify your submit file appropriately. Once you have done all you can, check your submit file against the lines below, which contain the exact components to run this particular job. The executable is blastx , which is located in the bin directory of our downloaded BLAST directory. We need to use the arguments line in the submit file to express the rest of the command. executable = ncbi-blast-2.15.0+/bin/blastx arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt The BLAST program requires our input file and database, so they must be transferred with transfer_input_files . transfer_input_files = pdbaa, mouse.fa Let's assume that we've run this program before, and we know that 1GB of disk and 1GB of memory will be MORE than enough (the 'log' file will tell us how accurate we are, after the job runs): request_memory = 1GB request_disk = 1GB Submit the blast job using condor_submit . Once the job starts, it should run in just a few minutes and produce a file called results.txt .","title":"4.1 - Download and Use Compiled Software"},{"location":"materials/software/part4-ex1-download/#software-exercise-41-using-a-pre-compiled-binary","text":"Objective : Identify software that can be downloaded; download it and use it to run a job. Why learn this? : Some software doesn't require much \"installation\" - you can just download it and run. Recognizing when this is possible can save you time.","title":"Software Exercise 4.1: Using a Pre-compiled Binary"},{"location":"materials/software/part4-ex1-download/#our-software-example","text":"The software we will be using for this example is a common tool for aligning genome and protein sequences against a reference database, the BLAST program. Search the internet for the BLAST software. Searches might include \"blast executable or \"download blast software\". Hopefully these searches will lead you to a BLAST website page that looks like this: Click on the title that says \"Download BLAST\" and then look for the link that has the latest installation and source code . This will either open a page in a web browser that looks like this: Or you will be asked to open the link in your file browser (choose the Connect as Guest option): In either case, you should end up on a page with a list of each version of BLAST that is available for different operating systems. We could download the source and compile it ourselves, but instead, we're going to use one of the pre-built binaries. 
Before proceeding, look at the list of downloads and try to determine which one you want. Based on our operating system, we want to use the Linux binary, which is labelled with the x64-linux suffix. All the other links are either for source code or other operating systems. On the Access Point, create a directory for this exercise. Then download the appropriate tar.gz file and un-tar/decompress it it. If you want to do this all from the command line, the sequence will look like this (using wget as the download command.) user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/ncbi-blast-2.15.0+-x64-linux.tar.gz user@login $ tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz We're going to be using the blastx binary in our job. Where is it in the directory you just decompressed?","title":"Our Software Example"},{"location":"materials/software/part4-ex1-download/#copy-the-input-files","text":"To run BLAST, we need an input file and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Download these files to your current directory: username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: username@login $ tar -xzf pdbaa.tar.gz","title":"Copy the Input Files"},{"location":"materials/software/part4-ex1-download/#submitting-the-job","text":"We now have our program (the pre-compiled blastx binary) and our input files, so all that remains is to create the submit file. The form of a typical blastx command looks something like this: blastx -db -query -out Copy a submit file from one of the Day 1 exercises or previous software exercises to use for this exercise. Think about which lines you will need to change or add to your submit file in order to submit the job successfully. In particular: What is the executable? How can you indicate the entire command line sequence above? Which files need to be transferred in addition to the executable? Does this job require a certain type of operating system? Do you have any idea how much memory or disk to request? Try to answer these questions and modify your submit file appropriately. Once you have done all you can, check your submit file against the lines below, which contain the exact components to run this particular job. The executable is blastx , which is located in the bin directory of our downloaded BLAST directory. We need to use the arguments line in the submit file to express the rest of the command. executable = ncbi-blast-2.15.0+/bin/blastx arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt The BLAST program requires our input file and database, so they must be transferred with transfer_input_files . transfer_input_files = pdbaa, mouse.fa Let's assume that we've run this program before, and we know that 1GB of disk and 1GB of memory will be MORE than enough (the 'log' file will tell us how accurate we are, after the job runs): request_memory = 1GB request_disk = 1GB Submit the blast job using condor_submit . 
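For reference, a sketch of the assembled submit file is below; the executable, arguments, transfer_input_files, and request lines come from the exercise, while the log/output/error filenames and the request_cpus value are assumptions:

executable = ncbi-blast-2.15.0+/bin/blastx
arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt
transfer_input_files = pdbaa, mouse.fa
log = blast.log
output = blast.out
error = blast.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue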
Once the job starts, it should run in just a few minutes and produce a file called results.txt .","title":"Submitting the Job"},{"location":"materials/software/part4-ex2-wrapper/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 4.2: Writing a Wrapper Script \u00b6 Objective : Run downloaded software files via an intermediate, \"wrapper\" script. Why learn this? : This change is a good test of your general HTCondor knowledge and how to translate between executable and submit file. Using wrapper scripts is also a common practice for managing what happens in a job. Background \u00b6 Wrapper scripts are a useful tool for running software that can't be compiled into one piece, needs to be installed with every job, or just for running extra steps. A wrapper script can either install the software from the source code, or use an already existing software (as in this exercise). Not only does this portability technique work with almost any kind of software that can be locally installed, it also allows for a great deal of control and flexibility for what happens within your job. Once you can write a script to handle your software (and often your data as well), you can submit a large variety of workflows to a distributed computing system like the Open Science Grid. For this exercise, we will write a wrapper script as an alternate way to run the same job as the previous exercise. Wrapper Script, part 1 \u00b6 Our wrapper script will be a bash script that runs several commands. In the same directory as the last exercise, make a file called run_blast.sh . The first line we'll place in the script is the basic command for running blast. Based on our previous submit file, what command needs to go into the script? Once you have an idea, check against the example below: #!/bin/bash ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt Submit File Changes \u00b6 We now need to make some changes to our submit file. Make a copy of your previous submit file and open it to edit. Since we are now using a wrapper script, that will be our job's executable. Replace the original blastx exeuctable with the name of our wrapper script and comment out the arguments line. executable = run_blast.sh #arguments = Note that since the blastx program is no longer listed as the executable, it will be need to be included in transfer_input_files . Instead of transferring just that program, we will transfer the original downloaded tar.gz file. To achieve efficiency, we'll also transfer the pdbaa database as the original tar.gz file instead of as the unzipped folder: transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual \"Usage\" values in the log. Before submitting, make sure to make the below additional changes to the wrapper script! Wrapper Script, part 2 \u00b6 Now that our database and BLAST software are being transferred to the job as tar.gz files, our script needs to accommodate. Opening your run_blast.sh script, add two commands at the start to un-tar the BLAST and pdbaa tar.gz files. See the previous exercise if you're not sure what these commands looks like. In order to distinguish this job from our previous job, change the output file name to something besides results.txt . 
The completed script run_blast.sh should look like this: #!/bin/bash tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: username@login $ chmod u+x run_blast.sh Your job is now ready to submit. Submit it using condor_submit and monitor using condor_q .","title":"4.2 - Use a Wrapper Script To Run Software"},{"location":"materials/software/part4-ex2-wrapper/#software-exercise-42-writing-a-wrapper-script","text":"Objective : Run downloaded software files via an intermediate, \"wrapper\" script. Why learn this? : This change is a good test of your general HTCondor knowledge and how to translate between executable and submit file. Using wrapper scripts is also a common practice for managing what happens in a job.","title":"Software Exercise 4.2: Writing a Wrapper Script"},{"location":"materials/software/part4-ex2-wrapper/#background","text":"Wrapper scripts are a useful tool for running software that can't be compiled into one piece, needs to be installed with every job, or just for running extra steps. A wrapper script can either install the software from the source code, or use an already existing software (as in this exercise). Not only does this portability technique work with almost any kind of software that can be locally installed, it also allows for a great deal of control and flexibility for what happens within your job. Once you can write a script to handle your software (and often your data as well), you can submit a large variety of workflows to a distributed computing system like the Open Science Grid. For this exercise, we will write a wrapper script as an alternate way to run the same job as the previous exercise.","title":"Background"},{"location":"materials/software/part4-ex2-wrapper/#wrapper-script-part-1","text":"Our wrapper script will be a bash script that runs several commands. In the same directory as the last exercise, make a file called run_blast.sh . The first line we'll place in the script is the basic command for running blast. Based on our previous submit file, what command needs to go into the script? Once you have an idea, check against the example below: #!/bin/bash ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt","title":"Wrapper Script, part 1"},{"location":"materials/software/part4-ex2-wrapper/#submit-file-changes","text":"We now need to make some changes to our submit file. Make a copy of your previous submit file and open it to edit. Since we are now using a wrapper script, that will be our job's executable. Replace the original blastx exeuctable with the name of our wrapper script and comment out the arguments line. executable = run_blast.sh #arguments = Note that since the blastx program is no longer listed as the executable, it will be need to be included in transfer_input_files . Instead of transferring just that program, we will transfer the original downloaded tar.gz file. To achieve efficiency, we'll also transfer the pdbaa database as the original tar.gz file instead of as the unzipped folder: transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual \"Usage\" values in the log. 
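Put together, the modified submit file might look like this sketch; the log/output/error names are placeholders, and the 1GB requests should be replaced with values based on your own log file:

executable = run_blast.sh
transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz
log = blast_wrapper.log
output = blast_wrapper.out
error = blast_wrapper.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue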
Before submitting, make sure to make the below additional changes to the wrapper script!","title":"Submit File Changes"},{"location":"materials/software/part4-ex2-wrapper/#wrapper-script-part-2","text":"Now that our database and BLAST software are being transferred to the job as tar.gz files, our script needs to accommodate. Opening your run_blast.sh script, add two commands at the start to un-tar the BLAST and pdbaa tar.gz files. See the previous exercise if you're not sure what these commands looks like. In order to distinguish this job from our previous job, change the output file name to something besides results.txt . The completed script run_blast.sh should look like this: #!/bin/bash tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: username@login $ chmod u+x run_blast.sh Your job is now ready to submit. Submit it using condor_submit and monitor using condor_q .","title":"Wrapper Script, part 2"},{"location":"materials/software/part4-ex3-arguments/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 4.3: Passing Arguments Through the Wrapper Script \u00b6 Objective : Add arguments to a wrapper script to make it more flexible and modular Why learn this? : Using script arguments will allow you to use the same script for multiple jobs, by providing different inputs or parameters. These arguments are normally passed on the command line, but in our world of job submission, the arguments will be listed in the submit file, in the arguments line. Identifying Potential Arguments \u00b6 In the same directory as the last exercise, make sure you're in the directory with your BLAST job submission. What values might we want to input to the script via arguments? Hint: anything that we might want to change if we were to run the script many times. In this example, some values we might want to change are the name of the comparison database, the input file, and the output file. Modifying Files \u00b6 We are going to add three arguments to the wrapper script, controlling the database, input and output file. Make a copy of your last submit file and open it for editing. Add an arguments line, or uncomment the one that exists, and add the three input values mentioned above. The arguments line in your submit file should look like this: arguments = pdbaa mouse.fa results3.txt (We're using results3.txt ) to distinguish between the previous two runs.) For bash (the language of our current wrapper script), the variables $1 , $2 and $3 represent the first, second, and third arguments, respectively. Thus, in the main command of the script, replace the various names with these variables: ./ncbi-blast-2.15.0+/bin/blastx -db $1 / $1 -query $2 -out $3 If your wrapper script is in a different language, you should use that language's syntax for reading in variables from the command line. Once these changes are made, submit your jobs with condor_submit . Use condor_q -nobatch to see what the job command looks like to HTCondor. It is now easy to change the inputs for the job; we can write them into the arguments line of the submit file and they will be propagated to the command in the wrapper script. We can even turn the submit file arguments into their own variables when submitting multiple jobs at once. 
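As a sketch of that idea (the job_list.txt file and the input/output names inside it are hypothetical, not part of this exercise), the arguments line can reference submit variables that the queue statement fills in for each job:

arguments = pdbaa $(infile) $(outfile)
queue infile, outfile from job_list.txt

Here each line of job_list.txt would list one input fasta file and one output name, for example "mouse.fa results_mouse.txt".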
Readability with Variables \u00b6 One of the downsides of this approach, is that our command has become harder to read. The original script contains all the information at a glance: ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt But our new version is more cryptic -- what is $1 ?: ./ncbi-blast-2.15.0+/bin/blastx -db $1 -query $2 -out $3 One way to overcome this is to create our own variable names inside the wrapper script and assign the argument values to them. Here is an example for our BLAST script: #!/bin/bash DATABASE = $1 INFILE = $2 OUTFILE = $3 tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db $DATABASE / $DATABASE -query $INFILE -out $OUTFILE Here, we are assigning the input arguments ( $1 , $2 and $3 ) to new variable names, and then using those names ( $DATABASE , $INFILE , and $OUTFILE ) in the command, which is easier to read. Edit your script to match the above syntax. Submit your jobs with condor_submit . When the job finishes, look at the job's standard output file to see how the variables printed.","title":"4.3 - Using Arguments With Wrapper Scripts"},{"location":"materials/software/part4-ex3-arguments/#software-exercise-43-passing-arguments-through-the-wrapper-script","text":"Objective : Add arguments to a wrapper script to make it more flexible and modular Why learn this? : Using script arguments will allow you to use the same script for multiple jobs, by providing different inputs or parameters. These arguments are normally passed on the command line, but in our world of job submission, the arguments will be listed in the submit file, in the arguments line.","title":"Software Exercise 4.3: Passing Arguments Through the Wrapper Script"},{"location":"materials/software/part4-ex3-arguments/#identifying-potential-arguments","text":"In the same directory as the last exercise, make sure you're in the directory with your BLAST job submission. What values might we want to input to the script via arguments? Hint: anything that we might want to change if we were to run the script many times. In this example, some values we might want to change are the name of the comparison database, the input file, and the output file.","title":"Identifying Potential Arguments"},{"location":"materials/software/part4-ex3-arguments/#modifying-files","text":"We are going to add three arguments to the wrapper script, controlling the database, input and output file. Make a copy of your last submit file and open it for editing. Add an arguments line, or uncomment the one that exists, and add the three input values mentioned above. The arguments line in your submit file should look like this: arguments = pdbaa mouse.fa results3.txt (We're using results3.txt ) to distinguish between the previous two runs.) For bash (the language of our current wrapper script), the variables $1 , $2 and $3 represent the first, second, and third arguments, respectively. Thus, in the main command of the script, replace the various names with these variables: ./ncbi-blast-2.15.0+/bin/blastx -db $1 / $1 -query $2 -out $3 If your wrapper script is in a different language, you should use that language's syntax for reading in variables from the command line. Once these changes are made, submit your jobs with condor_submit . Use condor_q -nobatch to see what the job command looks like to HTCondor. 
It is now easy to change the inputs for the job; we can write them into the arguments line of the submit file and they will be propagated to the command in the wrapper script. We can even turn the submit file arguments into their own variables when submitting multiple jobs at once.","title":"Modifying Files"},{"location":"materials/software/part4-ex3-arguments/#readability-with-variables","text":"One of the downsides of this approach, is that our command has become harder to read. The original script contains all the information at a glance: ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt But our new version is more cryptic -- what is $1 ?: ./ncbi-blast-2.15.0+/bin/blastx -db $1 -query $2 -out $3 One way to overcome this is to create our own variable names inside the wrapper script and assign the argument values to them. Here is an example for our BLAST script: #!/bin/bash DATABASE = $1 INFILE = $2 OUTFILE = $3 tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db $DATABASE / $DATABASE -query $INFILE -out $OUTFILE Here, we are assigning the input arguments ( $1 , $2 and $3 ) to new variable names, and then using those names ( $DATABASE , $INFILE , and $OUTFILE ) in the command, which is easier to read. Edit your script to match the above syntax. Submit your jobs with condor_submit . When the job finishes, look at the job's standard output file to see how the variables printed.","title":"Readability with Variables"},{"location":"materials/software/part5-ex1-prepackaged/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.1: Pre-package a Research Code \u00b6 Objective : Install software (HMMER) to a folder and run it in a job using a wrapper script. Why learn this? : If not using a container, this is a template for how to create a portable software installation using your own files, especially if the software is not available already compiled for Linux. Our Software Example \u00b6 For this exercise, we will be using the bioinformatics package HMMER. HMMER is a good example of software that is not compiled to a single executable; it has multiple executables as well as a helper library. Create a directory for this exercise on the Access Point. Do an internet search to find the HMMER software downloads page and the installation instructions page. On the installation page, there are short instructions for how to install HMMER. There are two options shown for installation -- which should we use? For the purposes of this example, we are going to use the instructions under the \"Current version\" heading, with the \"Source\" link. Download the HMMER source using wget. Go back to the installation documentation page and look at the steps for compiling from source. This process should be similar to what was described in the lecture! Installation \u00b6 Normally, it is better to install software on a dedicated \"build\" server, but for this example, we are going to compile directly on the Access Point Before we follow the installation instructions, we should create a directory to hold our installation. You can create this in the current directory. username@host $ mkdir hmmer-build Now run the commands to unpack the source code: username@host $ tar -zxf hmmer-3.4.tar.gz username@host $ cd hmmer-3.4 Now we can follow the second set of installation instructions. 
For the prefix, we'll use the variable $PWD to capture the name of our current working directory and then a relative path to the hmmer-build directory we created in step 1: username@host $ ./configure --prefix = $PWD /../hmmer-build username@host $ make username@host $ make install Go back to the previous working directory : username@host $ cd .. and confirm that our installation procedure created bin , lib , and share directories in the hmmer-build folder: username@host $ ls hmmer-build bin share Now we want to package up our installation, so we can use it in other jobs. We can do this by compressing any necessary directories into a single gzipped tarball. username@host $ tar -czf hmmer-build.tar.gz hmmer-build Note that we now have two tarballs in our directory -- the source tarball ( hmmer.tar.gz ), which we will no longer need and our newly built installation ( hmmer-build.tar.gz ) which is what we will actually be using to run jobs. Wrapper Script \u00b6 Now that we've created our portable installation, we need to write a script that opens and uses the installation, similar to the process we used in a previous exercise . These steps should be performed back on the submit server ( ap1.facility.path-cc.io ). Create a script called run_hmmer.sh . The script will first need to untar our installation, so the script should start out like this: #!/bin/bash tar -xzf hmmer-build.tar.gz We're going to use the same $PWD trick from the installation in order to tell the computer how to find HMMER. We will do this by setting the PATH environment variable, to include the directory where HMMER is installed: export PATH = $PWD /hmmer-build/bin: $PATH Finally, the wrapper script needs to not only setup HMMER, but actually run the program. Add the following lines to your run_hmmer.sh wrapper script. hmmbuild globins4.hmm globins4.sto hmmsearch -o search-results.txt globins4.hmm globins45.fa Make sure the wrapper script has executable permissions: username@login $ chmod u+x run_hmmer.sh Run a HMMER job \u00b6 We're almost ready! We need two more pieces to run a HMMER job. We're going to use some of the tutorial files provided with the HMMER download to run the job. You already have these files back in the directory where you unpacked the source code: username@login $ ls hmmer-3.4/tutorial 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto If you don't see these files, you may want to redownload the hmmer.tar.gz file and untar it here. Our last step is to create a submit file for our HMMER job. Think about which lines this submit file will need. Make a copy of a previous submit file (you could use the blast submit file from a previous exercise as a base) and modify it as you think necessary. The two most important lines to modify for this job are listed below; check them against your own submit file: executable = run_hmmer.sh transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/ A wrapper script will always be a job's executable . When using a wrapper script, you must also always remember to transfer the software/source code using transfer_input_files . Note The / in the transfer_input_files line indicates that we are transferring the contents of that directory (which in this case, is what we want), rather than the directory itself. Submit the job with condor_submit . Once the job completes, it should produce a search-results.txt file. 
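As a reference point, a complete submit file for the HMMER job might look like the sketch below; only the executable and transfer_input_files lines are given by the exercise, and the remaining filenames and resource requests are assumptions:

executable = run_hmmer.sh
transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/
log = hmmer.log
output = hmmer.out
error = hmmer.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue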
Note For a very similar compiling example, see this guide on how to compile samtools : Example Software Compilation","title":"5.1 - Compiling a Research Software"},{"location":"materials/software/part5-ex1-prepackaged/#software-exercise-51-pre-package-a-research-code","text":"Objective : Install software (HMMER) to a folder and run it in a job using a wrapper script. Why learn this? : If not using a container, this is a template for how to create a portable software installation using your own files, especially if the software is not available already compiled for Linux.","title":"Software Exercise 5.1: Pre-package a Research Code"},{"location":"materials/software/part5-ex1-prepackaged/#our-software-example","text":"For this exercise, we will be using the bioinformatics package HMMER. HMMER is a good example of software that is not compiled to a single executable; it has multiple executables as well as a helper library. Create a directory for this exercise on the Access Point. Do an internet search to find the HMMER software downloads page and the installation instructions page. On the installation page, there are short instructions for how to install HMMER. There are two options shown for installation -- which should we use? For the purposes of this example, we are going to use the instructions under the \"Current version\" heading, with the \"Source\" link. Download the HMMER source using wget. Go back to the installation documentation page and look at the steps for compiling from source. This process should be similar to what was described in the lecture!","title":"Our Software Example"},{"location":"materials/software/part5-ex1-prepackaged/#installation","text":"Normally, it is better to install software on a dedicated \"build\" server, but for this example, we are going to compile directly on the Access Point Before we follow the installation instructions, we should create a directory to hold our installation. You can create this in the current directory. username@host $ mkdir hmmer-build Now run the commands to unpack the source code: username@host $ tar -zxf hmmer-3.4.tar.gz username@host $ cd hmmer-3.4 Now we can follow the second set of installation instructions. For the prefix, we'll use the variable $PWD to capture the name of our current working directory and then a relative path to the hmmer-build directory we created in step 1: username@host $ ./configure --prefix = $PWD /../hmmer-build username@host $ make username@host $ make install Go back to the previous working directory : username@host $ cd .. and confirm that our installation procedure created bin , lib , and share directories in the hmmer-build folder: username@host $ ls hmmer-build bin share Now we want to package up our installation, so we can use it in other jobs. We can do this by compressing any necessary directories into a single gzipped tarball. username@host $ tar -czf hmmer-build.tar.gz hmmer-build Note that we now have two tarballs in our directory -- the source tarball ( hmmer.tar.gz ), which we will no longer need and our newly built installation ( hmmer-build.tar.gz ) which is what we will actually be using to run jobs.","title":"Installation"},{"location":"materials/software/part5-ex1-prepackaged/#wrapper-script","text":"Now that we've created our portable installation, we need to write a script that opens and uses the installation, similar to the process we used in a previous exercise . These steps should be performed back on the submit server ( ap1.facility.path-cc.io ). 
Create a script called run_hmmer.sh . The script will first need to untar our installation, so the script should start out like this: #!/bin/bash tar -xzf hmmer-build.tar.gz We're going to use the same $PWD trick from the installation in order to tell the computer how to find HMMER. We will do this by setting the PATH environment variable, to include the directory where HMMER is installed: export PATH = $PWD /hmmer-build/bin: $PATH Finally, the wrapper script needs to not only setup HMMER, but actually run the program. Add the following lines to your run_hmmer.sh wrapper script. hmmbuild globins4.hmm globins4.sto hmmsearch -o search-results.txt globins4.hmm globins45.fa Make sure the wrapper script has executable permissions: username@login $ chmod u+x run_hmmer.sh","title":"Wrapper Script"},{"location":"materials/software/part5-ex1-prepackaged/#run-a-hmmer-job","text":"We're almost ready! We need two more pieces to run a HMMER job. We're going to use some of the tutorial files provided with the HMMER download to run the job. You already have these files back in the directory where you unpacked the source code: username@login $ ls hmmer-3.4/tutorial 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto If you don't see these files, you may want to redownload the hmmer.tar.gz file and untar it here. Our last step is to create a submit file for our HMMER job. Think about which lines this submit file will need. Make a copy of a previous submit file (you could use the blast submit file from a previous exercise as a base) and modify it as you think necessary. The two most important lines to modify for this job are listed below; check them against your own submit file: executable = run_hmmer.sh transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/ A wrapper script will always be a job's executable . When using a wrapper script, you must also always remember to transfer the software/source code using transfer_input_files . Note The / in the transfer_input_files line indicates that we are transferring the contents of that directory (which in this case, is what we want), rather than the directory itself. Submit the job with condor_submit . Once the job completes, it should produce a search-results.txt file. Note For a very similar compiling example, see this guide on how to compile samtools : Example Software Compilation","title":"Run a HMMER job"},{"location":"materials/software/part5-ex2-python/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.2: Using Python, Pre-Built \u00b6 In this exercise, you will install Python, package your installation, and then use it to run jobs. It should take about 20 minutes. Background \u00b6 Objective : Install software (Python) to a folder and run it in a job using a wrapper script. Why learn this? : This is very similar to the previous exercise . Pre-Building \u00b6 The first step in our job process is building a Python installation that we can package up. Create a directory for this exercise on the Access Point and cd into it. Download the Python source code from https://www.python.org/ . username@login $ wget https://www.python.org/ftp/python/3.10.5/Python-3.10.5.tgz First, we have to determine how to install Python to a specific location in our working directory. 
Untar the Python source tarball ( tar -xzf Python-3.10.5.tgz ) and look at the README.rst file in the Python-3.10.5 directory ( cd Python-3.10.5 ). You'll want to look for the \"Build Instructions\" header. What will the main installation steps be? What command is required for the final installation? Once you've tried to answer these questions, move to the next step. There are some basic installation instructions near the top of the README . Based on that short introduction, we can see the main steps of installation will be: ./configure make make test sudo make install This three-stage process (configure, make, make install) is a common way to install many software packages. The default installation location for Python requires sudo (administrative privileges) to install. However, we'd like to install to a specific location in the working directory so that we can compress that installation directory into a tarball. You can often use an option called -prefix with the configure script to change the default installation directory. Let's see if the Python configure script has this option by using the \"help\" option (as suggested in the README.rst file): username@host $ ./configure --help Sure enough, there's a list of all the different options that can be passed to the configure script, which includes --prefix . (To see the --prefix option, you may need to scroll towards the top of the output.) Therefore, we can use the $PWD command in order to set the path correctly to a custom installation directory. Now let's actually install Python! From the original working directory , create a directory to hold the installation. username@host $ cd ../ username@host $ mkdir python310 Move into the Python-3.10.5 directory and run the installation commands. These may take a few minutes each. username@host $ cd Python-3.10.5 username@host $ ./configure --prefix = $PWD /../python310 username@host $ make username@host $ make install Note The installation instructions in the README.rst file have a make test step between the make and make install steps. As this step isn't strictly necessary (and takes a long time), it's been omitted above. If I move back to the main job working directory, and look in the python subdirectory, I should see a Python installation. username@host $ cd .. username@host $ ls python310/ bin include lib share I have successfully created a self-contained Python installation. Now it just needs to be tarred up! username@host $ tar -czf prebuilt_python.tar.gz python310/ We might want to know how we installed Python for later reference. Enter the following commands to save our history to a file: username@host $ history > python_install.txt Python Script \u00b6 Create a script with the following lines called fib.py . import sys import os if len ( sys . argv ) != 2 : print ( 'Usage: %s MAXIMUM' % ( os . path . basename ( sys . argv [ 0 ]))) sys . exit ( 1 ) maximum = int ( sys . argv [ 1 ]) n1 = n2 = 1 while n2 <= maximum : n1 , n2 = n2 , n1 + n2 print ( 'The greatest Fibonacci number up to %d is %d ' % ( maximum , n1 )) What command line arguments does this script take? Try running it on the submit server. Wrapper Script \u00b6 We now have our Python installation and our Python script - we just need to write a wrapper script to run them. What steps do you think the wrapper script needs to perform? Create a file called run_fib.sh and write them out in plain English before moving to the next step. 
Our script will need to untar our prebuilt_python.tar.gz file access the python command from our installation to run our fib.py script Try turning your plain English steps into commands that the computer can run. Your final run_fib.sh script should look something like this: #!/bin/bash tar -xzf prebuilt_python.tar.gz python310/bin/python3 fib.py 90 or #!/bin/bash tar -xzf prebuilt_python.tar.gz export PATH = $( pwd ) /python310/bin: $PATH python3 fib.py 90 Make sure your run_fib.sh script is executable. Submit File \u00b6 Make a copy of a previous submit file in your local directory (the submit file from the Use a Wrapper Script exercise might be a good candidate). What changes need to be made to run this Python job? Modify your submit file, then make sure you've included the key lines below: executable = run_fib.sh transfer_input_files = fib.py, prebuilt_python.tar.gz Submit the job using condor_submit . Check the .out file to see if the job completed.","title":"5.2 - Compiling Python and Running Jobs"},{"location":"materials/software/part5-ex2-python/#software-exercise-52-using-python-pre-built","text":"In this exercise, you will install Python, package your installation, and then use it to run jobs. It should take about 20 minutes.","title":"Software Exercise 5.2: Using Python, Pre-Built"},{"location":"materials/software/part5-ex2-python/#background","text":"Objective : Install software (Python) to a folder and run it in a job using a wrapper script. Why learn this? : This is very similar to the previous exercise .","title":"Background"},{"location":"materials/software/part5-ex2-python/#pre-building","text":"The first step in our job process is building a Python installation that we can package up. Create a directory for this exercise on the Access Point and cd into it. Download the Python source code from https://www.python.org/ . username@login $ wget https://www.python.org/ftp/python/3.10.5/Python-3.10.5.tgz First, we have to determine how to install Python to a specific location in our working directory. Untar the Python source tarball ( tar -xzf Python-3.10.5.tgz ) and look at the README.rst file in the Python-3.10.5 directory ( cd Python-3.10.5 ). You'll want to look for the \"Build Instructions\" header. What will the main installation steps be? What command is required for the final installation? Once you've tried to answer these questions, move to the next step. There are some basic installation instructions near the top of the README . Based on that short introduction, we can see the main steps of installation will be: ./configure make make test sudo make install This three-stage process (configure, make, make install) is a common way to install many software packages. The default installation location for Python requires sudo (administrative privileges) to install. However, we'd like to install to a specific location in the working directory so that we can compress that installation directory into a tarball. You can often use an option called -prefix with the configure script to change the default installation directory. Let's see if the Python configure script has this option by using the \"help\" option (as suggested in the README.rst file): username@host $ ./configure --help Sure enough, there's a list of all the different options that can be passed to the configure script, which includes --prefix . (To see the --prefix option, you may need to scroll towards the top of the output.) 
Therefore, we can use the $PWD command in order to set the path correctly to a custom installation directory. Now let's actually install Python! From the original working directory , create a directory to hold the installation. username@host $ cd ../ username@host $ mkdir python310 Move into the Python-3.10.5 directory and run the installation commands. These may take a few minutes each. username@host $ cd Python-3.10.5 username@host $ ./configure --prefix = $PWD /../python310 username@host $ make username@host $ make install Note The installation instructions in the README.rst file have a make test step between the make and make install steps. As this step isn't strictly necessary (and takes a long time), it's been omitted above. If I move back to the main job working directory, and look in the python subdirectory, I should see a Python installation. username@host $ cd .. username@host $ ls python310/ bin include lib share I have successfully created a self-contained Python installation. Now it just needs to be tarred up! username@host $ tar -czf prebuilt_python.tar.gz python310/ We might want to know how we installed Python for later reference. Enter the following commands to save our history to a file: username@host $ history > python_install.txt","title":"Pre-Building"},{"location":"materials/software/part5-ex2-python/#python-script","text":"Create a script with the following lines called fib.py . import sys import os if len ( sys . argv ) != 2 : print ( 'Usage: %s MAXIMUM' % ( os . path . basename ( sys . argv [ 0 ]))) sys . exit ( 1 ) maximum = int ( sys . argv [ 1 ]) n1 = n2 = 1 while n2 <= maximum : n1 , n2 = n2 , n1 + n2 print ( 'The greatest Fibonacci number up to %d is %d ' % ( maximum , n1 )) What command line arguments does this script take? Try running it on the submit server.","title":"Python Script"},{"location":"materials/software/part5-ex2-python/#wrapper-script","text":"We now have our Python installation and our Python script - we just need to write a wrapper script to run them. What steps do you think the wrapper script needs to perform? Create a file called run_fib.sh and write them out in plain English before moving to the next step. Our script will need to untar our prebuilt_python.tar.gz file access the python command from our installation to run our fib.py script Try turning your plain English steps into commands that the computer can run. Your final run_fib.sh script should look something like this: #!/bin/bash tar -xzf prebuilt_python.tar.gz python310/bin/python3 fib.py 90 or #!/bin/bash tar -xzf prebuilt_python.tar.gz export PATH = $( pwd ) /python310/bin: $PATH python3 fib.py 90 Make sure your run_fib.sh script is executable.","title":"Wrapper Script"},{"location":"materials/software/part5-ex2-python/#submit-file","text":"Make a copy of a previous submit file in your local directory (the submit file from the Use a Wrapper Script exercise might be a good candidate). What changes need to be made to run this Python job? Modify your submit file, then make sure you've included the key lines below: executable = run_fib.sh transfer_input_files = fib.py, prebuilt_python.tar.gz Submit the job using condor_submit . 
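Assembled, the submit file might look something like this sketch (only the executable and transfer_input_files lines come from the exercise; the log/output/error names and resource requests are assumptions):

executable = run_fib.sh
transfer_input_files = fib.py, prebuilt_python.tar.gz
log = fib.log
output = fib.out
error = fib.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue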
Check the .out file to see if the job completed.","title":"Submit File"},{"location":"materials/software/part5-ex3-conda/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.3: Using Conda Environments \u00b6 Objective : Create a portable conda environment and use it in a job. Why learn this? : If you normally use conda to manage your Python environments, this method of software portability offers great similarity to your usual practices. Introduction \u00b6 Many Python users manage their Python installation and environments with either the Anaconda or miniconda distributions. These distribution tools are great for creating portable Python installations and can be used on HTC systems with some help from a tool called conda pack . Sample Script \u00b6 For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. Create and Pack a Conda Environment \u00b6 (For a generic version of these instructions, see the CHTC User Guide ) Our first step is to create a miniconda installation on the submit server. You should be logged into whichever server you made the rand_array.py script on. Download the latest Linux miniconda installer user@login $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh Run the installer to install miniconda; you'll need to accept the license terms and you can use the default installation location: [user@login]$ sh Miniconda3-latest-Linux-x86_64.sh At the end, you can choose whether or not to \"initialize Miniconda3 by running conda init?\" The default is no; you would then run the eval command listed by the installer to \"activate\" Miniconda. If you choose \"no\" you'll want to save this command so that you can reactivate the Miniconda installation when needed in the future. Next we'll create our conda \"environment\" with numpy (we've called the environment \"py3-numpy\"): (base) [user@login]$ conda create -n py3-numpy (base) [user@login]$ conda activate py3-numpy (py3-numpy) [user@login]$ conda install -c conda-forge numpy Once everything is installed, deactivate the environment to go back to the Miniconda \"base\" environment. (py3-numpy) [user@login]$ conda deactivate We'll now install a tool that will pack up the just created conda environment so we can run it elsewhere. Make sure that your job's Miniconda environment is created, but deactivated, so that you're in the \"base\" Miniconda environment, then run: (base) [user@login]$ conda install -c conda-forge conda-pack Enter y when it asks you to install. Finally, we will run the conda pack command, which will automatically create a tar.gz file with our environment: (base) [user@login]$ conda pack -n py3-numpy Submit a Job \u00b6 The executable for this job will need to be a wrapper script. What steps do you think need to be included? Write down a rough draft, then compare with the following script. Create a wrapper script like the following: #!/bin/bash set -e export PATH mkdir py3-numpy tar -xzf py3-numpy.tar.gz -C py3-numpy . py3-numpy/bin/activate python3 rand_array.py What needs to be included in your submit file for the job to run successfully? Try yourself and then check the suggestions in the next point. 
In your submit file, make sure to have the following: Your executable should be the the bash script you created in the previous step. Remember to transfer your Python script and the environment tar.gz file via transfer_input_files . Submit the job and see what happens!","title":"5.3 - Using Conda Environments"},{"location":"materials/software/part5-ex3-conda/#software-exercise-53-using-conda-environments","text":"Objective : Create a portable conda environment and use it in a job. Why learn this? : If you normally use conda to manage your Python environments, this method of software portability offers great similarity to your usual practices.","title":"Software Exercise 5.3: Using Conda Environments"},{"location":"materials/software/part5-ex3-conda/#introduction","text":"Many Python users manage their Python installation and environments with either the Anaconda or miniconda distributions. These distribution tools are great for creating portable Python installations and can be used on HTC systems with some help from a tool called conda pack .","title":"Introduction"},{"location":"materials/software/part5-ex3-conda/#sample-script","text":"For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library.","title":"Sample Script"},{"location":"materials/software/part5-ex3-conda/#create-and-pack-a-conda-environment","text":"(For a generic version of these instructions, see the CHTC User Guide ) Our first step is to create a miniconda installation on the submit server. You should be logged into whichever server you made the rand_array.py script on. Download the latest Linux miniconda installer user@login $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh Run the installer to install miniconda; you'll need to accept the license terms and you can use the default installation location: [user@login]$ sh Miniconda3-latest-Linux-x86_64.sh At the end, you can choose whether or not to \"initialize Miniconda3 by running conda init?\" The default is no; you would then run the eval command listed by the installer to \"activate\" Miniconda. If you choose \"no\" you'll want to save this command so that you can reactivate the Miniconda installation when needed in the future. Next we'll create our conda \"environment\" with numpy (we've called the environment \"py3-numpy\"): (base) [user@login]$ conda create -n py3-numpy (base) [user@login]$ conda activate py3-numpy (py3-numpy) [user@login]$ conda install -c conda-forge numpy Once everything is installed, deactivate the environment to go back to the Miniconda \"base\" environment. (py3-numpy) [user@login]$ conda deactivate We'll now install a tool that will pack up the just created conda environment so we can run it elsewhere. Make sure that your job's Miniconda environment is created, but deactivated, so that you're in the \"base\" Miniconda environment, then run: (base) [user@login]$ conda install -c conda-forge conda-pack Enter y when it asks you to install. Finally, we will run the conda pack command, which will automatically create a tar.gz file with our environment: (base) [user@login]$ conda pack -n py3-numpy","title":"Create and Pack a Conda Environment"},{"location":"materials/software/part5-ex3-conda/#submit-a-job","text":"The executable for this job will need to be a wrapper script. What steps do you think need to be included? 
Write down a rough draft, then compare with the following script. Create a wrapper script like the following: #!/bin/bash set -e export PATH mkdir py3-numpy tar -xzf py3-numpy.tar.gz -C py3-numpy . py3-numpy/bin/activate python3 rand_array.py What needs to be included in your submit file for the job to run successfully? Try yourself and then check the suggestions in the next point. In your submit file, make sure to have the following: Your executable should be the the bash script you created in the previous step. Remember to transfer your Python script and the environment tar.gz file via transfer_input_files . Submit the job and see what happens!","title":"Submit a Job"},{"location":"materials/software/part5-ex4-compiling/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 5.4: Compile Statically Linked Code \u00b6 Objective : Compile code using static linking, explain why this can be useful. Why learn this? : When code is compiled, it is usually linked to other pieces of code on the computer. This can cause it to not work when moved to other computers. Static linking means that all the needed references are included in the compiled code, meaning that it can run almost anywhere. Our Software Example \u00b6 For this compiling example, we will use a script written in C. C code depends on libraries and therefore will benefit from being statically linked. Our C code prints 7 rows of Pascal's triangle. Log into the Access Point. Create a directory for this exercise and cd into it. Copy and paste the following code into a file named pascal.c . #include \"stdio.h\" long factorial ( int ); int main () { int i , n , c ; n = 7 ; for ( i = 0 ; i < n ; i ++ ){ for ( c = 0 ; c <= ( n - i - 2 ); c ++ ) printf ( \" \" ); for ( c = 0 ; c <= i ; c ++ ) printf ( \"%ld \" , factorial ( i ) / ( factorial ( c ) * factorial ( i - c ))); printf ( \" \\n \" ); } return 0 ; } long factorial ( int n ) { int c ; long result = 1 ; for ( c = 1 ; c <= n ; c ++ ) result = result * c ; return result ; } Compiling \u00b6 In order to use this code in a job, we will first need to statically compile the code. Most linux servers (including our Access Point) have the gcc (GNU compiler collection) installed, so we already have a compiler on the Access Point. Furthermore, this is a simple piece of C code, so the compilation will not be computationally intensive. Thus, we should be able to compile directly on the Access Point Compile the code, using the command: username@login $ gcc -static pascal.c -o pascal Note that we have added the -static option to make sure that the compiled binary includes the necessary libraries. This will allow the code to run on any Linux machine, no matter where those libraries are located. Verify that the compiled binary was statically linked: username@login $ file pascal The Linux file command provides information about the type or kind of file that is given as an argument. In this case, you should get output like this: username@host $ file pascal pascal: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, not stripped The output clearly states that this executable (software) is statically linked. The same command run on a non-statically linked executable file would include the text dynamically linked (uses shared libs) instead. 
So with this simple verification step, which could even be run on files that you did not compile yourself, you have some further reassurance that it is safe to use on other Linux machines. (Bonus exercise: Try the file command on lots of other files) Submit the Job \u00b6 Now that our code is compiled, we can use it to submit a job. Think about what submit file lines we need to use to run this job: Are there input files? Are there command line arguments? Where is its output written? Based on what you thought about in 1., find a submit file from earlier that you can modify to run our compiled pascal code. Copy it to the directory with the pascal binary and make those changes. Submit the job using condor_submit . Once the job has run and left the queue, you should be able to see the results (seven rows of Pascal's triangle) in the .out file created by the job.","title":"5.4 - Compiling and Running a Simple Code"},{"location":"materials/software/part5-ex4-compiling/#software-exercise-54-compile-statically-linked-code","text":"Objective : Compile code using static linking, explain why this can be useful. Why learn this? : When code is compiled, it is usually linked to other pieces of code on the computer. This can cause it to not work when moved to other computers. Static linking means that all the needed references are included in the compiled code, meaning that it can run almost anywhere.","title":"Software Exercise 5.4: Compile Statically Linked Code"},{"location":"materials/software/part5-ex4-compiling/#our-software-example","text":"For this compiling example, we will use a script written in C. C code depends on libraries and therefore will benefit from being statically linked. Our C code prints 7 rows of Pascal's triangle. Log into the Access Point. Create a directory for this exercise and cd into it. Copy and paste the following code into a file named pascal.c . #include \"stdio.h\" long factorial ( int ); int main () { int i , n , c ; n = 7 ; for ( i = 0 ; i < n ; i ++ ){ for ( c = 0 ; c <= ( n - i - 2 ); c ++ ) printf ( \" \" ); for ( c = 0 ; c <= i ; c ++ ) printf ( \"%ld \" , factorial ( i ) / ( factorial ( c ) * factorial ( i - c ))); printf ( \" \\n \" ); } return 0 ; } long factorial ( int n ) { int c ; long result = 1 ; for ( c = 1 ; c <= n ; c ++ ) result = result * c ; return result ; }","title":"Our Software Example"},{"location":"materials/software/part5-ex4-compiling/#compiling","text":"In order to use this code in a job, we will first need to statically compile the code. Most linux servers (including our Access Point) have the gcc (GNU compiler collection) installed, so we already have a compiler on the Access Point. Furthermore, this is a simple piece of C code, so the compilation will not be computationally intensive. Thus, we should be able to compile directly on the Access Point Compile the code, using the command: username@login $ gcc -static pascal.c -o pascal Note that we have added the -static option to make sure that the compiled binary includes the necessary libraries. This will allow the code to run on any Linux machine, no matter where those libraries are located. Verify that the compiled binary was statically linked: username@login $ file pascal The Linux file command provides information about the type or kind of file that is given as an argument. 
In this case, you should get output like this: username@host $ file pascal pascal: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, not stripped The output clearly states that this executable (software) is statically linked. The same command run on a non-statically linked executable file would include the text dynamically linked (uses shared libs) instead. So with this simple verification step, which could even be run on files that you did not compile yourself, you have some further reassurance that it is safe to use on other Linux machines. (Bonus exercise: Try the file command on lots of other files)","title":"Compiling"},{"location":"materials/software/part5-ex4-compiling/#submit-the-job","text":"Now that our code is compiled, we can use it to submit a job. Think about what submit file lines we need to use to run this job: Are there input files? Are there command line arguments? Where is its output written? Based on what you thought about in 1., find a submit file from earlier that you can modify to run our compiled pascal code. Copy it to the directory with the pascal binary and make those changes. Submit the job using condor_submit . Once the job has run and left the queue, you should be able to see the results (seven rows of Pascal's triangle) in the .out file created by the job.","title":"Submit the Job"},{"location":"materials/special/part1-ex1-gpus/","text":"Exercise 1.1: GPUs \u00b6 Exploring Availability \u00b6 For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Let's first explore what GPUs are available in the OSPool. Remember that the pool is dynamic - resources are beeing added and removed all the time - but we can at least find out what the current set of GPUs are there. Run: user@ap40 $ condor_status -const 'GPUs > 0' Once you have that list, pick one of the resources and look at the classad using the -l flag. For example: user@ap40 $ condor_status -l [ MACHINE ] Using the -autoformat flag, explore the different attributes of the GPUs. Some interesting attributes might be GPUs_DeviceName , GPUs_Capability , GLIDEIN_Site and GLIDEIN_ResourceName . Compare the Mips number of a GPU slot with a regular slot. Does the Mips number indicate that GPUs can be much faster than CPUs? Why/why not? A sample GPU job \u00b6 Create a file named mytf.py and chmod it to be executable. The content is a sample TensorFlow code: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 #!/usr/bin/python3 # http://learningtensorflow.com/lesson10/ import sys import numpy as np import tensorflow as tf from datetime import datetime tf . debugging . set_log_device_placement ( True ) # Create some tensors a = tf . constant ([[ 1.0 , 2.0 , 3.0 ], [ 4.0 , 5.0 , 6.0 ]]) b = tf . constant ([[ 1.0 , 2.0 ], [ 3.0 , 4.0 ], [ 5.0 , 6.0 ]]) c = tf . matmul ( a , b ) print ( c ) Then, create a submit file to run the code on a GPU, using a TensorFlow container image. The new bits of the submit file is provided below, but you will have to fill in the rest from what you have learnt earlier in the User School. universe = container container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1 executable = mytf.py request_gpus = 1 Note that TensorFlow also require the AVX2 CPU extensions. Remember that AVX2 is available in the x86_64-v3 and x86_64-v4 micro architectures. 
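For reference, a minimal sketch of what such a constraint could look like in the submit file (an illustration only; it uses the Microarch machine attribute named in the next step): requirements = (Microarch == \"x86_64-v3\") || (Microarch == \"x86_64-v4\")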
Add a requirements line stating that Microarch has to be one of those two (the operator for or in classad expressions is || ) Submit the job and watch the queue. Did the job start running as quickly as when we ran CPU jobs? Why/why not? Examine the out/err files. Does it indicate somewhere that the job was mapped to a GPU? (Hint: search for Created TensorFlow device ) Keep a copy of the out/err. Modify the submit file to not run on a GPU, and then try the job again. Did the job work? Does the err from the CPU job look anything like the GPU err?\",\"title\":\"1.1 - GPUs\"},{\"location\":\"materials/special/part1-ex1-gpus/#exercise-11-gpus\",\"text\":\"\",\"title\":\"Exercise 1.1: GPUs\"},{\"location\":\"materials/special/part1-ex1-gpus/#exploring-availability\",\"text\":\"For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Let's first explore what GPUs are available in the OSPool. Remember that the pool is dynamic - resources are being added and removed all the time - but we can at least find out what the current set of GPUs is. Run: user@ap40 $ condor_status -const 'GPUs > 0' Once you have that list, pick one of the resources and look at the classad using the -l flag. For example: user@ap40 $ condor_status -l [ MACHINE ] Using the -autoformat flag, explore the different attributes of the GPUs. Some interesting attributes might be GPUs_DeviceName , GPUs_Capability , GLIDEIN_Site and GLIDEIN_ResourceName . Compare the Mips number of a GPU slot with a regular slot. Does the Mips number indicate that GPUs can be much faster than CPUs? Why/why not?\",\"title\":\"Exploring Availability\"},{\"location\":\"materials/special/part1-ex1-gpus/#a-sample-gpu-job\",\"text\":\"Create a file named mytf.py and chmod it to be executable. The content is a sample TensorFlow code: #!/usr/bin/python3 # http://learningtensorflow.com/lesson10/ import sys import numpy as np import tensorflow as tf from datetime import datetime tf . debugging . set_log_device_placement ( True ) # Create some tensors a = tf . constant ([[ 1.0 , 2.0 , 3.0 ], [ 4.0 , 5.0 , 6.0 ]]) b = tf . constant ([[ 1.0 , 2.0 ], [ 3.0 , 4.0 ], [ 5.0 , 6.0 ]]) c = tf . matmul ( a , b ) print ( c ) Then, create a submit file to run the code on a GPU, using a TensorFlow container image. The new bits of the submit file are provided below, but you will have to fill in the rest from what you have learnt earlier in the User School. universe = container container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1 executable = mytf.py request_gpus = 1 Note that TensorFlow also requires the AVX2 CPU extensions. Remember that AVX2 is available in the x86_64-v3 and x86_64-v4 micro architectures.
Does the err from the CPU job look anything like the GPU err?","title":"A sample GPU job"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/","text":"OSG Exercise 2.1: Troubleshooting Jobs \u00b6 The goal of this exercise is to practice troubleshooting some common problems that you may encounter when submitting jobs using HTCondor. This exercise should work on either of the access points- OSPool or Path Facility Note: This exercise is a little harder than some others. To complete it, you will have to find and fix several issues. Be patient, keep trying, but if you really get stuck, you can ask for help or look at the very bottom of this page for a link to answers. But try not to look at the answers! Acquiring the Materials \u00b6 We have prepared some Python code, data, and submit files for this exercise: Log into an Access Point Download a tarball of the materials: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting.tar.gz Extract the tarball using the commands that you learned earlier Change into the newly extracted directory and explore its contents \u2014 resist the temptation to fix things right away! Solving a Project Euler Problem \u00b6 The contents of the tarball contain a series of submit files, Python scripts, and an input file that are designed to solve Project Euler problem 98 : By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter. Using p098_words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself). What is the largest square number formed by any member of such a pair? NOTE: All anagrams formed must be contained in the given text file. Unfortunately, there are many issues with the submit files that you will have to work through before you can obtain the solution to the problem! The code in the Python scripts themselves is, in theory, free of bugs. Finding anagrams \u00b6 The first step in our workflow takes an input file with a list of words ( p098_words.txt ) and extracts all of the anagrams using the find_anagrams.py script. Naturally, we want to run this as an HTCondor job, so: Submit the accompanying find-anagrams.sub file from the tarball. Resolve any issues that you encounter until the job returns pairs of anagrams as its output. Once you have satisfactory output, move onto the next section. Please be polite Access points are shared resources, so you should clean up after yourself. If you discover any jobs in the Hold state, and after you are done troubleshooting them, remove them with the following command: user@server $ condor_rm -const 'JobStatus =?= 5' Where replacing with... Will remove... Your username (e.g. blin ) All of your held jobs A cluster ID (e.g. 74078 ) All held jobs matching the given cluster ID A job ID (e.g. 97932.30 ) That specific held job Finding the largest square \u00b6 The next step in the workflow uses the max_square.py script to find the largest square number, if any, for a given anagram word pair. 
Let's submit jobs that run max_square.py for all of the anagram word pairs (i.e., one job per word pair), that you found in the previous section: Submit the accompanying squares.sub file from the tarball Resolve any issues that you encounter until you receive output for each job. Note that some jobs may have empty output since not all anagram word pairs are square anagram word pairs. Next, you can find the largest square among your output by directly using the command line. For example, if all of your job output has been placed in the squares directory and are named square-1.out , square-2.out , etc., then you could run the following command to find the largest square: user@server $ cat squares/square-*.out | sort -n | tail -n 1 You can check if you have the right answer with any of the OSG staff or by submitting the answer to Project Euler (requires an account). Answer Key \u00b6 There is also a working solution on our web server that can be retrieved with user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting-key.tar.gz It contains comments labeled SOLUTION that you can consult in case you get stuck. Like any answer key, it is mainly useful as a verification tool, so try to only use it as a last resort or for detailed explanations to improve your understanding.","title":"1.1 - Troubleshooting Jobs"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#osg-exercise-21-troubleshooting-jobs","text":"The goal of this exercise is to practice troubleshooting some common problems that you may encounter when submitting jobs using HTCondor. This exercise should work on either of the access points- OSPool or Path Facility Note: This exercise is a little harder than some others. To complete it, you will have to find and fix several issues. Be patient, keep trying, but if you really get stuck, you can ask for help or look at the very bottom of this page for a link to answers. But try not to look at the answers!","title":"OSG Exercise 2.1: Troubleshooting Jobs"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#acquiring-the-materials","text":"We have prepared some Python code, data, and submit files for this exercise: Log into an Access Point Download a tarball of the materials: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting.tar.gz Extract the tarball using the commands that you learned earlier Change into the newly extracted directory and explore its contents \u2014 resist the temptation to fix things right away!","title":"Acquiring the Materials"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#solving-a-project-euler-problem","text":"The contents of the tarball contain a series of submit files, Python scripts, and an input file that are designed to solve Project Euler problem 98 : By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter. Using p098_words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself). What is the largest square number formed by any member of such a pair? 
NOTE: All anagrams formed must be contained in the given text file. Unfortunately, there are many issues with the submit files that you will have to work through before you can obtain the solution to the problem! The code in the Python scripts themselves is, in theory, free of bugs.","title":"Solving a Project Euler Problem"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#finding-anagrams","text":"The first step in our workflow takes an input file with a list of words ( p098_words.txt ) and extracts all of the anagrams using the find_anagrams.py script. Naturally, we want to run this as an HTCondor job, so: Submit the accompanying find-anagrams.sub file from the tarball. Resolve any issues that you encounter until the job returns pairs of anagrams as its output. Once you have satisfactory output, move onto the next section. Please be polite Access points are shared resources, so you should clean up after yourself. If you discover any jobs in the Hold state, and after you are done troubleshooting them, remove them with the following command: user@server $ condor_rm -const 'JobStatus =?= 5' Where replacing with... Will remove... Your username (e.g. blin ) All of your held jobs A cluster ID (e.g. 74078 ) All held jobs matching the given cluster ID A job ID (e.g. 97932.30 ) That specific held job","title":"Finding anagrams"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#finding-the-largest-square","text":"The next step in the workflow uses the max_square.py script to find the largest square number, if any, for a given anagram word pair. Let's submit jobs that run max_square.py for all of the anagram word pairs (i.e., one job per word pair), that you found in the previous section: Submit the accompanying squares.sub file from the tarball Resolve any issues that you encounter until you receive output for each job. Note that some jobs may have empty output since not all anagram word pairs are square anagram word pairs. Next, you can find the largest square among your output by directly using the command line. For example, if all of your job output has been placed in the squares directory and are named square-1.out , square-2.out , etc., then you could run the following command to find the largest square: user@server $ cat squares/square-*.out | sort -n | tail -n 1 You can check if you have the right answer with any of the OSG staff or by submitting the answer to Project Euler (requires an account).","title":"Finding the largest square"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#answer-key","text":"There is also a working solution on our web server that can be retrieved with user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting-key.tar.gz It contains comments labeled SOLUTION that you can consult in case you get stuck. Like any answer key, it is mainly useful as a verification tool, so try to only use it as a last resort or for detailed explanations to improve your understanding.","title":"Answer Key"},{"location":"materials/troubleshooting/part1-ex2-job-retry/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Exercise 1.2: Retries \u00b6 The goal of this exercise is to demonstrate running a job that intermittently fails and thus could benefit from having HTCondor automatically retry it. This first part of the exercise should take only a few minutes, and is designed to setup the next exercises. 
Bad Job \u00b6 Let\u2019s assume that a colleague has shared with you a program, and it fails once in a while. In the real world, we would probably just fix the program, but what if you cannot change the software? Unfortunately, this situation happens more often than we would like. Below is a Python script that fails once in a while. We will not fix it, but instead use it to simulate a program that can fail and that we cannot fix. #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate a runtime error if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. In a new directory for this exercise, save the script above as murphy.py . Write a submit file for the script; queue 20 instances of the job and be sure to ask for 20 MB of memory and disk. Submit the file, note the ClusterId, and wait for the jobs to finish. What output do you expect? What output did you get? If you are curious about the exit code from the job, it is saved in completed jobs in condor_history in the ExitCode attribute. The following command will show the ExitCode for a given cluster of jobs: user@server $ condor_history -af:h ProcId ExitCode (Be sure to replace with your actual cluster ID. The command may take a minute or so to complete.) How many of the jobs succeeded? How many failed? Retrying Failed Jobs \u00b6 Now let\u2019s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption. From the lecture materials, implement the max_retries feature to retry any job with a non-zero exit code up to 5 times, then resubmit the jobs. Did your change work? After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the NumJobStarts attribute of a completed job with the condor_history command, in the same way you looked at the ExitCode attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their ExitCode ; and what about the ExitCode from earlier execution attempts? A (Too) Long Running Job \u00b6 Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file: #!/usr/bin/env python3 # murphy.py simulate a real program with real problems import random import sys import time # For one out of every three attempts, simulate an \"infinite\" loop if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output time . sleep ( 3600 ) sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. Save the script to a new file named murphy2.py . 
Copy your previous submit file to a new name and change the executable to murphy2.py . If you like, submit the new file \u2014 but after a while be sure to remove the whole cluster to clear out the \u201chung\u201d jobs. Now try to change the submit file to automatically remove any jobs that run for more than one minute. You can make this change with just a single line in your submit file periodic_remove = (JobStatus == 2) && ( (CurrentTime - EnteredCurrentStatus) > 60 ) Submit the new file. Do the long running jobs get removed? What does condor_history show for the cluster after all jobs are done? Which job status (i.e. idle, held, running) do you think JobStatus == 2 corresponds to? Bonus Exercise \u00b6 If you have time, edit your submit file so that instead of removing long running jobs, HTCondor will automatically put the long-running job on hold, and then automatically release it.","title":"1.2 - Job Retry"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#exercise-12-retries","text":"The goal of this exercise is to demonstrate running a job that intermittently fails and thus could benefit from having HTCondor automatically retry it. This first part of the exercise should take only a few minutes, and is designed to setup the next exercises.","title":"Exercise 1.2: Retries"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#bad-job","text":"Let\u2019s assume that a colleague has shared with you a program, and it fails once in a while. In the real world, we would probably just fix the program, but what if you cannot change the software? Unfortunately, this situation happens more often than we would like. Below is a Python script that fails once in a while. We will not fix it, but instead use it to simulate a program that can fail and that we cannot fix. #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate a runtime error if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. In a new directory for this exercise, save the script above as murphy.py . Write a submit file for the script; queue 20 instances of the job and be sure to ask for 20 MB of memory and disk. Submit the file, note the ClusterId, and wait for the jobs to finish. What output do you expect? What output did you get? If you are curious about the exit code from the job, it is saved in completed jobs in condor_history in the ExitCode attribute. The following command will show the ExitCode for a given cluster of jobs: user@server $ condor_history -af:h ProcId ExitCode (Be sure to replace with your actual cluster ID. The command may take a minute or so to complete.) How many of the jobs succeeded? How many failed?","title":"Bad Job"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#retrying-failed-jobs","text":"Now let\u2019s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption. From the lecture materials, implement the max_retries feature to retry any job with a non-zero exit code up to 5 times, then resubmit the jobs. Did your change work? 
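If you need a reminder of the syntax, the change can be a single submit-file line (a minimal sketch; when max_retries is set, HTCondor by default treats a non-zero exit code as a failure and retries the job): max_retries = 5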
After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the NumJobStarts attribute of a completed job with the condor_history command, in the same way you looked at the ExitCode attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their ExitCode ; and what about the ExitCode from earlier execution attempts?\",\"title\":\"Retrying Failed Jobs\"},{\"location\":\"materials/troubleshooting/part1-ex2-job-retry/#a-too-long-running-job\",\"text\":\"Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file: #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate an \"infinite\" loop if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output time . sleep ( 3600 ) sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. Save the script to a new file named murphy2.py . Copy your previous submit file to a new name and change the executable to murphy2.py . If you like, submit the new file \u2014 but after a while be sure to remove the whole cluster to clear out the \u201chung\u201d jobs. Now try to change the submit file to automatically remove any jobs that run for more than one minute. You can make this change with just a single line in your submit file: periodic_remove = (JobStatus == 2) && ( (CurrentTime - EnteredCurrentStatus) > 60 ) Submit the new file. Do the long-running jobs get removed? What does condor_history show for the cluster after all jobs are done? Which job status (i.e. idle, held, running) do you think JobStatus == 2 corresponds to?\",\"title\":\"A (Too) Long Running Job\"},{\"location\":\"materials/troubleshooting/part1-ex2-job-retry/#bonus-exercise\",\"text\":\"If you have time, edit your submit file so that instead of removing long-running jobs, HTCondor will automatically put the long-running job on hold, and then automatically release it.\",\"title\":\"Bonus Exercise\"},{\"location\":\"materials/workflows/part1-ex1-simple-dag/\",\"text\":\"Workflows Exercise 1.1: Coordinating a Set of Jobs With a Simple DAG \u00b6 The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job. What is DAGMan? \u00b6 In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep but before the sweep runs you need to prepare your data. After the sweep runs, you need to collate the results. This might look like this, assuming you want to sweep over five parameters: DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found in the HTCondor manual .
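As a rough illustration, the sweep just described could be expressed as a DAG file along these lines (a sketch with hypothetical node and submit-file names): JOB prepare prepare.sub JOB sweep1 sweep1.sub JOB sweep2 sweep2.sub JOB sweep3 sweep3.sub JOB sweep4 sweep4.sub JOB sweep5 sweep5.sub JOB collate collate.sub PARENT prepare CHILD sweep1 sweep2 sweep3 sweep4 sweep5 PARENT sweep1 sweep2 sweep3 sweep4 sweep5 CHILD collate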
Submitting a Simple DAG \u00b6 For our job, we will return briefly to the sleep program, name it job.sub executable = /bin/sleep arguments = 4 log = simple.log output = simple.out error = simple.error request_memory = 1GB request_disk = 1GB request_cpus = 1 queue We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does. First we will create the most minimal DAG that can be created: a DAG with just one node. Put this into a file named simple.dag . JOB Simple job.sub In your first window, submit the DAG: username@ap40 $ condor_submit_dag simple.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : simple.dag.condor.sub Log of DAGMan debugging messages : simple.dag.dagman.out Log of Condor library output : simple.dag.lib.out Log of Condor library error messages : simple.dag.lib.err Log of the life of condor_dagman itself : simple.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 61. ----------------------------------------------------------------------- In the second window, check the queue (what you see may be slightly different): username@ap40 $ condor_q -nobatch -wide:80 -- Submitter: learn.chtc.wisc.edu : <128.104.100.55:9618?sock=28867_10e4_2> : learn.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 61.0 roy 6/21 22:51 0+00:03:47 R 0 0.3 condor_dagman 62.0 roy 6/21 22:51 0+00:00:03 R 0 0.7 simple 4 10 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended In the third window, watch what DAGMan does (what you see may be slightly different): username@ap40 $ tail -f --lines = 500 simple.dag.dagman.out 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP 08/02/24 15:44:57 ** /usr/bin/condor_dagman 08/02/24 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) 08/02/24 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT 08/02/24 15:44:57 ** $CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ 08/02/24 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ 08/02/24 15:44:57 ** PID = 2340103 08/02/24 15:44:57 ** Log last touched time unavailable (No such file or directory) 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS 08/02/24 15:44:57 DaemonCore: No command port requested. 
08/02/24 15:44:57 DAGMAN_USE_STRICT setting: 1 08/02/24 15:44:57 DAGMAN_VERBOSITY setting: 3 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 08/02/24 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 08/02/24 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False 08/02/24 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 08/02/24 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False 08/02/24 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 08/02/24 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 08/02/24 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 08/02/24 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True 08/02/24 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 08/02/24 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True 08/02/24 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False 08/02/24 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 08/02/24 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 08/02/24 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True 08/02/24 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False 08/02/24 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False 08/02/24 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit 08/02/24 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False 08/02/24 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True 08/02/24 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 08/02/24 15:44:57 DAGMAN_AUTO_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 08/02/24 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log 08/02/24 15:44:57 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True 08/02/24 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 08/02/24 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 08/02/24 15:44:57 ALL_DEBUG setting: 08/02/24 15:44:57 DAGMAN_DEBUG setting: 08/02/24 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False 08/02/24 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True 08/02/24 15:44:57 DAGMAN will adjust edges after parsing 08/02/24 15:44:57 argv[0] == \"condor_scheduniv_exec.271100.0\" 08/02/24 15:44:57 argv[1] == \"-Lockfile\" 08/02/24 15:44:57 argv[2] == \"simple.dag.lock\" 08/02/24 15:44:57 argv[3] == \"-AutoRescue\" 08/02/24 15:44:57 argv[4] == \"1\" 08/02/24 15:44:57 argv[5] == \"-DoRescueFrom\" 08/02/24 15:44:57 argv[6] == \"0\" 08/02/24 15:44:57 argv[7] == \"-Dag\" 08/02/24 15:44:57 argv[8] == \"simple.dag\" 08/02/24 15:44:57 argv[9] == \"-Suppress_notification\" 08/02/24 15:44:57 argv[10] == \"-CsdVersion\" 08/02/24 15:44:57 argv[11] == \"$CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $\" 08/02/24 15:44:57 argv[12] == \"-Dagman\" 08/02/24 15:44:57 argv[13] == \"/usr/bin/condor_dagman\" 08/02/24 15:44:57 Default node log file is: 08/02/24 15:44:57 DAG Lockfile will be written to simple.dag.lock 08/02/24 15:44:57 DAG Input file is simple.dag 08/02/24 15:44:57 Parsing 1 dagfiles 08/02/24 15:44:57 Parsing simple.dag ... 08/02/24 15:44:57 Adjusting edges 08/02/24 15:44:57 Dag contains 1 total jobs 08/02/24 15:44:57 Bootstrapping... 
08/02/24 15:44:57 Number of pre-completed nodes: 0 08/02/24 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log 08/02/24 15:44:57 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:44:57 Of 1 nodes total: 08/02/24 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:44:57 === === === === === === === === 08/02/24 15:44:57 0 0 0 0 1 0 0 0 08/02/24 15:44:57 0 job proc(s) currently held 08/02/24 15:44:57 Registering condor_event_timer... 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... Here's where the job is submitted 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... 08/02/24 15:44:58 Submitting node Simple from file job.sub using direct job submission 08/02/24 15:44:58 assigned HTCondor ID (271101.0.0) 08/02/24 15:44:58 Just submitted 1 job this cycle... Here's where DAGMan noticed that the job is running 08/02/24 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:14} 08/02/24 15:45:18 Number of idle job procs: 0 Here's where DAGMan noticed that the job finished. 08/02/24 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:19} 08/02/24 15:45:23 Number of idle job procs: 0 08/02/24 15:45:23 Node Simple job proc (271101.0.0) completed successfully. 08/02/24 15:45:23 Node Simple job completed 08/02/24 15:45:23 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:45:23 Of 1 nodes total: 08/02/24 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:45:23 === === === === === === === === 08/02/24 15:45:23 1 0 0 0 0 0 0 0 Here's where DAGMan noticed that all the work is done. 08/02/24 15:45:23 All jobs Completed! 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) 08/02/24 15:45:23 Note: 0 total job deferrals because of node category throttles 08/02/24 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER Now verify your results: username@ap40 $ cat simple.log 000 (271101.000.000) 2024-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> DAG Node: Simple ... 040 (271101.000.000) 2024-08-02 15:45:13 Started transferring input files Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> ... 040 (271101.000.000) 2024-08-02 15:45:13 Finished transferring input files ... 021 (271101.000.000) 2024-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 ... 
001 (271101.000.000) 2024-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu CondorScratchDir = \"/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113\" Cpus = 1 Disk = 2699079 GLIDEIN_ResourceName = \"PSU-LIGO\" Memory = 1024 ... 006 (271101.000.000) 2024-08-02 15:45:19 Image size of job updated: 2296464 47 - MemoryUsage of job (MB) 47684 - ResidentSetSize of job (KB) ... 040 (271101.000.000) 2024-08-02 15:45:19 Started transferring output files ... 040 (271101.000.000) 2024-08-02 15:45:19 Finished transferring output files ... 005 (271101.000.000) 2024-08-02 15:45:19 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 38416 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 38416 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 149 1048576 2699079 Memory (MB) : 47 1024 1024 Job terminated of its own accord at 2024-08-02T20:45:19Z with exit-code 0. ... Looking at DAGMan's various files, we see that DAGMan itself ran as a job (specifically, a \"scheduler\" universe job). username@ap40 $ ls simple.dag.* simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out username@ap40 $ cat simple.dag.condor.sub # Filename: simple.dag.condor.sub # Generated by condor_submit_dag simple.dag universe = scheduler executable = /usr/bin/condor_dagman getenv = CONDOR_CONFIG,_CONDOR_*,PATH,PYTHONPATH,PERL*,PEGASUS_*,TZ,HOME,USER,LANG,LC_ALL output = simple.dag.lib.out error = simple.dag.lib.err log = simple.dag.dagman.log remove_kill_sig = SIGUSR1 +OtherJobRemoveRequirements = \"DAGManJobId =?= $(cluster)\" # Note: default on_exit_remove expression: # ( ExitSignal = ? = 11 || ( ExitCode = ! = UNDEFINED && ExitCode > = 0 && ExitCode < = 2 )) # attempts to ensure that DAGMan is automatically # requeued by the schedd if it exits abnormally or # is killed ( e.g., during a reboot ) . on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False arguments = \"-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2024-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman\" environment = \"_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad\" queue If you want to clean up some of these files (you may not want to, at least not yet), run: username@ap40 $ rm simple.dag.* Challenge \u00b6 What is the scheduler universe? Why does DAGMan use it? Show hint HTCondor has several universes What would happen to your DAGMan workflow if the access point has to be rebooted? 
Jobs in the HTCondor queue are \"managed\" - they are always tracked, and restarted automatically if needed","title":"1.1 - A simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#workflows-exercise-11-coordinating-a-set-of-jobs-with-a-simple-dag","text":"The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job.","title":"Workflows Exercise 1.1: Coordinating a Set of Jobs With a Simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#what-is-dagman","text":"In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acylic graph. For example, you may wish to run a large parameter sweep but before the sweep run you need to prepare your data. After the sweep runs, you need to collate the results. This might look like this, assuming you want to sweep over five parameters: DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found at in the HTCondor manual .","title":"What is DAGMan?"},{"location":"materials/workflows/part1-ex1-simple-dag/#submitting-a-simple-dag","text":"For our job, we will return briefly to the sleep program, name it job.sub executable = /bin/sleep arguments = 4 log = simple.log output = simple.out error = simple.error request_memory = 1GB request_disk = 1GB request_cpus = 1 queue We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does. First we will create the most minimal DAG that can be created: a DAG with just one node. Put this into a file named simple.dag . JOB Simple job.sub In your first window, submit the DAG: username@ap40 $ condor_submit_dag simple.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : simple.dag.condor.sub Log of DAGMan debugging messages : simple.dag.dagman.out Log of Condor library output : simple.dag.lib.out Log of Condor library error messages : simple.dag.lib.err Log of the life of condor_dagman itself : simple.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 61. 
----------------------------------------------------------------------- In the second window, check the queue (what you see may be slightly different): username@ap40 $ condor_q -nobatch -wide:80 -- Submitter: learn.chtc.wisc.edu : <128.104.100.55:9618?sock=28867_10e4_2> : learn.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 61.0 roy 6/21 22:51 0+00:03:47 R 0 0.3 condor_dagman 62.0 roy 6/21 22:51 0+00:00:03 R 0 0.7 simple 4 10 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended In the third window, watch what DAGMan does (what you see may be slightly different): username@ap40 $ tail -f --lines = 500 simple.dag.dagman.out 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP 08/02/24 15:44:57 ** /usr/bin/condor_dagman 08/02/24 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) 08/02/24 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT 08/02/24 15:44:57 ** $CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ 08/02/24 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ 08/02/24 15:44:57 ** PID = 2340103 08/02/24 15:44:57 ** Log last touched time unavailable (No such file or directory) 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS 08/02/24 15:44:57 DaemonCore: No command port requested. 08/02/24 15:44:57 DAGMAN_USE_STRICT setting: 1 08/02/24 15:44:57 DAGMAN_VERBOSITY setting: 3 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 08/02/24 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 08/02/24 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False 08/02/24 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 08/02/24 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False 08/02/24 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 08/02/24 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 08/02/24 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 08/02/24 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True 08/02/24 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 08/02/24 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True 08/02/24 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False 08/02/24 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 08/02/24 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 08/02/24 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True 08/02/24 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False 08/02/24 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False 08/02/24 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit 08/02/24 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False 08/02/24 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True 08/02/24 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 08/02/24 15:44:57 DAGMAN_AUTO_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 08/02/24 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log 08/02/24 15:44:57 
DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True 08/02/24 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 08/02/24 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 08/02/24 15:44:57 ALL_DEBUG setting: 08/02/24 15:44:57 DAGMAN_DEBUG setting: 08/02/24 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False 08/02/24 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True 08/02/24 15:44:57 DAGMAN will adjust edges after parsing 08/02/24 15:44:57 argv[0] == \"condor_scheduniv_exec.271100.0\" 08/02/24 15:44:57 argv[1] == \"-Lockfile\" 08/02/24 15:44:57 argv[2] == \"simple.dag.lock\" 08/02/24 15:44:57 argv[3] == \"-AutoRescue\" 08/02/24 15:44:57 argv[4] == \"1\" 08/02/24 15:44:57 argv[5] == \"-DoRescueFrom\" 08/02/24 15:44:57 argv[6] == \"0\" 08/02/24 15:44:57 argv[7] == \"-Dag\" 08/02/24 15:44:57 argv[8] == \"simple.dag\" 08/02/24 15:44:57 argv[9] == \"-Suppress_notification\" 08/02/24 15:44:57 argv[10] == \"-CsdVersion\" 08/02/24 15:44:57 argv[11] == \"$CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $\" 08/02/24 15:44:57 argv[12] == \"-Dagman\" 08/02/24 15:44:57 argv[13] == \"/usr/bin/condor_dagman\" 08/02/24 15:44:57 Default node log file is: 08/02/24 15:44:57 DAG Lockfile will be written to simple.dag.lock 08/02/24 15:44:57 DAG Input file is simple.dag 08/02/24 15:44:57 Parsing 1 dagfiles 08/02/24 15:44:57 Parsing simple.dag ... 08/02/24 15:44:57 Adjusting edges 08/02/24 15:44:57 Dag contains 1 total jobs 08/02/24 15:44:57 Bootstrapping... 08/02/24 15:44:57 Number of pre-completed nodes: 0 08/02/24 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log 08/02/24 15:44:57 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:44:57 Of 1 nodes total: 08/02/24 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:44:57 === === === === === === === === 08/02/24 15:44:57 0 0 0 0 1 0 0 0 08/02/24 15:44:57 0 job proc(s) currently held 08/02/24 15:44:57 Registering condor_event_timer... 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... Here's where the job is submitted 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... 08/02/24 15:44:58 Submitting node Simple from file job.sub using direct job submission 08/02/24 15:44:58 assigned HTCondor ID (271101.0.0) 08/02/24 15:44:58 Just submitted 1 job this cycle... Here's where DAGMan noticed that the job is running 08/02/24 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:14} 08/02/24 15:45:18 Number of idle job procs: 0 Here's where DAGMan noticed that the job finished. 08/02/24 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:19} 08/02/24 15:45:23 Number of idle job procs: 0 08/02/24 15:45:23 Node Simple job proc (271101.0.0) completed successfully. 08/02/24 15:45:23 Node Simple job completed 08/02/24 15:45:23 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:45:23 Of 1 nodes total: 08/02/24 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:45:23 === === === === === === === === 08/02/24 15:45:23 1 0 0 0 0 0 0 0 Here's where DAGMan noticed that all the work is done. 08/02/24 15:45:23 All jobs Completed! 
08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) 08/02/24 15:45:23 Note: 0 total job deferrals because of node category throttles 08/02/24 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER Now verify your results: username@ap40 $ cat simple.log 000 (271101.000.000) 2024-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> DAG Node: Simple ... 040 (271101.000.000) 2024-08-02 15:45:13 Started transferring input files Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> ... 040 (271101.000.000) 2024-08-02 15:45:13 Finished transferring input files ... 021 (271101.000.000) 2024-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 ... 001 (271101.000.000) 2024-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu CondorScratchDir = \"/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113\" Cpus = 1 Disk = 2699079 GLIDEIN_ResourceName = \"PSU-LIGO\" Memory = 1024 ... 006 (271101.000.000) 2024-08-02 15:45:19 Image size of job updated: 2296464 47 - MemoryUsage of job (MB) 47684 - ResidentSetSize of job (KB) ... 040 (271101.000.000) 2024-08-02 15:45:19 Started transferring output files ... 040 (271101.000.000) 2024-08-02 15:45:19 Finished transferring output files ... 005 (271101.000.000) 2024-08-02 15:45:19 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 38416 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 38416 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 149 1048576 2699079 Memory (MB) : 47 1024 1024 Job terminated of its own accord at 2024-08-02T20:45:19Z with exit-code 0. ... Looking at DAGMan's various files, we see that DAGMan itself ran as a job (specifically, a \"scheduler\" universe job). 
username@ap40 $ ls simple.dag.* simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out username@ap40 $ cat simple.dag.condor.sub # Filename: simple.dag.condor.sub # Generated by condor_submit_dag simple.dag universe = scheduler executable = /usr/bin/condor_dagman getenv = CONDOR_CONFIG,_CONDOR_*,PATH,PYTHONPATH,PERL*,PEGASUS_*,TZ,HOME,USER,LANG,LC_ALL output = simple.dag.lib.out error = simple.dag.lib.err log = simple.dag.dagman.log remove_kill_sig = SIGUSR1 +OtherJobRemoveRequirements = \"DAGManJobId =?= $(cluster)\" # Note: default on_exit_remove expression: # ( ExitSignal = ? = 11 || ( ExitCode = ! = UNDEFINED && ExitCode > = 0 && ExitCode < = 2 )) # attempts to ensure that DAGMan is automatically # requeued by the schedd if it exits abnormally or # is killed ( e.g., during a reboot ) . on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False arguments = \"-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2024-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman\" environment = \"_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad\" queue If you want to clean up some of these files (you may not want to, at least not yet), run: username@ap40 $ rm simple.dag.*","title":"Submitting a Simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#challenge","text":"What is the scheduler universe? Why does DAGMan use it? Show hint HTCondor has several universes What would happen to your DAGMan workflow if the access point has to be rebooted? Jobs in the HTCondor queue are \"managed\" - they are always tracked, and restarted automatically if needed","title":"Challenge"},{"location":"materials/workflows/part1-ex2-mandelbrot/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Workflows Exercise 1.2: A Brief Detour Through the Mandelbrot Set \u00b6 Before we explore using DAGs to implement workflows, let\u2019s get a more interesting job. Let\u2019s make pretty pictures! We have a small program that draws pictures of the Mandelbrot set. You can read about the Mandelbrot set on Wikipedia , or you can simply appreciate the pretty pictures. It\u2019s a fractal. We have a simple program that can draw the Mandelbrot set. It's called goatbrot . Before beginning, ensure that you are connected to ap40.uw.osg-htc.org . Create a directory for this exercise and cd into it. Running goatbrot From the Command Line \u00b6 You can generate the Mandelbrot set as a quick test with two simple commands. Download the goatbrot executable: username@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/goatbrot username@ap40 $ chmod a+x goatbrot Generate a PPM image of the Mandelbrot set: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c 0 ,0 -w 3 -s 1000 ,1000 The goatbroat program takes several parameters. Let's break them down: -i 1000 The number of iterations. Bigger numbers generate more accurate images but are slower to run. -o tile_000000_000000.ppm The output file to generate. -c 0,0 The center point of the image. Here it is the point (0,0). -w 3 The width of the image. Here is 3. 
-s 1000,1000 The size of the final image. Here we generate a picture that is 1000 pixels wide and 1000 pixels tall. Convert the image to the JPEG format (using a built-in program called convert ): username@ap40 $ convert tile_000000_000000.ppm mandel.jpg Dividing the Work into Smaller Pieces \u00b6 The Mandelbrot set can take a while to create, particularly if you make the iterations large or the image size large. What if we broke the creation of the image into multiple invocations (an HTC approach!) then stitched them together? Once we do that, we can run each goatbroat in parallel in our cluster. Here's an example you can run by hand. Run goatbroat 4 times: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c -0.75,0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000001.ppm -c 0 .75,0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000000.ppm -c -0.75,-0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000001.ppm -c 0 .75,-0.75 -w 1 .5 -s 500 ,500 Stitch the small images together into the complete image (in JPEG format): username@ap40 $ montage tile_000000_000000.ppm tile_000000_000001.ppm tile_000001_000000.ppm tile_000001_000001.ppm -mode Concatenate -tile 2x2 mandel.jpg This will produce the same image as above. We divided the image space into a 2\u00d72 grid and ran goatbrot on each section of the grid. The built-in montage program stitches the files together and writes out the final image in JPEG format. View the Image! \u00b6 Run the commands above so that you have the Mandelbrot image. When you create the image, you might wonder how you can view it. Use scp or sftp to copy the mandel.jpg back to your computer to view it.","title":"1.2 - A brief detour through the Mandelbrot set"},{"location":"materials/workflows/part1-ex2-mandelbrot/#workflows-exercise-12-a-brief-detour-through-the-mandelbrot-set","text":"Before we explore using DAGs to implement workflows, let\u2019s get a more interesting job. Let\u2019s make pretty pictures! We have a small program that draws pictures of the Mandelbrot set. You can read about the Mandelbrot set on Wikipedia , or you can simply appreciate the pretty pictures. It\u2019s a fractal. We have a simple program that can draw the Mandelbrot set. It's called goatbrot . Before beginning, ensure that you are connected to ap40.uw.osg-htc.org . Create a directory for this exercise and cd into it.","title":"Workflows Exercise 1.2: A Brief Detour Through the Mandelbrot Set"},{"location":"materials/workflows/part1-ex2-mandelbrot/#running-goatbrot-from-the-command-line","text":"You can generate the Mandelbrot set as a quick test with two simple commands. Download the goatbrot executable: username@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/goatbrot username@ap40 $ chmod a+x goatbrot Generate a PPM image of the Mandelbrot set: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c 0 ,0 -w 3 -s 1000 ,1000 The goatbroat program takes several parameters. Let's break them down: -i 1000 The number of iterations. Bigger numbers generate more accurate images but are slower to run. -o tile_000000_000000.ppm The output file to generate. -c 0,0 The center point of the image. Here it is the point (0,0). -w 3 The width of the image. Here is 3. -s 1000,1000 The size of the final image. Here we generate a picture that is 1000 pixels wide and 1000 pixels tall. 
Convert the image to the JPEG format (using a built-in program called convert ): username@ap40 $ convert tile_000000_000000.ppm mandel.jpg","title":"Running goatbrot From the Command Line"},{"location":"materials/workflows/part1-ex2-mandelbrot/#dividing-the-work-into-smaller-pieces","text":"The Mandelbrot set can take a while to create, particularly if you make the iterations large or the image size large. What if we broke the creation of the image into multiple invocations (an HTC approach!) then stitched them together? Once we do that, we can run each goatbroat in parallel in our cluster. Here's an example you can run by hand. Run goatbroat 4 times: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c -0.75,0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000001.ppm -c 0 .75,0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000000.ppm -c -0.75,-0.75 -w 1 .5 -s 500 ,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000001.ppm -c 0 .75,-0.75 -w 1 .5 -s 500 ,500 Stitch the small images together into the complete image (in JPEG format): username@ap40 $ montage tile_000000_000000.ppm tile_000000_000001.ppm tile_000001_000000.ppm tile_000001_000001.ppm -mode Concatenate -tile 2x2 mandel.jpg This will produce the same image as above. We divided the image space into a 2\u00d72 grid and ran goatbrot on each section of the grid. The built-in montage program stitches the files together and writes out the final image in JPEG format.","title":"Dividing the Work into Smaller Pieces"},{"location":"materials/workflows/part1-ex2-mandelbrot/#view-the-image","text":"Run the commands above so that you have the Mandelbrot image. When you create the image, you might wonder how you can view it. Use scp or sftp to copy the mandel.jpg back to your computer to view it.","title":"View the Image!"},{"location":"materials/workflows/part1-ex3-complex-dag/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Workflows Exercise 1.3: A More Complex DAG \u00b6 The objective of this exercise is to run a real set of jobs with DAGMan. Make Your Job Submission Files \u00b6 We'll run our goatbrot example. If you didn't read about it yet, please do so now . We are going to make a DAG with four simultaneous jobs ( goatbrot ) and one final node to stitch them together ( montage ). This means we have five jobs. We're going to run goatbrot with more iterations (100,000) so each job will take longer to run. You can create your five jobs. The goatbrot jobs are very similar to each other, but they have slightly different parameters and output files. 
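The only real differences among those jobs are the center (`-c`) and output (`-o`) values, which follow directly from splitting the full image into a grid. As a hedged illustration (not part of the exercise, and generalizing the 2×2 split to any N), a small script like this could print the goatbrot command for each tile:

```bash
#!/bin/bash
# Illustrative sketch: print goatbrot commands for an N x N grid of tiles
# covering the same region as the single full-size image above
# (center 0,0 and width 3 in the complex plane).
N=2          # grid dimension; 2 reproduces the four commands in this exercise
FULL_W=3     # width of the full image in the complex plane
CX=0         # center of the full image
CY=0
TILE_W=$(awk -v w="$FULL_W" -v n="$N" 'BEGIN { print w / n }')
for ((row = 0; row < N; row++)); do
  for ((col = 0; col < N; col++)); do
    # Each tile center is offset from the top-left corner of the full image
    # by half a tile width, plus one tile width per row/column.
    tx=$(awk -v cx="$CX" -v w="$FULL_W" -v tw="$TILE_W" -v c="$col" \
         'BEGIN { print cx - w/2 + tw/2 + c * tw }')
    ty=$(awk -v cy="$CY" -v w="$FULL_W" -v tw="$TILE_W" -v r="$row" \
         'BEGIN { print cy + w/2 - tw/2 - r * tw }')
    echo "./goatbrot -i 100000 -o tile_${row}_${col}.ppm -c ${tx},${ty} -w ${TILE_W} -s 500,500"
  done
done
```

With N=2 this prints the same centers (±0.75, ±0.75) and width (1.5) that appear in the submit files below.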
goatbrot1.sub \u00b6 executable = goatbrot arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm log = goatbrot.log output = goatbrot.out.0.0 error = goatbrot.err.0.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot2.sub \u00b6 executable = goatbrot arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm log = goatbrot.log output = goatbrot.out.0.1 error = goatbrot.err.0.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot3.sub \u00b6 executable = goatbrot arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm log = goatbrot.log output = goatbrot.out.1.0 error = goatbrot.err.1.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot4.sub \u00b6 executable = goatbrot arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm log = goatbrot.log output = goatbrot.out.1.1 error = goatbrot.err.1.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue montage.sub \u00b6 You should notice that the transfer_input_files statement refers to the files created by the other jobs. +SingularityImage = \"/cvmfs/singularity.opensciencegrid.org/htc/rocky:9\" executable = montage.sh arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandel-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Notice that the job specified by montage.sub uses a container image, as indicated by the +SingularityImage flag. This is because montage uses libraries that are not installed on the execution nodes. We use a container with montage installed and call it using the executable montage.sh ; thus we will need to create the file montage.sh . #!/bin/bash # Pass all arguments to montage montage \"$@\" Make your DAG \u00b6 In a file called goatbrot.dag , you have your DAG specification: JOB g1 goatbrot1.sub JOB g2 goatbrot2.sub JOB g3 goatbrot3.sub JOB g4 goatbrot4.sub JOB montage montage.sub PARENT g1 g2 g3 g4 CHILD montage Ask yourself: do you know how we ensure that all the goatbrot commands can run simultaneously and all of them will complete before we run the montage job? Running the DAG \u00b6 Submit your DAG: username@learn $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 71. ----------------------------------------------------------------------- Watch Your DAG \u00b6 Let\u2019s follow the progress of the whole DAG: Use the condor_watch_q command to keep an eye on the running jobs. See more information about this tool here . 
username@learn $ condor_watch_q If you're quick enough, you may have seen DAGMan running as the lone job, before it submitted additional job nodes: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 - 1 222059.0 [=============================================================================] Total: 1 jobs; 1 running Updated at 2024-07-28 13:52:57 DAGMan has submitted the goatbrot jobs, but they haven't started running yet BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 4 1 - 5 222059.0 ... 222063.0 [===============--------------------------------------------------------------] Total: 5 jobs; 4 idle, 1 running Updated at 2024-07-28 13:53:53 They're running BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 5 - 5 222059.0 ... 222063.0 [=============================================================================] Total: 5 jobs; 5 running Updated at 2024-07-28 13:54:33 They finished, but DAGMan hasn't noticed yet. It only checks periodically: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 4 - 5 222059.0 ... 222063.0 [##############################################################===============] Total: 5 jobs; 4 completed, 1 running Updated at 2024-07-28 13:55:13 Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. Examine your results. For some reason, goatbrot prints everything to stderr, not stdout. username@learn $ cat goatbrot.err.0.0 Complex image: Center: -0.75 + 0.75i Width: 1.5 Height: 1.5 Upper Left: -1.5 + 1.5i Lower Right: 0 + 0i Output image: Filename: tile_0_0.ppm Width, Height: 500, 500 Theme: beej Antialiased: no Mandelbrot: Max Iterations: 100000 Continuous: no Goatbrot: Multithreading: not supported in this build Completed: 100.0% Examine your log files ( goatbrot.log and montage.log ) and DAGMan output file ( goatbrot.dag.dagman.out ). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file? As you did earlier, transfer the resulting mandel-from-dag.jpg to your computer so that you can view the image. Does the image look correct? Clean up your results by removing all of the goatbrot.dag.* files if you like. Be careful to not delete the goatbrot.dag file. Bonus Challenge \u00b6 Re-run your DAG. When jobs are running, try condor_q -nobatch -dag . What does it do differently? Challenge, if you have time: Make a bigger DAG by making more tiles in the same area.","title":"1.3 - A more complex DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#workflows-exercise-13-a-more-complex-dag","text":"The objective of this exercise is to run a real set of jobs with DAGMan.","title":"Workflows Exercise 1.3: A More Complex DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#make-your-job-submission-files","text":"We'll run our goatbrot example. If you didn't read about it yet, please do so now . We are going to make a DAG with four simultaneous jobs ( goatbrot ) and one final node to stitch them together ( montage ). This means we have five jobs. We're going to run goatbrot with more iterations (100,000) so each job will take longer to run. You can create your five jobs. 
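(As an optional aside, because the four goatbrot submit files differ only in their arguments and output names, DAGMan's VARS feature can pass per-node values into a single shared submit file instead. The sketch below is purely illustrative, with made-up file names; the exercise itself uses separate submit files, which works just as well.)

```
# goatbrot.dag (sketch) -- one parameterized submit file plus per-node VARS
JOB g1 goatbrot-tile.sub
VARS g1 center="-0.75,0.75" tile="tile_0_0"
JOB g2 goatbrot-tile.sub
VARS g2 center="0.75,0.75" tile="tile_0_1"
JOB g3 goatbrot-tile.sub
VARS g3 center="-0.75,-0.75" tile="tile_1_0"
JOB g4 goatbrot-tile.sub
VARS g4 center="0.75,-0.75" tile="tile_1_1"
```

```
# goatbrot-tile.sub (sketch) -- the VARS values are available as $(center) and $(tile)
executable = goatbrot
arguments = -i 100000 -c $(center) -w 1.5 -s 500,500 -o $(tile).ppm
log = goatbrot.log
output = $(tile).out
error = $(tile).err
request_memory = 1GB
request_disk = 1GB
request_cpus = 1
queue
```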
The goatbrot jobs are very similar to each other, but they have slightly different parameters and output files.","title":"Make Your Job Submission Files"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot1sub","text":"executable = goatbrot arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm log = goatbrot.log output = goatbrot.out.0.0 error = goatbrot.err.0.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot1.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot2sub","text":"executable = goatbrot arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm log = goatbrot.log output = goatbrot.out.0.1 error = goatbrot.err.0.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot2.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot3sub","text":"executable = goatbrot arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm log = goatbrot.log output = goatbrot.out.1.0 error = goatbrot.err.1.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot3.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot4sub","text":"executable = goatbrot arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm log = goatbrot.log output = goatbrot.out.1.1 error = goatbrot.err.1.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot4.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#montagesub","text":"You should notice that the transfer_input_files statement refers to the files created by the other jobs. +SingularityImage = \"/cvmfs/singularity.opensciencegrid.org/htc/rocky:9\" executable = montage.sh arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandel-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Notice that the job specified by montage.sub uses a container image, as indicated by the +SingularityImage flag. This is because montage uses libraries that are not installed on the execution nodes. We use a container with montage installed and call it using the executable montage.sh ; thus we will need to create the file montage.sh . #!/bin/bash # Pass all arguments to montage montage \"$@\"","title":"montage.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#make-your-dag","text":"In a file called goatbrot.dag , you have your DAG specification: JOB g1 goatbrot1.sub JOB g2 goatbrot2.sub JOB g3 goatbrot3.sub JOB g4 goatbrot4.sub JOB montage montage.sub PARENT g1 g2 g3 g4 CHILD montage Ask yourself: do you know how we ensure that all the goatbrot commands can run simultaneously and all of them will complete before we run the montage job?","title":"Make your DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#running-the-dag","text":"Submit your DAG: username@learn $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 71. 
-----------------------------------------------------------------------","title":"Running the DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#watch-your-dag","text":"Let\u2019s follow the progress of the whole DAG: Use the condor_watch_q command to keep an eye on the running jobs. See more information about this tool here . username@learn $ condor_watch_q If you're quick enough, you may have seen DAGMan running as the lone job, before it submitted additional job nodes: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 - 1 222059.0 [=============================================================================] Total: 1 jobs; 1 running Updated at 2024-07-28 13:52:57 DAGMan has submitted the goatbrot jobs, but they haven't started running yet BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 4 1 - 5 222059.0 ... 222063.0 [===============--------------------------------------------------------------] Total: 5 jobs; 4 idle, 1 running Updated at 2024-07-28 13:53:53 They're running BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 5 - 5 222059.0 ... 222063.0 [=============================================================================] Total: 5 jobs; 5 running Updated at 2024-07-28 13:54:33 They finished, but DAGMan hasn't noticed yet. It only checks periodically: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 4 - 5 222059.0 ... 222063.0 [##############################################################===============] Total: 5 jobs; 4 completed, 1 running Updated at 2024-07-28 13:55:13 Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. Examine your results. For some reason, goatbrot prints everything to stderr, not stdout. username@learn $ cat goatbrot.err.0.0 Complex image: Center: -0.75 + 0.75i Width: 1.5 Height: 1.5 Upper Left: -1.5 + 1.5i Lower Right: 0 + 0i Output image: Filename: tile_0_0.ppm Width, Height: 500, 500 Theme: beej Antialiased: no Mandelbrot: Max Iterations: 100000 Continuous: no Goatbrot: Multithreading: not supported in this build Completed: 100.0% Examine your log files ( goatbrot.log and montage.log ) and DAGMan output file ( goatbrot.dag.dagman.out ). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file? As you did earlier, transfer the resulting mandel-from-dag.jpg to your computer so that you can view the image. Does the image look correct? Clean up your results by removing all of the goatbrot.dag.* files if you like. Be careful to not delete the goatbrot.dag file.","title":"Watch Your DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#bonus-challenge","text":"Re-run your DAG. When jobs are running, try condor_q -nobatch -dag . What does it do differently? Challenge, if you have time: Make a bigger DAG by making more tiles in the same area.","title":"Bonus Challenge"},{"location":"materials/workflows/part1-ex4-failed-dag/","text":"Workflows Exercise 1.4: Handling a DAG That Fails \u00b6 The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures. Background \u00b6 DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed. Breaking Things \u00b6 Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. 
Work in the same directory where you did the last DAG. Edit montage.sub to add a -h to the arguments. It will look like this, with the -h at the beginning of the highlighted line: executable = /usr/bin/montage arguments = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Submit the DAG again: username@ap40 $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 77. ----------------------------------------------------------------------- Use watch to watch the jobs until they finish. In a separate window, use tail --lines=500 -f goatbrot.dag.dagman.out to watch what DAGMan does. 06/22/24 17:57:41 Setting maximum accepts per cycle 8. 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP 06/22/24 17:57:41 ** /usr/bin/condor_dagman 06/22/24 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/22/24 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/22/24 17:57:41 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/22/24 17:57:41 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/22/24 17:57:41 ** PID = 26867 06/22/24 17:57:41 ** Log last touched time unavailable (No such file or directory) 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 Using config source: /etc/condor/condor_config 06/22/24 17:57:41 Using local config sources: 06/22/24 17:57:41 /etc/condor/config.d/00-chtc-global.conf 06/22/24 17:57:41 /etc/condor/config.d/01-chtc-submit.conf 06/22/24 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf 06/22/24 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf 06/22/24 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf 06/22/24 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf 06/22/24 17:57:41 /etc/condor/config.d/99-roy-extras.conf 06/22/24 17:57:41 /etc/condor/condor_config.local Below is where DAGMan realizes that the montage node failed: 06/22/24 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) 06/22/24 18:08:42 Node montage job proc (82.0.0) failed with status 1. 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Of 5 nodes total: 06/22/24 18:08:42 Done Pre Queued Post Ready Un-Ready Failed 06/22/24 18:08:42 === === === === === === === 06/22/24 18:08:42 4 0 0 0 0 0 1 06/22/24 18:08:42 0 job proc(s) currently held 06/22/24 18:08:42 Aborting DAG... 06/22/24 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 
06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of node category throttles 06/22/24 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/22/24 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/22/24 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. username@ap40 $ cat goatbrot.dag.rescue001 # Rescue DAG file, created after running # the goatbrot.dag DAG file # Created 6 /22/2024 23 :08:42 UTC # Rescue DAG version: 2 .0.1 ( partial ) # # Total number of Nodes: 5 # Nodes premarked DONE: 4 # Nodes that failed: 1 # montage, DONE g1 DONE g2 DONE g3 DONE g4 From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending -h argument. Change montage.sub to look like: executable = /usr/bin/montage arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. username@ap40 $ condor_submit_dag goatbrot.dag Running rescue DAG 1 ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 83. ----------------------------------------------------------------------- username@ap40 $ tail -f goatbrot.dag.dagman.out 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP 06/23/24 11:30:53 ** /usr/bin/condor_dagman 06/23/24 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/23/24 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/23/24 11:30:53 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/23/24 11:30:53 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/23/24 11:30:53 ** PID = 28576 06/23/24 11:30:53 ** Log last touched 6/22 18:08:42 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 Using config source: /etc/condor/condor_config ... Here is where DAGMAN notices that there is a rescue DAG 06/23/24 11:30:53 Parsing 1 dagfiles 06/23/24 11:30:53 Parsing goatbrot.dag ... 
06/23/24 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file 06/23/24 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 06/23/24 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 06/23/24 11:30:53 Dag contains 5 total jobs Shortly thereafter it sees that four jobs have already finished. 06/23/24 11:31:05 Bootstrapping... 06/23/24 11:31:05 Number of pre-completed nodes: 4 06/23/24 11:31:05 Registering condor_event_timer... 06/23/24 11:31:06 Sleeping for one second for log file consistency 06/23/24 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log Here is where DAGMan resubmits the montage job and waits for it to complete. 06/23/24 11:31:07 Submitting Condor Node montage job(s)... 06/23/24 11:31:07 submitting: condor_submit -a dag_node_name' '=' 'montage -a +DAGManJobId' '=' '83 -a DAGManJobId' '=' '83 -a submit_event_notes' '=' 'DAG' 'Node:' 'montage -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '\"g1,g2,g3,g4\" montage.sub 06/23/24 11:31:07 From submit: Submitting job(s). 06/23/24 11:31:07 From submit: 1 job(s) submitted to cluster 84. 06/23/24 11:31:07 assigned Condor ID (84.0.0) 06/23/24 11:31:07 Just submitted 1 job this cycle... 06/23/24 11:31:07 Currently monitoring 1 Condor log file(s) 06/23/24 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) 06/23/24 11:31:07 Number of idle job procs: 1 06/23/24 11:31:07 Of 5 nodes total: 06/23/24 11:31:07 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:31:07 === === === === === === === 06/23/24 11:31:07 4 0 1 0 0 0 0 06/23/24 11:31:07 0 job proc(s) currently held 06/23/24 11:40:22 Currently monitoring 1 Condor log file(s) 06/23/24 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) This is where the montage finished. 06/23/24 11:40:22 Node montage job proc (84.0.0) completed successfully. 06/23/24 11:40:22 Node montage job completed 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Of 5 nodes total: 06/23/24 11:40:22 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:40:22 === === === === === === === 06/23/24 11:40:22 5 0 0 0 0 0 0 06/23/24 11:40:22 0 job proc(s) currently held And here DAGMan decides that the work is all done. 06/23/24 11:40:22 All jobs Completed! 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of node category throttles 06/23/24 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/23/24 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/23/24 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 Success! Now go ahead and clean up. Bonus Challenge \u00b6 If you have time, add an extra node to the DAG. Copy our original \"simple\" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. This is reasonable--many real world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure. Write a POST script that checks the return value. 
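As a hedged sketch of one way that could look (the node and file names here are illustrative, not from the exercise), the DAG entry and the POST script might be:

```
# In the DAG file: run check_exit.sh after the node's job finishes.
# DAGMan substitutes the job's exit code for the $RETURN macro.
JOB mostly_ok mostly_ok.sub
SCRIPT POST mostly_ok check_exit.sh $RETURN
```

```bash
#!/bin/bash
# check_exit.sh -- treat exit codes 0 and 1 from the job as success,
# anything else as failure. When a POST script is present, DAGMan uses
# the POST script's exit code (not the job's) to decide node success.
if [ "$1" -eq 0 ] || [ "$1" -eq 1 ]; then
    exit 0
fi
exit 1
```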
Check the HTCondor manual to see how to describe your post script.","title":"1.4 - Handling jobs that fail with DAGMan"},{"location":"materials/workflows/part1-ex4-failed-dag/#workflows-exercise-14-handling-a-dag-that-fails","text":"The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures.","title":"Workflows Exercise 1.4: Handling a DAG That Fails"},{"location":"materials/workflows/part1-ex4-failed-dag/#background","text":"DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed.","title":"Background"},{"location":"materials/workflows/part1-ex4-failed-dag/#breaking-things","text":"Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. Work in the same directory where you did the last DAG. Edit montage.sub to add a -h to the arguments. It will look like this, with the -h at the beginning of the highlighted line: executable = /usr/bin/montage arguments = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Submit the DAG again: username@ap40 $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 77. ----------------------------------------------------------------------- Use watch to watch the jobs until they finish. In a separate window, use tail --lines=500 -f goatbrot.dag.dagman.out to watch what DAGMan does. 06/22/24 17:57:41 Setting maximum accepts per cycle 8. 
06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP 06/22/24 17:57:41 ** /usr/bin/condor_dagman 06/22/24 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/22/24 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/22/24 17:57:41 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/22/24 17:57:41 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/22/24 17:57:41 ** PID = 26867 06/22/24 17:57:41 ** Log last touched time unavailable (No such file or directory) 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 Using config source: /etc/condor/condor_config 06/22/24 17:57:41 Using local config sources: 06/22/24 17:57:41 /etc/condor/config.d/00-chtc-global.conf 06/22/24 17:57:41 /etc/condor/config.d/01-chtc-submit.conf 06/22/24 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf 06/22/24 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf 06/22/24 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf 06/22/24 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf 06/22/24 17:57:41 /etc/condor/config.d/99-roy-extras.conf 06/22/24 17:57:41 /etc/condor/condor_config.local Below is where DAGMan realizes that the montage node failed: 06/22/24 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) 06/22/24 18:08:42 Node montage job proc (82.0.0) failed with status 1. 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Of 5 nodes total: 06/22/24 18:08:42 Done Pre Queued Post Ready Un-Ready Failed 06/22/24 18:08:42 === === === === === === === 06/22/24 18:08:42 4 0 0 0 0 0 1 06/22/24 18:08:42 0 job proc(s) currently held 06/22/24 18:08:42 Aborting DAG... 06/22/24 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of node category throttles 06/22/24 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/22/24 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/22/24 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. username@ap40 $ cat goatbrot.dag.rescue001 # Rescue DAG file, created after running # the goatbrot.dag DAG file # Created 6 /22/2024 23 :08:42 UTC # Rescue DAG version: 2 .0.1 ( partial ) # # Total number of Nodes: 5 # Nodes premarked DONE: 4 # Nodes that failed: 1 # montage, DONE g1 DONE g2 DONE g3 DONE g4 From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending -h argument. 
Change montage.sub to look like: executable = /usr/bin/montage arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. username@ap40 $ condor_submit_dag goatbrot.dag Running rescue DAG 1 ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 83. ----------------------------------------------------------------------- username@ap40 $ tail -f goatbrot.dag.dagman.out 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP 06/23/24 11:30:53 ** /usr/bin/condor_dagman 06/23/24 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/23/24 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/23/24 11:30:53 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/23/24 11:30:53 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/23/24 11:30:53 ** PID = 28576 06/23/24 11:30:53 ** Log last touched 6/22 18:08:42 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 Using config source: /etc/condor/condor_config ... Here is where DAGMAN notices that there is a rescue DAG 06/23/24 11:30:53 Parsing 1 dagfiles 06/23/24 11:30:53 Parsing goatbrot.dag ... 06/23/24 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file 06/23/24 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 06/23/24 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 06/23/24 11:30:53 Dag contains 5 total jobs Shortly thereafter it sees that four jobs have already finished. 06/23/24 11:31:05 Bootstrapping... 06/23/24 11:31:05 Number of pre-completed nodes: 4 06/23/24 11:31:05 Registering condor_event_timer... 06/23/24 11:31:06 Sleeping for one second for log file consistency 06/23/24 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log Here is where DAGMan resubmits the montage job and waits for it to complete. 06/23/24 11:31:07 Submitting Condor Node montage job(s)... 06/23/24 11:31:07 submitting: condor_submit -a dag_node_name' '=' 'montage -a +DAGManJobId' '=' '83 -a DAGManJobId' '=' '83 -a submit_event_notes' '=' 'DAG' 'Node:' 'montage -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '\"g1,g2,g3,g4\" montage.sub 06/23/24 11:31:07 From submit: Submitting job(s). 06/23/24 11:31:07 From submit: 1 job(s) submitted to cluster 84. 06/23/24 11:31:07 assigned Condor ID (84.0.0) 06/23/24 11:31:07 Just submitted 1 job this cycle... 
06/23/24 11:31:07 Currently monitoring 1 Condor log file(s) 06/23/24 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) 06/23/24 11:31:07 Number of idle job procs: 1 06/23/24 11:31:07 Of 5 nodes total: 06/23/24 11:31:07 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:31:07 === === === === === === === 06/23/24 11:31:07 4 0 1 0 0 0 0 06/23/24 11:31:07 0 job proc(s) currently held 06/23/24 11:40:22 Currently monitoring 1 Condor log file(s) 06/23/24 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) This is where the montage finished. 06/23/24 11:40:22 Node montage job proc (84.0.0) completed successfully. 06/23/24 11:40:22 Node montage job completed 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Of 5 nodes total: 06/23/24 11:40:22 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:40:22 === === === === === === === 06/23/24 11:40:22 5 0 0 0 0 0 0 06/23/24 11:40:22 0 job proc(s) currently held And here DAGMan decides that the work is all done. 06/23/24 11:40:22 All jobs Completed! 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of node category throttles 06/23/24 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/23/24 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/23/24 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 Success! Now go ahead and clean up.","title":"Breaking Things"},{"location":"materials/workflows/part1-ex4-failed-dag/#bonus-challenge","text":"If you have time, add an extra node to the DAG. Copy our original \"simple\" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. This is reasonable--many real world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure. Write a POST script that checks the return value. Check the HTCondor manual to see how to describe your post script.","title":"Bonus Challenge"},{"location":"materials/workflows/part1-ex5-challenges/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus Workflows Exercise 1.5: YOUR Jobs and More on Workflows \u00b6 The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job. Challenge 1 \u00b6 Do you have any extra computation that needs to be done? Real work, from your life outside this summer school? If so, try it out on our HTCondor pool. Can't think of something? How about one of the existing distributed computing programs like distributed.net , SETI@home , Einstien@Home or others that you know. We prefer that you do your own work rather than one of these projects, but they are options. Challenge 2 \u00b6 Try to generate other Mandelbrot images. 
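One hedged way to start (purely illustrative; the point, frame count, and zoom factor are arbitrary choices) is to script a zoom sequence toward a single point, halving the view width each frame, and then convert and assemble the frames as described in the challenge text below:

```bash
#!/bin/bash
# Illustrative sketch: print goatbrot commands for a zoom sequence of frames
# centered on one point, halving the width of the view each time.
CX=0.3958608398437499       # an arbitrary interesting point (see the list below)
CY=-0.13431445312500012
W=2                         # starting width of the view
for ((frame = 0; frame < 20; frame++)); do
  echo "./goatbrot -i 1000 -o frame_$(printf '%03d' "$frame").ppm -c ${CX},${CY} -w ${W} -s 500,500"
  W=$(awk -v w="$W" 'BEGIN { print w / 2 }')
done
```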
Some possible locations to look at with goatbroat: goatbrot -i 1000 -o ex1.ppm -c 0.0016437219722,-0.8224676332988 -w 2e-11 -s 1000,1000 goatbrot -i 1000 -o ex2.ppm -c 0.3958608398437499,-0.13431445312500012 -w 0.0002197265625 -s 1000,1000 goatbrot -i 1000 -o ex3.ppm -c 0.3965859374999999,-0.13378125000000013 -w 0.003515625 -s 1000,1000 You can convert ppm files with convert , like so: convert ex1.ppm ex1.jpg Now make a movie! Make a series of images where you zoom into a point in the Mandelbrot set gradually. (Those points above may work well.) Assemble these images with the \"convert\" tool which will let you convert a set of JPEG files into an MPEG movie. Challenge 3 \u00b6 Try out Pegasus. Pegasus is a workflow manager that uses DAGMan and can work in a grid environment and/or run across different types of clusters (with other queueing software). It will create the DAGs from abstract DAG descriptions and ensure they are appropriate for the location of the data and computation. Links to more information: Pegasus Website Pegasus Documentation Pegasus on OSG If you have any questions or problems, please feel free to contact the Pegasus team by emailing pegasus-support@isi.edu","title":"1.5 - Workflow Challenges"},{"location":"materials/workflows/part1-ex5-challenges/#bonus-workflows-exercise-15-your-jobs-and-more-on-workflows","text":"The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job.","title":"Bonus Workflows Exercise 1.5: YOUR Jobs and More on Workflows"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-1","text":"Do you have any extra computation that needs to be done? Real work, from your life outside this summer school? If so, try it out on our HTCondor pool. Can't think of something? How about one of the existing distributed computing programs like distributed.net , SETI@home , Einstien@Home or others that you know. We prefer that you do your own work rather than one of these projects, but they are options.","title":"Challenge 1"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-2","text":"Try to generate other Mandelbrot images. Some possible locations to look at with goatbroat: goatbrot -i 1000 -o ex1.ppm -c 0.0016437219722,-0.8224676332988 -w 2e-11 -s 1000,1000 goatbrot -i 1000 -o ex2.ppm -c 0.3958608398437499,-0.13431445312500012 -w 0.0002197265625 -s 1000,1000 goatbrot -i 1000 -o ex3.ppm -c 0.3965859374999999,-0.13378125000000013 -w 0.003515625 -s 1000,1000 You can convert ppm files with convert , like so: convert ex1.ppm ex1.jpg Now make a movie! Make a series of images where you zoom into a point in the Mandelbrot set gradually. (Those points above may work well.) Assemble these images with the \"convert\" tool which will let you convert a set of JPEG files into an MPEG movie.","title":"Challenge 2"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-3","text":"Try out Pegasus. Pegasus is a workflow manager that uses DAGMan and can work in a grid environment and/or run across different types of clusters (with other queueing software). It will create the DAGs from abstract DAG descriptions and ensure they are appropriate for the location of the data and computation. 
Links to more information: Pegasus Website Pegasus Documentation Pegasus on OSG If you have any questions or problems, please feel free to contact the Pegasus team by emailing pegasus-support@isi.edu","title":"Challenge 3"}]} \ No newline at end of file +{"config":{"lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"OSG School 2024 \u00b6 Could you transform your research with vast amounts of computing? Learn how this summer at the lovely University of Wisconsin\u2013Madison During the School, August 5\u20139 , you will learn to use high-throughput computing (HTC) systems \u2014 at your own campus or using the national-scale Open Science Pool \u2014 to run large-scale computing applications that are at the heart of today\u2019s cutting-edge science. Through lectures, discussions, and lots of hands-on activities with experienced OSG staff, you will learn how HTC systems work, how to run and manage lots of jobs and huge datasets to implement a scientific computing workflow, and where to get more information and help. The school is ideal for: Researchers (especially graduate students and post-docs) in any research area for which large-scale computing is a vital part of the research process; Anyone (especially students and staff) who supports researchers who are current or potential users of high-throughput computing; Instructors (at the post-secondary level) who teach future researchers and see value in integrating high-throughput computing into their curriculum. People accepted to this program will receive financial support for basic travel and local costs associated with the School. Applications \u00b6 Applications are now closed for 2024. The deadline for applications was Monday, 1 April 2024. If still needed, have someone email a letter of recommendation for you to school@osg-htc.org (ideally PDF or plain text) For the letter of recommendation, ask someone who knows you professionally \u2014 ideally a faculty member or other supervisor. They should clearly identify your name and the \u201cOSG School 2024\u201d in the subject line and letter, so that we can associate your application and letter. Applicants: We plan to review applications in April and invite participants by early May or so. We will contact you once decisions have been made. Thank you for your patience! Contact Us \u00b6 The OSG School is the premier training event of the OSG Consortium and is held annually at UW\u2013Madison. If you have any questions about the event, feel free to email us: school@osg-htc.org OSGSchool * Image provided by Wikimedia user Av9 under Creative Commons License","title":"Home"},{"location":"#osg-school-2024","text":"Could you transform your research with vast amounts of computing? Learn how this summer at the lovely University of Wisconsin\u2013Madison During the School, August 5\u20139 , you will learn to use high-throughput computing (HTC) systems \u2014 at your own campus or using the national-scale Open Science Pool \u2014 to run large-scale computing applications that are at the heart of today\u2019s cutting-edge science. Through lectures, discussions, and lots of hands-on activities with experienced OSG staff, you will learn how HTC systems work, how to run and manage lots of jobs and huge datasets to implement a scientific computing workflow, and where to get more information and help. 
The school is ideal for: Researchers (especially graduate students and post-docs) in any research area for which large-scale computing is a vital part of the research process; Anyone (especially students and staff) who supports researchers who are current or potential users of high-throughput computing; Instructors (at the post-secondary level) who teach future researchers and see value in integrating high-throughput computing into their curriculum. People accepted to this program will receive financial support for basic travel and local costs associated with the School.","title":"OSG School 2024"},{"location":"#applications","text":"Applications are now closed for 2024. The deadline for applications was Monday, 1 April 2024. If still needed, have someone email a letter of recommendation for you to school@osg-htc.org (ideally PDF or plain text) For the letter of recommendation, ask someone who knows you professionally \u2014 ideally a faculty member or other supervisor. They should clearly identify your name and the \u201cOSG School 2024\u201d in the subject line and letter, so that we can associate your application and letter. Applicants: We plan to review applications in April and invite participants by early May or so. We will contact you once decisions have been made. Thank you for your patience!","title":"Applications"},{"location":"#contact-us","text":"The OSG School is the premier training event of the OSG Consortium and is held annually at UW\u2013Madison. If you have any questions about the event, feel free to email us: school@osg-htc.org OSGSchool * Image provided by Wikimedia user Av9 under Creative Commons License","title":"Contact Us"},{"location":"health/","text":"Health Guidelines \u00b6 The OSG School 2024 at the UW\u2013Madison welcomes participants from around the United States plus India, Mali, and Uganda. This page contains health guidelines for this year\u2019s School. While the focus is in COVID-19, most of these guidelines also apply to preventing the spread of other infectious illnesses (flu, colds, GI viruses, etc.). It is very important to us that everyone stays safe and healthy throughout the whole School. We will have the best event possible if everyone stays well! There are no hard rules here, just a reminder that we are all in this together . If you have any questions, concerns, or comments about these guidelines, please email us at school@osg-htc.org or message us on Slack. Before Traveling to the School \u00b6 If you tested positive for COVID recently (past 2 weeks or so), please follow CDC guidelines for what to do when sick. Even if you have no symptoms or known exposure, consider taking a rapid test before traveling to improve the odds that you are not bringing COVID to the event. If you DO test positive before the School, or if you do not feel well enough to travel for any reason, please let us know immediately so we can accommodate (see below for remote participation options). While in Madison \u00b6 Wearing a mask is welcome at the School itself when indoors or in other poorly ventilated areas. We can provide a few high-quality KN95 masks for people who would like them and have not brought their own. We encourage everyone to consider outdoor dining options when reasonable \u2014 not just for reducing risk, but also because Madison is beautiful in the summer! While in Madison, if you feel unwell, stay home or at the hotel. 
When you can, let School staff know why you are absent \u2014 by email or Slack \u2014 and if you would like to keep up with exercises and lectures, we will help support you remotely (see below). If you experience possible symptoms of COVID-19 , or test positive for COVID-19, follow CDC guidelines for what to do when sick. Remote Attendance \u00b6 If you are in Madison and are sick or quarantined, or if you are not able to travel to Madison, we will do our best to support you via remote attendance. We learned a lot about remote events during the pandemic! We can: Try to stream lectures live over Zoom Post all slides and exercises on the website Be active on Slack and email Conduct one-on-one consultations over Zoom As long as you feel up to it, we will do our best to support you during the School.","title":"Health Guidelines"},{"location":"health/#health-guidelines","text":"The OSG School 2024 at the UW\u2013Madison welcomes participants from around the United States plus India, Mali, and Uganda. This page contains health guidelines for this year\u2019s School. While the focus is in COVID-19, most of these guidelines also apply to preventing the spread of other infectious illnesses (flu, colds, GI viruses, etc.). It is very important to us that everyone stays safe and healthy throughout the whole School. We will have the best event possible if everyone stays well! There are no hard rules here, just a reminder that we are all in this together . If you have any questions, concerns, or comments about these guidelines, please email us at school@osg-htc.org or message us on Slack.","title":"Health Guidelines"},{"location":"health/#before-traveling-to-the-school","text":"If you tested positive for COVID recently (past 2 weeks or so), please follow CDC guidelines for what to do when sick. Even if you have no symptoms or known exposure, consider taking a rapid test before traveling to improve the odds that you are not bringing COVID to the event. If you DO test positive before the School, or if you do not feel well enough to travel for any reason, please let us know immediately so we can accommodate (see below for remote participation options).","title":"Before Traveling to the School"},{"location":"health/#while-in-madison","text":"Wearing a mask is welcome at the School itself when indoors or in other poorly ventilated areas. We can provide a few high-quality KN95 masks for people who would like them and have not brought their own. We encourage everyone to consider outdoor dining options when reasonable \u2014 not just for reducing risk, but also because Madison is beautiful in the summer! While in Madison, if you feel unwell, stay home or at the hotel. When you can, let School staff know why you are absent \u2014 by email or Slack \u2014 and if you would like to keep up with exercises and lectures, we will help support you remotely (see below). If you experience possible symptoms of COVID-19 , or test positive for COVID-19, follow CDC guidelines for what to do when sick.","title":"While in Madison"},{"location":"health/#remote-attendance","text":"If you are in Madison and are sick or quarantined, or if you are not able to travel to Madison, we will do our best to support you via remote attendance. We learned a lot about remote events during the pandemic! 
We can: Try to stream lectures live over Zoom Post all slides and exercises on the website Be active on Slack and email Conduct one-on-one consultations over Zoom As long as you feel up to it, we will do our best to support you during the School.","title":"Remote Attendance"},{"location":"schedule/","text":"August 4 (Sunday) \u00b6 Welcome Dinner for Participants and Staff All School participants and staff are encourage to attend! Time: Starting at 6:30 p.m. Location : Fluno Center , 601 University Avenue; Skyview Room, 8th floor There is construction all around the Fluno Center; use the map below to get to the entrance on University Avenue: Rachel, one of the School staff, will be in the hotel lobby to lead a group walking to the Fluno. She plans to arrive at the Park Hotel at about 5:40 p.m., and then the whole group will leave at about 6:00 p.m. Join the walking group, if you like! Otherwise, you are welcome to walk on your own, to get a ride (maybe even the hotel shuttle will be available), or to get there however you like. August 5 (Monday) \u00b6 Start End Event Instructor 8:00 8:45 Breakfast in Computer Sciences 1240 - 9:00 9:15 Welcome to the OSG School Tim C. 9:15 9:30 Lecture: Introduction to High Throughput Computing Christina 9:30 9:45 Exercise: Scaling Out Computing Worksheet - 9:45 10:15 Lecture: Introduction to HTCondor Andrew 10:15 10:30 Exercise: Log in - 10:30 10:45 Break - 10:45 12:15 Exercises: HTCondor basics (1.n series) - 12:15 13:15 Lunch in Computer Sciences (near 1240) - 13:15 14:00 Lecture: More HTCondor Andrew 14:15 15:00 Exercises: Many jobs (2.n series) - 15:00 15:15 Break - 15:15 15:30 Lecture: Setting goals for the School and beyond Rachel 15:30 17:00 Exercises: Goals and unfinished exercises Individual consultations - 19:00 20:30 Evening work sessions (optional) Memorial Union \u2013 Council Room (4th Floor) Note: Free, outdoor showing of Jaws (1975) at 9 p.m.! Rachel, Christina, Tim August 6 (Tuesday) \u00b6 Start End Event Instructor 8:00 8:45 Breakfast in Computer Sciences 1240 - 9:00 9:45 Lecture: Introduction to dHTC and the OSPool Tim C. 
9:45 10:30 Exercises: Using the OSPool - 10:30 10:45 Break Travel document collection, [if needed](logistics/visas.md) - 10:45 11:30 Lecture: Troubleshooting jobs Showmic 11:30 12:15 Exercises: Basic troubleshooting tools - 12:15 13:30 Lunch in Computer Sciences (near 1240) 13:15: Return documents in 1240 - 13:30 14:45 Interactive: High Throughput Computing in action staff 14:45 15:00 Break - 15:00 15:45 Lecture: Software portability Rachel 15:45 17:00 Exercises: Software and unfinished exercises Individual consultations staff 19:00 20:30 Evening work session (optional) Christina, Amber, Tim August 7 (Wednesday) \u00b6 Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Lecture: Working with data Andrew 9:45 10:45 Exercises: Data - 10:45 11:00 Break - 11:00 12:00 HTC Showcase Part 1 \u25b6 Michael Gerard ; Nuclear Engineering & Engineering Physics \u201cUsing CHTC to optimize the Helically Symmetric eXperiment stellarator\u201d \u25b6 Bryce Johnson ; Morgridge Institute for Research & UW\u2013Madison Computer Sciences \u201cRunning millions of biophysical simulations with OSPool\u201d - 12:00 12:30 Open Q&A and discussion time staff 12:30 13:45 Lunch, Computer Sciences (Staff to direct) Optional Domain Lunches: Christina (math); Rachel (biology); Andrew/Amber (chemistry); Ian (ML); Tim (physics & astronomy) - 13:45 17:00 Afternoon off or optional work time Individual consultations staff 19:00 20:30 Evening work session (optional) Tim C., Showmic August 8 (Thursday) \u00b6 Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Lecture: Scaling Up/Independence in Research Computing Christina 9:45 10:45 Exercises: Scaling up - 10:45 11:00 Break - 11:00 12:00 Lecture: DAGMan Rachel 12:00 13:15 Lunch, Computer Sciences (Staff to direct) - 13:15 14:30 Exercises: DAGMan Work Time: Apply HTC to own research Individual consultations staff 14:30 14:45 Break - 14:45 15:45 Work Time: Apply HTC to own research Individual consultations staff 15:45 16:30 Lecture: Machine Learning Ian 19:00 20:30 Evening work session (optional) Andrew, Showmic August 9 (Friday) \u00b6 Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Checkpointing Work Time: Apply HTC to own research Showmic 9:45 10:30 Work time: Apply HTC to own research Individual consultations staff 10:30 10:45 Break - 10:45 11:30 Work time: Apply HTC to own research - 11:30 11:50 Group photo (details TBD) - 11:50 13:00 Lunch, Computer Sciences (Staff to direct) Optional: Introduction to Research Computing Facilitation Christina 13:00 14:00 HTC Showcase, Part 2 \u25b6 Saloni Bhogale ; \u201cTBD\u201d \u25b6 Dan Wright ; Civil & Environmental Engineering \u201cComputational hydroclimate research enabled by HTC\u201d - 14:00 14:30 Open Q&A Work time: Apply HTC to own research Break - 14:30 15:30 Lightning talks by volunteer participants Attendees 15:30 16:00 Open Q&A and work time staff 16:00 16:45 HTC and HTCondor Philosophy Greg? 16:45 17:15 Lecture: Forward Tim C.","title":"Schedule"},{"location":"schedule/#august-4-sunday","text":"Welcome Dinner for Participants and Staff All School participants and staff are encouraged to attend! Time: Starting at 6:30 p.m. Location : Fluno Center , 601 University Avenue; Skyview Room, 8th floor There is construction all around the Fluno Center; use the map below to get to the entrance on University Avenue: Rachel, one of the School staff, will be in the hotel lobby to lead a group walking to the Fluno. 
She plans to arrive at the Park Hotel at about 5:40 p.m., and then the whole group will leave at about 6:00 p.m. Join the walking group, if you like! Otherwise, you are welcome to walk on your own, to get a ride (maybe even the hotel shuttle will be available), or to get there however you like.","title":"August 4 (Sunday)"},{"location":"schedule/#august-5-monday","text":"Start End Event Instructor 8:00 8:45 Breakfast in Computer Sciences 1240 - 9:00 9:15 Welcome to the OSG School Tim C. 9:15 9:30 Lecture: Introduction to High Throughput Computing Christina 9:30 9:45 Exercise: Scaling Out Computing Worksheet - 9:45 10:15 Lecture: Introduction to HTCondor Andrew 10:15 10:30 Exercise: Log in - 10:30 10:45 Break - 10:45 12:15 Exercises: HTCondor basics (1.n series) - 12:15 13:15 Lunch in Computer Sciences (near 1240) - 13:15 14:00 Lecture: More HTCondor Andrew 14:15 15:00 Exercises: Many jobs (2.n series) - 15:00 15:15 Break - 15:15 15:30 Lecture: Setting goals for the School and beyond Rachel 15:30 17:00 Exercises: Goals and unfinished exercises Individual consultations - 19:00 20:30 Evening work sessions (optional) Memorial Union \u2013 Council Room (4th Floor) Note: Free, outdoor showing of Jaws (1975) at 9 p.m.! Rachel, Christina, Tim","title":"August 5 (Monday)"},{"location":"schedule/#august-6-tuesday","text":"Start End Event Instructor 8:00 8:45 Breakfast in Computer Sciences 1240 - 9:00 9:45 Lecture: Introduction to dHTC and the OSPool Tim C. 9:45 10:30 Exercises: Using the OSPool - 10:30 10:45 Break Travel document collection, [if needed](logistics/visas.md) - 10:45 11:30 Lecture: Troubleshooting jobs Showmic 11:30 12:15 Exercises: Basic troubleshooting tools - 12:15 13:30 Lunch in Computer Sciences (near 1240) 13:15: Return documents in 1240 - 13:30 14:45 Interactive: High Throughput Computing in action staff 14:45 15:00 Break - 15:00 15:45 Lecture: Software portability Rachel 15:45 17:00 Excersies: Software and unfinished exercises Individual consultations staff 19:00 20:30 Evening work session (optional) Christina, Amber, Tim","title":"August 6 (Tuesday)"},{"location":"schedule/#august-7-wednesday","text":"Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Lecture: Working with data Andrew 9:45 10:45 Exercises: Data - 10:45 11:00 Break - 11:00 12:00 HTC Showcase Part 1 \u25b6 Michael Gerard ; Nuclear Engineering & Engineering Physics \u201cUsing CHTC to optimize the Helically Symmetric eXperiment stellarator\u201d \u25b6 Bryce Johnson ; Morgridge Institute for Research & UW\u2013Madison Computer Sciences \u201cRunning millions of biophysical simulations with OSPool\u201d - 12:00 12:30 Open Q&A and discussion time staff 12:30 13:45 Lunch, Computer Sciences (Staff to direct) Optional Domain Lunches: Christina (math); Rachel (biology); Andrew/Amber (chemistry); Ian (ML); Tim (physics & astronomy) - 13:45 17:00 Afternoon off or optional work time Individual consultations staff 19:00 20:30 Evening work session (optional) Tim C., Showmic","title":"August 7 (Wednesday)"},{"location":"schedule/#august-8-thursday","text":"Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Lecture: Scaling Up/Independence in Research Computing Christina 9:45 10:45 Exercises: Scaling up - 10:45 11:00 Break - 11:00 12:00 Lecture DAGMan Rachel 12:00 13:15 Lunch, Computer Sciences (Staff to direct) - 13:15 14:30 Exercises: DAGMan Work Time: Apply HTC to own research Individual consultations staff 14:30 14:45 Break - 14:45 15:45 Work Time: Apply HTC 
to own research Individual consultations staff 15:45 16:30 Lecture: Machine Learning Ian 19:00 20:30 Evening work session (optional) Andrew, Showmic","title":"August 8 (Thursday)"},{"location":"schedule/#august-9-friday","text":"Start End Event Instructor 8:00 8:45 Breakfast, Computer Sciences 1240 - 9:00 9:45 Checkpointing Work Time: Apply HTC to own research Showmic 9:45 10:30 Work time: Apply HTC to own research Individual consultations staff 10:30 10:45 Break - 10:45 11:30 Work time: Apply HTC to own research - 11:30 11:50 Group photo (details TBD) - 11:50 13:00 Lunch, Computer Sciences (Staff to direct) Optional: Introduction to Research Computing Facilitation Christina 13:00 14:00 HTC Showcase, Part 2 \u25b6 Saloni Bhogale ; \u201cTBD\u201d \u25b6 Dan Wright ; Civil & Environmental Engineering \u201cComputational hydroclimate research enabled by HTC\u201d - 14:00 14:30 Open Q&A Work time: Apply HTC to own research Break - 14:30 15:30 Lightning talks by volunteer participants Attendees 15:30 16:00 Open Q&A and work time staff 16:00 16:45 HTC and HTCondor Philosophy Greg? 16:45 17:15 Lecture: Forward Tim C.","title":"August 9 (Friday)"},{"location":"logistics/","text":"OSG School 2024 Logistics \u00b6 The following pages describe some of the important information about your visit to Madison for the OSG School. Please read them carefully. There will be other pages with local details soon. Visa requirements for non-residents Travel planning to and from Madison Hotel information As always: If you have questions, email us at school@osg-htc.org . Use that email address for all emails about the organization of the School. General Information About the School Schedule \u00b6 Travel Schedule \u00b6 Most participants should plan to travel as follows: Arrive on Sunday, August 4, 2024, by about 5:00 p.m. (if possible). There is a welcome dinner on Sunday evening for all participants (including instructors), and then classes begin on Monday morning. This is a nice way to get to know each other and start the week. Depart on Saturday, August 10, 2024, any time. The School ends with a closing dinner on Friday evening, so it is best to stay that night. If we offered to pay for your hotel room, we will pay for the 6 nights of this schedule. Note: If we suggested other travel dates to you in an email, then use those dates instead! School Hours \u00b6 The School is generally Monday through Friday, 9:00 a.m. to about 5:00 p.m., except Wednesday afternoon. There will be optional work sessions on Monday, Tuesday, Wednesday, and Thursday evenings. A detailed schedule will be posted before the event. Contact Information \u00b6 If you have questions, do not wait to contact us! school@osg-htc.org","title":"General information"},{"location":"logistics/#osg-school-2024-logistics","text":"The following pages describe some of the important information about your visit to Madison for the OSG School. Please read them carefully. There will be other pages with local details soon. Visa requirements for non-residents Travel planning to and from Madison Hotel information As always: If you have questions, email us at school@osg-htc.org . 
Use that email address for all emails about the organization of the School.","title":"OSG School 2024 Logistics"},{"location":"logistics/#general-information-about-the-school-schedule","text":"","title":"General Information About the School Schedule"},{"location":"logistics/#travel-schedule","text":"Most participants should plan to travel as follows: Arrive on Sunday, August 4, 2024, by about 5:00 p.m. (if possible). There is a welcome dinner on Sunday evening for all participants (including instructors), and then classes begin on Monday morning. This is a nice way to get to know each other and start the week. Depart on Saturday, August 10, 2024, any time. The School ends with a closing dinner on Friday evening, so it is best to stay that night. If we offered to pay for your hotel room, we will pay for the 6 nights of this schedule. Note: If we suggested other travel dates to you in an email, then use those dates instead!","title":"Travel Schedule"},{"location":"logistics/#school-hours","text":"The School is generally Monday through Friday, 9:00 a.m. to about 5:00 p.m., except Wednesday afternoon. There will be optional work sessions on Monday, Tuesday, Wednesday, and Thursday evenings. A detailed schedule will be posted before the event.","title":"School Hours"},{"location":"logistics/#contact-information","text":"If you have questions, do not wait to contact us! school@osg-htc.org","title":"Contact Information"},{"location":"logistics/account-setup/","text":".hi { font-weight: bold; color: #FF6600; } Apply for Computing Access \u00b6 We will be using two different Access Points during the OSG School - ap40.uw.osg-htc.org and ap1.facility.path-cc.io . As soon as possible please request your account access using this link: OSG School Account Registration Instructions on setting up your account can be found using this guide: Log in to uw.osg-htc.org Access Points We strongly recommend going through the registration process and trying to log in before the School, ideally before your OSG orientation session. If you run into problems contact us at support@osg-htc.org .","title":"Account setup"},{"location":"logistics/account-setup/#apply-for-computing-access","text":"We will be using two different Access Points during the OSG School - ap40.uw.osg-htc.org and ap1.facility.path-cc.io . As soon as possible please request your account access using this link: OSG School Account Registration Instructions on setting up your account can be found using this guide: Log in to uw.osg-htc.org Access Points We strongly recommend going through the registration process and trying to log in before the School, ideally before your OSG orientation session. If you run into problems contact us at support@osg-htc.org .","title":"Apply for Computing Access"},{"location":"logistics/dining/","text":"Dining \u00b6 The School provides some catered meals as a group, and you are on your own for others. When on your own, there are many dining options in Madison between the School and your hotel, especially on State Street which is only blocks away from both locations. Restaurants right on and very near to the Capitol Square, onto which the hotel faces, tend to be a little more expensive. As you go toward campus on State Street or neighboring streets, prices tend to go down. But of course, there are exceptions in both directions! It is reasonable to ask to see a menu before ordering or being seated and decide whether to stay. 
Food Options Near the Hotel \u00b6 Use a mapping app or rating services like Yelp to look for food options. For example: Food Options Near the School \u00b6 There are not a lot of great food options very close to the School, but feel free to ask School staff for suggestions.","title":"Dining options"},{"location":"logistics/dining/#dining","text":"The School provides some catered meals as a group, and you are on your own for others. When on your own, there are many dining options in Madison between the School and your hotel, especially on State Street which is only blocks away from both locations. Restaurants right on and very near to the Capitol Square, onto which the hotel faces, tend to be a little more expensive. As you go toward campus on State Street or neighboring streets, prices tend to go down. But of course, there are exceptions in both directions! It is reasonable to ask to see a menu before ordering or being seated and decide whether to stay.","title":"Dining"},{"location":"logistics/dining/#food-options-near-the-hotel","text":"Use a mapping app or rating services like Yelp to look for food options. For example:","title":"Food Options Near the Hotel"},{"location":"logistics/dining/#food-options-near-the-school","text":"There are not a lot of great food options very close to the School, but feel free to ask School staff for suggestions.","title":"Food Options Near the School"},{"location":"logistics/fun-day/","text":"Fun Activity Ideas While in Madison \u00b6 Free \u00b6 Narrated tour via UW-Madison app : Discover UW\u2013Madison using our free mobile app featuring a student-led narrated tour that is self-guided. The tour includes information about buildings, academics, transportation, housing, and all things surrounding the student experience. Start at Union South, 1308 W Dayton St (~1 minute walk from School) UW\u2013Madison Geology Museum : Large collection of geological specimens. Across Dayton Street from the School building. 1215 Dayton Street (~2 minute walk from School) L.R. Ingersoll Physics Museum : Small museum of Physics objects and demonstrations. Very short walk from the School building: Chamberlin Hall, 1150 University Avenue. (~6 minute walk from School) Terrace Open Mic Night : Enjoy a night out where all styles of music, comedy, spoken word, poetry, and more take the stage. Performances start at 7 PM on Wednesday. 800 Langdon Street (~15 minute walk from School) Tour of Wisconsin State Capitol : Tours start at 1, 2, 3, and 4 p.m. and last about 45 minutes. 2 E Main Street (~29 minute walk from School and across the Park Hotel) Henry Vilas Zoo : One-mile walk south of Computer Sciences: 702 South Randall Avenue. (~18 minute walk from School) Take a stroll or a ride on The Lakeshore Path : Reach the infamous Picnic Point or take your trip to the Arboretum! Cost \u00b6 Rent a Bcycle : Take advantage of Madison's many bike paths throughout Madison. Camp Randall Guided Tour : 1440 Monroe St; Tour starts promptly at 2:30 PM on Wednesday and will approximately last one hour; $10 per person (~8 minute walk from School) Paddling rentals on Lake Mendota : Paddling rentals, including paddleboards, kayak, and canoes. Memorial Union Terrace, $18 per hour. 800 Langdon Street (~15 minute walk from School) Tour of First Unitarian Society\u2019s Meeting House : The Landmark Auditorium was designed by Frank Lloyd Wright. $15 per person ($12.50 if booked online in advance), up to 10 people. 
900 University Bay Drive (~38 minute walk; Bus accessible, with close stop) Olbrich Botanical Gardens : 16 acres outdoor (FREE); indoor: $6 conservatory; $8 butterfly house. 3330 Atwood Avenue (~15 minute drive from School; Bus accessible, with close stop) Disclaimer \u00b6 The Chazen Museum of Art has a summer closure from August 5th-9th. Madison Museum of Contemporary Art (MMoCA) is closed on Wednesdays.","title":"Madison Fun Day"},{"location":"logistics/fun-day/#fun-activity-ideas-while-in-madison","text":"","title":"Fun Activity Ideas While in Madison"},{"location":"logistics/fun-day/#free","text":"Narrated tour via UW-Madison app : Discover UW\u2013Madison using our free mobile app featuring a student-led narrated tour that is self-guided. The tour includes information about buildings, academics, transportation, housing, and all things surrounding the student experience. Start at Union South, 1308 W Dayton St (~1 minute walk from School) UW\u2013Madison Geology Museum : Large collection of geological specimens. Across Dayton Street from the School building. 1215 Dayton Street (~2 minute walk from School) L.R. Ingersoll Physics Museum : Small museum of Physics objects and demonstrations. Very short walk from the School building: Chamberlin Hall, 1150 University Avenue. (~6 minute walk from School) Terrace Open Mic Night : Enjoy a night out where all styles of music, comedy, spoken word, poetry, and more take the stage. Performances start at 7 PM on Wednesday. 800 Langdon Street (~15 minute walk from School) Tour of Wisconsin State Capitol : Tours start at 1, 2, 3, and 4 p.m. and last about 45 minutes. 2 E Main Street (~29 minute walk from School and across the Park Hotel) Henry Vilas Zoo : One-mile walk south of Computer Sciences: 702 South Randall Avenue. (~18 minute walk from School) Take a stroll or a ride on The Lakeshore Path : Reach the infamous Picnic Point or take your trip to the Arboretum!","title":"Free"},{"location":"logistics/fun-day/#cost","text":"Rent a Bcycle : Take advantage of Madison's many bike paths throughout Madison. Camp Randall Guided Tour : 1440 Monroe St; Tour starts promptly at 2:30 PM on Wednesday and will approximately last one hour; $10 per person (~8 minute walk from School) Paddling rentals on Lake Mendota : Paddling rentals, including paddleboards, kayak, and canoes. Memorial Union Terrace, $18 per hour. 800 Langdon Street (~15 minute walk from School) Tour of First Unitarian Society\u2019s Meeting House : The Landmark Auditorium was designed by Frank Lloyd Wright. $15 per person ($12.50 if booked online in advance), up to 10 people. 900 University Bay Drive (~38 minute walk; Bus accessible, with close stop) Olbrich Botanical Gardens : 16 acres outdoor (FREE); indoor: $6 conservatory; $8 butterfly house. 3330 Atwood Avenue (~15 minute drive from School; Bus accessible, with close stop)","title":"Cost"},{"location":"logistics/fun-day/#disclaimer","text":"The Chazen Museum of Art has a summer closure from August 5th-9th. Madison Museum of Contemporary Art (MMoCA) is closed on Wednesdays.","title":"Disclaimer"},{"location":"logistics/hotel/","text":".hi { font-weight: bold; color: #FF6600; } Hotel Information \u00b6 We reserved a block of rooms at an area hotel for participants from outside Madison. Best Western Premier Park Hotel 22 South Carroll Street, Madison, WI +1 (608) 285\u20118000 Please note: We will reserve your room for you, so do not contact the hotel yourself to reserve a room. Exceptions to this rule are rare and clearly communicated. 
Other important hotel information: Before the School, we will send you an email with your hotel confirmation number We pay only for basic room costs \u2014 you must provide a credit card to cover extra costs There is one School participant per room; to have friends or family stay with you, please ask us now Check-In Time \u00b6 The (earliest) check-in time at the hotel is 4 p.m. on your day of arrival. If you are arriving earlier, you have options: Ask the hotel if it is possible to check in earlier than 4 p.m. It is up to the hotel to decide if they can meet your request. If there is any additional expense required, you must pay that yourself. Ask the hotel to put your bags in a safe spot and enjoy Madison until 4 p.m. or later. Keep your bags with you and enjoy Madison until 4 p.m. or later. Check-Out Time \u00b6 The (latest) check-out time from the hotel is 11 a.m. on your day of departure. If you are leaving later, you have options: Ask the hotel to put your bags in a safe spot and enjoy Madison until it is time to leave. Keep your bags with you and enjoy Madison until it is time to leave. You are not required to travel directly from the hotel to the airport, but if you do, we may be able to help you arrange to use the free hotel shuttle.","title":"Hotel information"},{"location":"logistics/hotel/#hotel-information","text":"We reserved a block of rooms at an area hotel for participants from outside Madison. Best Western Premier Park Hotel 22 South Carroll Street, Madison, WI +1 (608) 285\u20118000 Please note: We will reserve your room for you, so do not contact the hotel yourself to reserve a room. Exceptions to this rule are rare and clearly communicated. Other important hotel information: Before the School, we will send you an email with your hotel confirmation number We pay only for basic room costs \u2014 you must provide a credit card to cover extra costs There is one School participant per room; to have friends or family stay with you, please ask us now","title":"Hotel Information"},{"location":"logistics/hotel/#check-in-time","text":"The (earliest) check-in time at the hotel is 4 p.m. on your day of arrival. If you are arriving earlier, you have options: Ask the hotel if it is possible to check in earlier than 4 p.m. It is up to the hotel to decide if they can meet your request. If there is any additional expense required, you must pay that yourself. Ask the hotel to put your bags in a safe spot and enjoy Madison until 4 p.m. or later. Keep your bags with you and enjoy Madison until 4 p.m. or later.","title":"Check-In Time"},{"location":"logistics/hotel/#check-out-time","text":"The (latest) check-out time from the hotel is 11 a.m. on your day of departure. If you are leaving later, you have options: Ask the hotel to put your bags in a safe spot and enjoy Madison until it is time to leave. Keep your bags with you and enjoy Madison until it is time to leave. You are not required to travel directly from the hotel to the airport, but if you do, we may be able to help you arrange to use the free hotel shuttle.","title":"Check-Out Time"},{"location":"logistics/local-transportation/","text":"Local Transportation \u00b6 You are responsible for your own transportation within Madison, but we will help coordinate and can reimburse costs between the airport and your hotel. Travel Between the Madison Airport and Your Hotel \u00b6 For travel between the Madison airport (Dane County Regional Airport) and the School hotel , the best option is the hotel shuttle service, when available. 
Otherwise, you may use a ride-sharing service or taxi. See below for details. We will help organize groups to take shuttles and taxis, based on arrival and departure times. Shuttle/taxi groups will be formed and emailed shortly before the School itself. Travel Between the Hotel and Campus \u00b6 For travel between the School hotel and the Computer Sciences building on campus, walking is a great option. Also, the hotel shuttle service may be available, especially if organized in advance. See below for details. Options for Getting Around \u00b6 Hotel Shuttle \u00b6 The Park Hotel operates a free shuttle service. The shuttle may not be available at all times, though, and it is best to plan ahead. Work with the hotel staff, individually or even better in groups, to use the shuttle. As noted above, we will help organize groups for the shuttle for airport arrivals on Sunday and departures on Saturday. To ask about the shuttle, either stop by the front desk of the hotel, or call +1 (608) 285-8000 and press 0 for the front desk. Explain that you are a guest at the hotel and ask if the shuttle is available for the number of people in your group; be clear about where you want to go from and to and at what time. We will send the hotel our list of groups who would like the shuttle for airport trips, but it is still best for the leader of each group to check with the hotel anyway. Walking \u00b6 It is easy to walk in and around the University of Wisconsin\u2013Madison campus, with many Madison landmarks within a mile of the School and your hotel. Use a mapping app or ask us or your hotel for a map. In particular, State Street \u2014 which connects the Capitol Square with the UW campus \u2014 is full of great restaurants and shops and is worth walking along while you are here. City of Madison Metro Bus Service \u00b6 Many Madison Metro buses stop near the hotel and pass through the University of Wisconsin\u2013Madison campus. Bus fare is $2.00, and if using a transfer ask the driver for a free transfer pass upon boarding. Google Maps is a great resource for finding the best bus routes to use in Madison, giving multiple route options for each trip. Additionally, the Madison Metro Website provides a web interface to plan your trip. Note Bus routes stop running around ~11pm each day. Taxis and Ride-Sharing Services \u00b6 Both Lyft and Uber are active in Madison, or you can choose from our local taxi companies, such as Madison Taxi and Union Cab . We cannot recommend any particular option, but those are some options we know about. Note We cannot reimburse for any taxi or rideshare service beyond the ride to and from the airport. Note We will need receipts for any ride-share or taxi fare over $25. Madison BCycle \u00b6 Madison is a great city to bike in, and there is even a short-term bike rental system called BCycle . Bcycles are available throughout the city , including near the hotel and around campus. Pricing for Bcycles can be found on their website and consist of several tiers. 
Note Unfortunately, we are not able to reimburse BCycle costs.","title":"Local transportation"},{"location":"logistics/local-transportation/#local-transportation","text":"You are responsible for your own transportation within Madison, but we will help coordinate and can reimburse costs between the airport and your hotel.","title":"Local Transportation"},{"location":"logistics/local-transportation/#travel-between-the-madison-airport-and-your-hotel","text":"For travel between the Madison airport (Dane County Regional Airport) and the School hotel , the best option is the hotel shuttle service, when available. Otherwise, you may use a ride-sharing service or taxi. See below for details. We will help organize groups to take shuttles and taxis, based on arrival and departure times. Shuttle/taxi groups will be formed and emailed shortly before the School itself.","title":"Travel Between the Madison Airport and Your Hotel"},{"location":"logistics/local-transportation/#travel-between-the-hotel-and-campus","text":"For travel between the School hotel and the Computer Sciences building on campus, walking is a great option. Also, the hotel shuttle service may be available, especially if organized in advance. See below for details.","title":"Travel Between the Hotel and Campus"},{"location":"logistics/local-transportation/#options-for-getting-around","text":"","title":"Options for Getting Around"},{"location":"logistics/local-transportation/#hotel-shuttle","text":"The Park Hotel operates a free shuttle service. The shuttle may not be available at all times, though, and it is best to plan ahead. Work with the hotel staff, individually or even better in groups, to use the shuttle. As noted above, we will help organize groups for the shuttle for airport arrivals on Sunday and departures on Saturday. To ask about the shuttle, either stop by the front desk of the hotel, or call +1 (608) 285-8000 and press 0 for the front desk. Explain that you are a guest at the hotel and ask if the shuttle is available for the number of people in your group; be clear about where you want to go from and to and at what time. We will send the hotel our list of groups who would like the shuttle for airport trips, but it is still best for the leader of each group to check with the hotel anyway.","title":"Hotel Shuttle"},{"location":"logistics/local-transportation/#walking","text":"It is easy to walk in and around the University of Wisconsin\u2013Madison campus, with many Madison landmarks within a mile of the School and your hotel. Use a mapping app or ask us or your hotel for a map. In particular, State Street \u2014 which connects the Capitol Square with the UW campus \u2014 is full of great restaurants and shops and is worth walking along while you are here.","title":"Walking"},{"location":"logistics/local-transportation/#city-of-madison-metro-bus-service","text":"Many Madison Metro buses stop near the hotel and pass through the University of Wisconsin\u2013Madison campus. Bus fare is $2.00, and if using a transfer ask the driver for a free transfer pass upon boarding. Google Maps is a great resource for finding the best bus routes to use in Madison, giving multiple route options for each trip. Additionally, the Madison Metro Website provides a web interface to plan your trip. 
Note Bus routes stop running around ~11pm each day.","title":"City of Madison Metro Bus Service"},{"location":"logistics/local-transportation/#taxis-and-ride-sharing-services","text":"Both Lyft and Uber are active in Madison, or you can choose from our local taxi companies, such as Madison Taxi and Union Cab . We cannot recommend any particular option, but those are some options we know about. Note We cannot reimburse for any taxi or rideshare service beyond the ride to and from the airport. Note We will need receipts for any ride-share or taxi fare over $25.","title":"Taxis and Ride-Sharing Services"},{"location":"logistics/local-transportation/#madison-bcycle","text":"Madison is a great city to bike in, and there is even a short-term bike rental system called BCycle . Bcycles are available throughout the city , including near the hotel and around campus. Pricing for Bcycles can be found on their website and consist of several tiers. Note Unfortunately, we are not able to reimburse BCycle costs.","title":"Madison BCycle"},{"location":"logistics/location/","text":"School Location \u00b6 The school will be held at the University of Wisconsin\u2013Madison in the Computer Sciences Building , located at 1210 West Dayton Street, Madison, WI, 53706 . This location is about 1.3 miles from your hotel. The main classroom is Room 1240 (see below). See the local transportation page for suggestions about getting around Madison. Computer Sciences Building, Room 1240 \u00b6 Most School sessions are held in Room 1240 . If you enter the building from Dayton Street: Enter straight into the building from the street Immediately turn left and go through two sets of doors Pass the elevator (on your right) and walk down the hallway 1240 is on your right up the few steps Generally, just follow signs for 1240. Restrooms \u00b6 There are restrooms across the hallway and a bit to the right of 1240. For those, or other options, just ask staff!","title":"School location"},{"location":"logistics/location/#school-location","text":"The school will be held at the University of Wisconsin\u2013Madison in the Computer Sciences Building , located at 1210 West Dayton Street, Madison, WI, 53706 . This location is about 1.3 miles from your hotel. The main classroom is Room 1240 (see below). See the local transportation page for suggestions about getting around Madison.","title":"School Location"},{"location":"logistics/location/#computer-sciences-building-room-1240","text":"Most School sessions are held in Room 1240 . If you enter the building from Dayton Street: Enter straight into the building from the street Immediately turn left and go through two sets of doors Pass the elevator (on your right) and walk down the hallway 1240 is on your right up the few steps Generally, just follow signs for 1240.","title":"Computer Sciences Building, Room 1240"},{"location":"logistics/location/#restrooms","text":"There are restrooms across the hallway and a bit to the right of 1240. For those, or other options, just ask staff!","title":"Restrooms"},{"location":"logistics/meals/","text":"Meal Information \u00b6 The School includes some group catered meals for all participants: Sunday (Aug. 4) \u2014 welcome dinner Monday (Aug. 5) \u2013 Friday (Aug. 9) \u2014 breakfast and lunch each day Friday (Aug. 9) \u2014 closing dinner Other meals not listed above are on your own. If you are not a member of the UW\u2013Madison community, we will reimburse you for the on-your-own meals, Monday through Thursday dinners; see below for details. 
Sorry, UW\u2013Madison folks: The rules say that we cannot reimburse you for meals here. For the meals on your own, you are welcome to join other participants and even staff! We can help with ideas and groups, if you like. There is another page with suggestions for finding dining options near the School and hotel. Catered Meals \u00b6 The catered breakfasts and lunches during the School (see above) will be served in the Computer Sciences Building. Breakfast is in the main auditorium, room 1240 , and lunch is nearby (staff will lead the way on Monday). There is nearby seating both inside and outside. Menus \u00b6 The catered meals should take into account all dietary needs that you told us about in the questionnaire. Check for labels! If you have questions, ask the catering staff (if present) or School staff. Some items, like gluten-free items, are provided in low quantities that are meant just for those people who requested them. Please do not take them unless they are for you. Sunday, August 4, 2024 \u00b6 Opening Dinner (6:30 PM - 8:30 PM) \u00b6 Location: Fluno Center - Skyview Room (on the 8th Floor) Cavatappi Pasta Gluten Free Pasta Cheese Lasagna Grilled Chicken Breast Homemade Chicken & Beef Meatballs Italian Vegetable Blend Breadsticks Marinara and Alfredo Sauce Caesar Salad Tiramisu Cannolis Includes Beverage Service Monday, August 5, 2024 \u00b6 Breakfast (8:00 AM - 9:00 AM) \u00b6 Badger Breakfast Turkey Sausage Links Vegan Sausage Patties Assorted Breakfast Pastries Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Regular Coffee Assorted Bottled Juice Hot Tea Lunch (12:15 PM - 1:15 PM) \u00b6 Southwest Buffet Tortilla Chips Red Salsa Spanish Rice Black Beans Beef Barbacoa Chicken Tinga Vegan Chorizo Crumble Flour Tortillas/Corn Tortillas for GF Shredded Lettuce Diced Tomatoes Jalapeno Shredded Cheddar Cheese Sour Cream Guacamole Assorted Soda, Water, and Sparkling Water PM Break (12:30 PM - 4:00 PM) \u00b6 Assorted Soda, Water, and Sparkling Water Regular Coffee Assorted Cookies Gluten Free Cookie Tuesday, August 6, 2024 \u00b6 Breakfast (8:00 AM - 9:00 AM) \u00b6 Fresh Cut Fruit Salad Assorted Muffins Gluten Free Muffin Mini Quiches Turkey Sausage Links Vegan Sausage Patties Assorted Bottled Juices Hot Tea Regular Coffee Lunch (12:15 PM - 1:15 PM) \u00b6 Italian Buffet Caesar Salad (croutons, cheese & Kalamata olives on the side) Caesar Dressing Garlic Breadsticks Pasta Gluten Free Pasta Marinara Sauce Sliced Grilled Chicken Breast Vegan Meatballs Assorted Soda, Water, and Sparkling Water PM Break (12:30 PM - 4:00 PM) \u00b6 Assorted Dessert Bars Granola Bars (GF) Regular Coffee Wednesday, August 7, 2024 \u00b6 Breakfast (8:00 AM - 9:00 AM) \u00b6 Turkey Sausage Links Vegan Sausage Patties Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Regular Coffee Assorted Bottled Juice Hot Tea Lunch (12:30 PM - 1:45 PM) \u00b6 Boxed Lunches Chicken Bacon Ranch Wraps Smoked Turkey Sandwiches Southwest Salads Cookies Gluten Free Cookie Assorted Chips Assorted Chips Mediterranean Antipasto Platter Vegetable Platter with Dill Dip Italian Cold Pasta Assorted Soda, Water, and Sparkling Water Thursday, August 8, 2024 \u00b6 Breakfast (8:00 AM - 9:00 AM) \u00b6 Buckingham Breakfast Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh Cut Fruit Salad Regular Coffee Assorted Bottled Juice Mini Quiches Turkey Sausage Links Vegan Sausage Patties Hot Tea Lunch (12:00 PM - 1:15 PM) \u00b6 Mediteranian Buffet Lemon 
Oregano Chicken Greek Salad with Olive Oil Vinaigrette Roasted Vegetable Couscous Stuffed Mediterranean Portobello Mushrooms (with and without feta) PM Break (12:30 PM - 4:30 PM) \u00b6 Regular Coffee Assorted Soda, Water, and Sparkling Water Assorted Cookies Gluten Free Cookie Friday, August 9, 2024 \u00b6 Breakfast (8:00 AM - 9:00 AM) \u00b6 Badger Breakfast Turkey Sausage Links Bacon Vegan Sausage Patties Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Cinnamon Rolls Regular Coffee Assorted Bottled Juice Hot Tea Lunch (12:00 PM - 1:00 PM) \u00b6 Wisconsin Tailgate Garden Salad with Ranch and Balsamic dressing Fried Wedge Potatoes with Ketchup Brats with Kraut Diced Onions Ketchup Dijon Mustard Hamburgers Veggie Burgers Hamburger Buns / Gluten Free Bun Lettuce, Tomato, Onion platter Sliced Cheddar Cheese platter Pickles Ketchup, Mustard, Mayo PM Break (12:30 PM - 4:00 PM) \u00b6 Assorted Soda, Water, and Sparkling Water Regular Coffee Brownies Closing Dinner (6:00 PM - 8:00 PM) \u00b6 Location: Union South - Industry (3rd Floor) Global Buffet Spinach, Strawberry, Shaved Red Onion, Sesame Poppy Seed Dressing Vegetables, Dips, Spreads, Pita Chips Chicken Tikka Masala Sake Salmon Jerk Tofu Basmati Rice Naan Includes choice of coffee station or assorted cold beverages Meal Reimbursement Tips \u00b6 Again, if you are not part of the UW\u2013Madison community, we can reimburse you for dinners Monday through Thursday. We have curated a page of some possible dining options to use as inspiration. Some tips for successful reimbursements: Keep receipts for your meals \u2013 if anything so that you remember how much meals cost! We can reimburse up to $35 for dinner, including tax and tip. If it is not on the receipt, be sure to write the tip amount yourself, so you do not forget. We cannot pay for any alcohol, although non-alcoholic drinks are OK \u2014 ideally, pay for alcohol separately. We will explain the reimbursement process in detail after the School, but the tips above will help.","title":"Meal information"},{"location":"logistics/meals/#meal-information","text":"The School includes some group catered meals for all participants: Sunday (Aug. 4) \u2014 welcome dinner Monday (Aug. 5) \u2013 Friday (Aug. 9) \u2014 breakfast and lunch each day Friday (Aug. 9) \u2014 closing dinner Other meals not listed above are on your own. If you are not a member of the UW\u2013Madison community, we will reimburse you for the on-your-own meals, Monday through Thursday dinners; see below for details. Sorry, UW\u2013Madison folks: The rules say that we cannot reimburse you for meals here. For the meals on your own, you are welcome to join other participants and even staff! We can help with ideas and groups, if you like. There is another page with suggestions for finding dining options near the School and hotel.","title":"Meal Information"},{"location":"logistics/meals/#catered-meals","text":"The catered breakfasts and lunches during the School (see above) will be served in the Computer Sciences Building. Breakfast is in the main auditorium, room 1240 , and lunch is nearby (staff will lead the way on Monday). There is nearby seating both inside and outside.","title":"Catered Meals"},{"location":"logistics/meals/#menus","text":"The catered meals should take into account all dietary needs that you told us about in the questionnaire. Check for labels! If you have questions, ask the catering staff (if present) or School staff. 
Some items, like gluten-free items, are provided in low quantities that are meant just for those people who requested them. Please do not take them unless they are for you.","title":"Menus"},{"location":"logistics/meals/#sunday-august-4-2024","text":"","title":"Sunday, August 4, 2024"},{"location":"logistics/meals/#opening-dinner-630-pm-830-pm","text":"Location: Fluno Center - Skyview Room (on the 8th Floor) Cavatappi Pasta Gluten Free Pasta Cheese Lasagna Grilled Chicken Breast Homemade Chicken & Beef Meatballs Italian Vegetable Blend Breadsticks Marinara and Alfredo Sauce Caesar Salad Tiramisu Cannolis Includes Beverage Service","title":"Opening Dinner (6:30 PM - 8:30 PM)"},{"location":"logistics/meals/#monday-august-5-2024","text":"","title":"Monday, August 5, 2024"},{"location":"logistics/meals/#breakfast-800-am-900-am","text":"Badger Breakfast Turkey Sausage Links Vegan Sausage Patties Assorted Breakfast Pastries Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Regular Coffee Assorted Bottled Juice Hot Tea","title":"Breakfast (8:00 AM - 9:00 AM)"},{"location":"logistics/meals/#lunch-1215-pm-115-pm","text":"Southwest Buffet Tortilla Chips Red Salsa Spanish Rice Black Beans Beef Barbacoa Chicken Tinga Vegan Chorizo Crumble Flour Tortillas/Corn Tortillas for GF Shredded Lettuce Diced Tomatoes Jalapeno Shredded Cheddar Cheese Sour Cream Guacamole Assorted Soda, Water, and Sparkling Water","title":"Lunch (12:15 PM - 1:15 PM)"},{"location":"logistics/meals/#pm-break-1230-pm-400-pm","text":"Assorted Soda, Water, and Sparkling Water Regular Coffee Assorted Cookies Gluten Free Cookie","title":"PM Break (12:30 PM - 4:00 PM)"},{"location":"logistics/meals/#tuesday-august-6-2024","text":"","title":"Tuesday, August 6, 2024"},{"location":"logistics/meals/#breakfast-800-am-900-am_1","text":"Fresh Cut Fruit Salad Assorted Muffins Gluten Free Muffin Mini Quiches Turkey Sausage Links Vegan Sausage Patties Assorted Bottled Juices Hot Tea Regular Coffee","title":"Breakfast (8:00 AM - 9:00 AM)"},{"location":"logistics/meals/#lunch-1215-pm-115-pm_1","text":"Italian Buffet Caesar Salad (croutons, cheese & Kalamata olives on the side) Caesar Dressing Garlic Breadsticks Pasta Gluten Free Pasta Marinara Sauce Sliced Grilled Chicken Breast Vegan Meatballs Assorted Soda, Water, and Sparkling Water","title":"Lunch (12:15 PM - 1:15 PM)"},{"location":"logistics/meals/#pm-break-1230-pm-400-pm_1","text":"Assorted Dessert Bars Granola Bars (GF) Regular Coffee","title":"PM Break (12:30 PM - 4:00 PM)"},{"location":"logistics/meals/#wednesday-august-7-2024","text":"","title":"Wednesday, August 7, 2024"},{"location":"logistics/meals/#breakfast-800-am-900-am_2","text":"Turkey Sausage Links Vegan Sausage Patties Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Regular Coffee Assorted Bottled Juice Hot Tea","title":"Breakfast (8:00 AM - 9:00 AM)"},{"location":"logistics/meals/#lunch-1230-pm-145-pm","text":"Boxed Lunches Chicken Bacon Ranch Wraps Smoked Turkey Sandwiches Southwest Salads Cookies Gluten Free Cookie Assorted Chips Assorted Chips Mediterranean Antipasto Platter Vegetable Platter with Dill Dip Italian Cold Pasta Assorted Soda, Water, and Sparkling Water","title":"Lunch (12:30 PM - 1:45 PM)"},{"location":"logistics/meals/#thursday-august-8-2024","text":"","title":"Thursday, August 8, 2024"},{"location":"logistics/meals/#breakfast-800-am-900-am_3","text":"Buckingham Breakfast Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh 
Cut Fruit Salad Regular Coffee Assorted Bottled Juice Mini Quiches Turkey Sausage Links Vegan Sausage Patties Hot Tea","title":"Breakfast (8:00 AM - 9:00 AM)"},{"location":"logistics/meals/#lunch-1200-pm-115-pm","text":"Mediteranian Buffet Lemon Oregano Chicken Greek Salad with Olive Oil Vinaigrette Roasted Vegetable Couscous Stuffed Mediterranean Portobello Mushrooms (with and without feta)","title":"Lunch (12:00 PM - 1:15 PM)"},{"location":"logistics/meals/#pm-break-1230-pm-430-pm","text":"Regular Coffee Assorted Soda, Water, and Sparkling Water Assorted Cookies Gluten Free Cookie","title":"PM Break (12:30 PM - 4:30 PM)"},{"location":"logistics/meals/#friday-august-9-2024","text":"","title":"Friday, August 9, 2024"},{"location":"logistics/meals/#breakfast-800-am-900-am_4","text":"Badger Breakfast Turkey Sausage Links Bacon Vegan Sausage Patties Assorted Breakfast Pastries Gluten Free Muffin Seasonal Fresh Cut Fruit Salad Breakfast Potatoes Scrambled Eggs Cinnamon Rolls Regular Coffee Assorted Bottled Juice Hot Tea","title":"Breakfast (8:00 AM - 9:00 AM)"},{"location":"logistics/meals/#lunch-1200-pm-100-pm","text":"Wisconsin Tailgate Garden Salad with Ranch and Balsamic dressing Fried Wedge Potatoes with Ketchup Brats with Kraut Diced Onions Ketchup Dijon Mustard Hamburgers Veggie Burgers Hamburger Buns / Gluten Free Bun Lettuce, Tomato, Onion platter Sliced Cheddar Cheese platter Pickles Ketchup, Mustard, Mayo","title":"Lunch (12:00 PM - 1:00 PM)"},{"location":"logistics/meals/#pm-break-1230-pm-400-pm_2","text":"Assorted Soda, Water, and Sparkling Water Regular Coffee Brownies","title":"PM Break (12:30 PM - 4:00 PM)"},{"location":"logistics/meals/#closing-dinner-600-pm-800-pm","text":"Location: Union South - Industry (3rd Floor) Global Buffet Spinach, Strawberry, Shaved Red Onion, Sesame Poppy Seed Dressing Vegetables, Dips, Spreads, Pita Chips Chicken Tikka Masala Sake Salmon Jerk Tofu Basmati Rice Naan Includes choice of coffee station or assorted cold beverages","title":"Closing Dinner (6:00 PM - 8:00 PM)"},{"location":"logistics/meals/#meal-reimbursement-tips","text":"Again, if you are not part of the UW\u2013Madison community, we can reimburse you for dinners Monday through Thursday. We have curated a page of some possible dining options to use as inspiration. Some tips for successful reimbursements: Keep receipts for your meals \u2013 if anything so that you remember how much meals cost! We can reimburse up to $35 for dinner, including tax and tip. If it is not on the receipt, be sure to write the tip amount yourself, so you do not forget. We cannot pay for any alcohol, although non-alcoholic drinks are OK \u2014 ideally, pay for alcohol separately. We will explain the reimbursement process in detail after the School, but the tips above will help.","title":"Meal Reimbursement Tips"},{"location":"logistics/travel-advice/","text":"Travel Advice \u00b6 This page offers some tips for traveling to and from the OSG School. When travelling, you may experience delays, changes, or cancellations due to weather, mechanical issues, and so on. It is good to be prepared for last-minute changes. Below are some tips and ideas for dealing with travel. For health guidelines, before or during the event, please see our health guidelines page . Checking In Early \u00b6 Airlines generally allow you to check in for your flights the day before. Doing so may save you time and hassle at the airport. Go to your airline website and look for the \u201cCheck In\u201d section, then follow the steps. 
Finding Flight Status \u00b6 Be sure to check your flight status often, starting the day before travel begins. While you can check the status of each flight individually on the airline website (or a third-party site), you may be able to view your entire trip at once. Go to your airline website, find their section for \u201cMy Trips\u201d or something similar, and use the six-character \u201cConfirmation Number\u201d on your itinerary plus your last name to access your full itinerary, including flight status for each segment. Definitely check your flight status before leaving for the airport! If Your Arrival in Madison Is Delayed \u00b6 If your flights change and you will arrive in Madison later than planned, think about what effect that will have: If you will arrive before Sunday, 6 p.m. (or so), you should be fine. If there is time, you can still go to the hotel first; if it is after 5:30 p.m. (or so), it may be best to go straight to the Fluno Center for the welcome dinner at 6:30 p.m. If you will arrive on Sunday but after 6 p.m. (or so), you will miss the welcome dinner. Go straight to the hotel and check in, then find dinner on your own; we can reimburse you in this case. Try to let us know about the situation, when you can. If you will arrive later than Sunday, just do your best to get here. Try to let us know about your situation as soon as you can. We can help deal with things like the hotel and may be able to suggest travel options. If you need to make flight changes, see below. If Your Arrival Back Home Is Delayed \u00b6 If your flights back home are delayed, there is not as much that we can do. For example, it is not clear whether we can pay for changes on return flights. Contact your airline to find out how they will get you home. If You Must Make Flight Changes \u00b6 If one or more flights are cancelled, or if we approve flight changes and their fees in advance , you will need to make new plans with your airline. If you are at an airport, it is a good idea to get in line at your airline\u2019s service counter right away. Also, you can try calling their service number while waiting in line! For any change that requires extra payment, you must get our approval and make the change through Fox World Travel , UW\u2013Madison\u2019s only approved travel agency. If you pay for a change any other way, we cannot reimburse you. Fox World Travel phone number: +1 (844) 630-3853 Note: If you call Fox World Travel on the weekend or outside of 7am\u20137:30pm (Central), they will charge us $20 just for calling. So please use this option only when you must pay for approved flight changes. If there are significant changes to your travel plans, when you have time, please email us with your news or reach out to us on Slack.","title":"Travel advice"},{"location":"logistics/travel-advice/#travel-advice","text":"This page offers some tips for traveling to and from the OSG School. When travelling, you may experience delays, changes, or cancellations due to weather, mechanical issues, and so on. It is good to be prepared for last-minute changes. Below are some tips and ideas for dealing with travel. For health guidelines, before or during the event, please see our health guidelines page .","title":"Travel Advice"},{"location":"logistics/travel-advice/#checking-in-early","text":"Airlines generally allow you to check in for your flights the day before. Doing so may save you time and hassle at the airport. 
Go to your airline website and look for the \u201cCheck In\u201d section, then follow the steps.","title":"Checking In Early"},{"location":"logistics/travel-advice/#finding-flight-status","text":"Be sure to check your flight status often, starting the day before travel begins. While you can check the status of each flight individually on the airline website (or a third-party site), you may be able to view your entire trip at once. Go to your airline website, find their section for \u201cMy Trips\u201d or something similar, and use the six-character \u201cConfirmation Number\u201d on your itinerary plus your last name to access your full itinerary, including flight status for each segment. Definitely check your flight status before leaving for the airport!","title":"Finding Flight Status"},{"location":"logistics/travel-advice/#if-your-arrival-in-madison-is-delayed","text":"If your flights change and you will arrive in Madison later than planned, think about what effect that will have: If you will arrive before Sunday, 6 p.m. (or so), you should be fine. If there is time, you can still go to the hotel first; if it is after 5:30 p.m. (or so), it may be best to go straight to the Fluno Center for the welcome dinner at 6:30 p.m. If you will arrive on Sunday but after 6 p.m. (or so), you will miss the welcome dinner. Go straight to the hotel and check in, then find dinner on your own; we can reimburse you in this case. Try to let us know about the situation, when you can. If you will arrive later than Sunday, just do your best to get here. Try to let us know about your situation as soon as you can. We can help deal things like the hotel and may be able to suggest travel options. If you need to make flight changes, see below.","title":"If Your Arrival in Madison Is Delayed"},{"location":"logistics/travel-advice/#if-your-arrival-back-home-is-delayed","text":"If you flights back home are delayed, there is not as much that we can do. For example, it is not clear whether we can pay for changes on return flights. Contact your airline to find out how they will get you home.","title":"If Your Arrival Back Home Is Delayed"},{"location":"logistics/travel-advice/#if-you-must-make-flight-changes","text":"If one or more flights are cancelled, or if we approve flight changes and their fees in advance , you will need to make new plans with your airline. If you are at an airport, it is a good idea to get in line at your airline\u2019s service counter right away. Also, you can try calling their service number while waiting in line! For any change that requires extra payment, you must get our approval and make the change through Fox World Travel , UW\u2013Madison\u2019s only approved travel agency. If you pay for a change any other way, we cannot reimburse you. Fox World Travel phone number: +1 (844) 630-3853 Note: If you call Fox World Travel on the weekend or outside of 7am\u20137:30pm (Central), they will charge us $20 just for calling. So please use this option only when you must pay for approved flight changes. If there are significant changes to your travel plans, when you have time, please email us with your news or reach out to us on Slack.","title":"If You Must Make Flight Changes"},{"location":"logistics/travel-planning/","text":"Travel To and From Madison \u00b6 Please wait to begin making travel arrangements until we email you about it. We plan to email everyone about travel in early June, but are starting with a small group to find and fix issues. 
Whether we offered to pay your travel costs or not, please make sure that we get a copy of your travel plans so that we know when to expect you here and can plan accurately. (If we offered to pay for your hotel room, we will reserve your hotel room for you.) Find the numbered section below that applies to you: 1. We Offered to Pay for Your Travel \u00b6 We want to find reasonable and comfortable travel options for you. At the same time, we must stay within budget and follow University rules about arranging and paying for your travel costs. Let\u2019s work together to find something that makes sense for everyone. Here are ideas that have helped some School travelers in past years: If you are near Madison, consider driving; we can reimburse mileage and tolls up to a point, plus parking. Or look into bus routes, especially from larger cities like Chicago. The buses are very comfortable, have wi-fi, and run frequently. If you fly, try to get flights to and from Madison (MSN) itself. In some cases, we may ask you to consider flying to Milwaukee (1\u00bd hours away) or Chicago (2\u00bd hours away), then taking a direct bus to Madison; we do this only when the costs or itinerary options to Madison are terrible. If you fly, be flexible about departure times \u2014 early and late flights are often the least expensive. We do not like very early or very late flights any more than you do, so we will work hard to find reasonable flight times. Note: Please try to complete your travel plans before about July 4th, when rates may go up. Travel by Airplane \u00b6 Do NOT buy your own airline tickets . University rules say that our travel agency, Travel Incorporated, must buy your tickets. Note: The University is changing travel agencies on 1 July 2024. Please try to complete air travel arrangements by Thursday, 27 June 2024. Use the following information to get air travel tickets: In the travel email that we sent you, click the link to Travel Incorporated\u2019s \u201cUWS Traveler Booking Form\u201d (on smartsheet.com); on that form: Group Number: Copy and paste this: UWMSN061523 Traveler Type: Select \u201cGuest\u201d Concur Profile? Select \u201cNo\u201d Destination Type: Select \u201cDomestic\u201d Will a rental be needed? Select \u201cNo\u201d \u2014 we cannot pay for a rental car Are Hotel Accommodations needed? Select \u201cNo\u201d \u2014 we will arrange your hotel room separately Guest Information: Please contact us first to bring guests We must review and approve some itineraries. Travel Inc can purchase tickets directly in many cases. But if the Travel Inc agent says that your trip must be reviewed, do not worry! It just means that we need to check the budget, options, and UW rules. We hope to approve your first choice, or we will work with you and Travel Inc to find another reasonable one. Common reasons for a trip needing review are: total trip cost over $800, travel starting and ending at different locations, and travel on dates other than August 4 and 10. Approval takes time, so it may take 1\u20132 days to get confirmation. Airplane tickets cannot be held without purchase over a weekend, so avoid contacting Travel Inc late on Fridays. Please be considerate of the Travel Inc agent(s) you work with. They work hard to find good options for you, but they must also follow our rules. If you feel that they are not providing the options that you want, you should email us . We will help resolve any issues. 
Do not argue with the Travel Inc agents, especially about options you find online \u2014 there are many reasons why that option might not be available to us. Travel by Bus \u00b6 For some nearby locations, or in addition to air travel to Chicago or Milwaukee, it may be helpful to take a bus to Madison. Bus companies that School travelers have used often in the past are: Van Galder Bus , especially from Chicago Badger Bus , especially from Milwaukee To get bus tickets, pick one method: Ask us to buy bus tickets for you in advance. This is the easiest option all around. Just email us at school@osg-htc.org ; include your desired travel dates (tickets are not specific by time), and start and end bus stations or stops. Buy bus tickets for yourself. You may purchase bus tickets yourself before or on the day of travel. If you purchase your own tickets, you must get approval from the School for the estimated cost first, then request reimbursement from us after the School. If you purchase your own tickets, save the original receipt (even if by email). It is best to have a detailed receipt (including your name, itinerary, date of purchase, and total amount paid), but a regular ticket stub (e.g., without your name or date) should work fine. Just get what you can! Be sure to email us with your bus plans, including: Transportation provider(s) (e.g., Van Galder bus) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison Actual or estimated cost (indicate which) Travel by Personal Car \u00b6 If you are driving to Madison, you will be reimbursed the mileage rate of $0.670 per mile for the shortest round-trip distance (as calculated by Google Maps), plus tolls. Also, we will pay for parking costs for the week at the hotel in Madison (but not elsewhere). We recommend keeping your receipts for tolls. Note: Due to the high mileage reimbursement rate, driving can be an expensive option! We reserve the right to limit your total driving reimbursement, so work with us on the details. To travel by personal car, please check with us first. We may search for comparable flight options, to make sure that driving is the least expensive method. Be sure to email us with your travel plans as soon as possible. Try to include: Departure date from home, location (for mileage calculation), and approximate time of arrival in Madison Departure date and approximate time from Madison, and return location (for mileage calculation) if different than above 2. We Are Not Paying for Your Travel \u00b6 If you are paying for your own travel or if someone else is paying for it, go ahead and make your travel arrangements now! Just remember to arrive on Sunday, August 4, before about 5:00 pm and depart on Saturday, August 10, or whatever dates we suggested directly to you. For other travel dates, check with us first, please! Be sure to email us with your travel plans as soon as possible. Try to include: Transportation provider(s) (e.g., airline) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison (e.g., airport, bus station, etc.)","title":"Travel planning"},{"location":"logistics/travel-planning/#travel-to-and-from-madison","text":"Please wait to begin making travel arrangements until we email you about it. We plan to email everyone about travel in early June, but are starting with a small group to find and fix issues. 
Whether we offered to pay your travel costs or not, please make sure that we get a copy of your travel plans so that we know when to expect you here and can plan accurately. (If we offered to pay for your hotel room, we will reserve your hotel room for you.) Find the numbered section below that applies to you:","title":"Travel To and From Madison"},{"location":"logistics/travel-planning/#1-we-offered-to-pay-for-your-travel","text":"We want to find reasonable and comfortable travel options for you. At the same time, we must stay within budget and follow University rules about arranging and paying for your travel costs. Let\u2019s work together to find something that makes sense for everyone. Here are ideas that have helped some School travelers in past years: If you are near Madison, consider driving; we can reimburse mileage and tolls up to a point, plus parking. Or look into bus routes, especially from larger cities like Chicago. The buses are very comfortable, have wi-fi, and run frequently. If you fly, try to get flights to and from Madison (MSN) itself. In some cases, we may ask you to consider flying to Milwaukee (1\u00bd hours away) or Chicago (2\u00bd hours away), then taking a direct bus to Madison; we do this only when the costs or itinerary options to Madison are terrible. If you fly, be flexible about departure times \u2014 early and late flights are often the least expensive. We do not like very early or very late flights any more than you do, so we will work hard to find reasonable flight times. Note: Please try to complete your travel plans before about July 4th, when rates may go up.","title":"1. We Offered to Pay for Your Travel"},{"location":"logistics/travel-planning/#travel-by-airplane","text":"Do NOT buy your own airline tickets . University rules say that our travel agency, Travel Incorporated, must buy your tickets. Note: The University is changing travel agencies on 1 July 2024. Please try to complete air travel arrangements by Thursday, 27 June 2024. Use the following information to get air travel tickets: In the travel email that we sent you, click the link to Travel Incorporated\u2019s \u201cUWS Traveler Booking Form\u201d (on smartsheet.com); on that form: Group Number: Copy and paste this: UWMSN061523 Traveler Type: Select \u201cGuest\u201d Concur Profile? Select \u201cNo\u201d Destination Type: Select \u201cDomestic\u201d Will a rental be needed? Select \u201cNo\u201d \u2014 we cannot pay for a rental car Are Hotel Accommodations needed? Select \u201cNo\u201d \u2014 we will arrange your hotel room separately Guest Information: Please contact us first to bring guests We must review and approve some itineraries. Travel Inc can purchase tickets directly in many cases. But if the Travel Inc agent says that your trip must be reviewed, do not worry! It just means that we need to check the budget, options, and UW rules. We hope to approve your first choice, or we will work with you and Travel Inc to find another reasonable one. Common reasons for a trip needing review are: total trip cost over $800, travel starting and ending at different locations, and travel on dates other than August 4 and 10. Approval takes time, so it may take 1\u20132 days to get confirmation. Airplane tickets cannot be held without purchase over a weekend, so avoid contacting Travel Inc late on Fridays. Please be considerate of the Travel Inc agent(s) you work with. They work hard to find good options for you, but they must also follow our rules. 
If you feel that they are not providing the options that you want, you should email us . We will help resolve any issues. Do not argue with the Travel Inc agents, especially about options you find online \u2014 there are many reasons why that option might not be available to us.","title":"Travel by Airplane"},{"location":"logistics/travel-planning/#travel-by-bus","text":"For some nearby locations, or in addition to air travel to Chicago or Milwaukee, it may be helpful to take a bus to Madison. Bus companies that School travelers have used often in the past are: Van Galder Bus , especially from Chicago Badger Bus , especially from Milwaukee To get bus tickets, pick one method: Ask us to buy bus tickets for you in advance. This is the easiest option all around. Just email us at school@osg-htc.org ; include your desired travel dates (tickets are not specific by time), and start and end bus stations or stops. Buy bus tickets for yourself. You may purchase bus tickets yourself before or on the day of travel. If you purchase your own tickets, you must get approval from the School for the estimated cost first, then request reimbursement from us after the School. If you purchase your own tickets, save the original receipt (even if by email). It is best to have a detailed receipt (including your name, itinerary, date of purchase, and total amount paid), but a regular ticket stub (e.g., without your name or date) should work fine. Just get what you can! Be sure to email us with your bus plans, including: Transportation provider(s) (e.g., Van Galder bus) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison Actual or estimated cost (indicate which)","title":"Travel by Bus"},{"location":"logistics/travel-planning/#travel-by-personal-car","text":"If you are driving to Madison, you will be reimbursed the mileage rate of $0.670 per mile for the shortest round-trip distance (as calculated by Google Maps), plus tolls. Also, we will pay for parking costs for the week at the hotel in Madison (but not elsewhere). We recommend keeping your receipts for tolls. Note: Due to the high mileage reimbursement rate, driving can be an expensive option! We reserve the right to limit your total driving reimbursement, so work with us on the details. To travel by personal car, please check with us first. We may search for comparable flight options, to make sure that driving is the least expensive method. Be sure to email us with your travel plans as soon as possible. Try to include: Departure date from home, location (for mileage calculation), and approximate time of arrival in Madison Departure date and approximate time from Madison, and return location (for mileage calculation) if different than above","title":"Travel by Personal Car"},{"location":"logistics/travel-planning/#2-we-are-not-paying-for-your-travel","text":"If you are paying for your own travel or if someone else is paying for it, go ahead and make your travel arrangements now! Just remember to arrive on Sunday, August 4, before about 5:00 pm and depart on Saturday, August 10, or whatever dates we suggested directly to you. For other travel dates, check with us first, please! Be sure to email us with your travel plans as soon as possible. Try to include: Transportation provider(s) (e.g., airline) Arrival date and approximate time Departure date and approximate time Arrival and departure location within Madison (e.g., airport, bus station, etc.)","title":"2. 
We Are Not Paying for Your Travel"},{"location":"logistics/visas/","text":"Documentation Requirements for Non-Resident Aliens \u00b6 This page is for Non-Resident Aliens only. If you are a United States citizen or permanent resident or member of the UW\u2013Madison community, this page does not apply to you. For the University of Wisconsin to pay for your travel, hotel, or meal expenses, we must have certain personal information from you. We collect as little information as possible and do not share it except with University staff who need it. Most of what we need comes from the online form you completed after accepting our invitation to attend. When you come to the School in Madison, we will need to look at and verify your travel documents. Please bring all travel documents to the School! See below for details. Tasks To Do Now \u00b6 Please check your passport and visa for travel in the United States now. Make sure that all documents are valid from now and until after the School ends. If any documents are expired or will expire before the end of the School: Tell us immediately, so that we can help you Begin the process for updating your documents immediately Do whatever you can to expedite the update process The University of Wisconsin cannot pay for or reimburse you for costs without valid travel documents. We have no control over this policy and there are no exceptions. If you are in the United States on a J-1 Scholar visa, there are extra steps needed to make the University and Federal government happy. If you have a J-1 visa and have not heard from us about it already, please email us immediately so that we can help. Documents to Bring to the School \u00b6 When you come to Madison, you must bring: Passport U.S. visa U.S. Customs and Border Protection form I-94 If you entered the U.S. before 30 April 2013, the I-94 should be stapled into your passport \u2014 do not remove it! If you entered the U.S. after 30 April 2013, the I-94 is stored electronically; you can request a copy to print from CBP If you are Canadian, you may use a second form of picture ID instead of the I-94 if you did not obtain an I-94. Additional forms specified in the table below: If you have this visa We will also need F-1 (Student) Form I-20 (original document, not a copy) J-1 (Visitor) Form DS-2019 (original document, not a copy) Visa Waiver Program Paper copy of ESTA Authorization Please bring all required information and documents to the School, especially on Tuesday, August 6. School staff will make copies of the documents and return them to you as quickly as possible. We will announce further details in class.","title":"Visa requirements"},{"location":"logistics/visas/#documentation-requirements-for-non-resident-aliens","text":"This page is for Non-Resident Aliens only. If you are a United States citizen or permanent resident or member of the UW\u2013Madison community, this page does not apply to you. For the University of Wisconsin to pay for your travel, hotel, or meal expenses, we must have certain personal information from you. We collect as little information as possible and do not share it except with University staff who need it. Most of what we need comes from the online form you completed after accepting our invitation to attend. When you come to the School in Madison, we will need to look at and verify your travel documents. Please bring all travel documents to the School! 
See below for details.","title":"Documentation Requirements for Non-Resident Aliens"},{"location":"logistics/visas/#tasks-to-do-now","text":"Please check your passport and visa for travel in the United States now. Make sure that all documents are valid from now and until after the School ends. If any documents are expired or will expire before the end of the School: Tell us immediately, so that we can help you Begin the process for updating your documents immediately Do whatever you can to expedite the update process The University of Wisconsin cannot pay for or reimburse you for costs without valid travel documents. We have no control over this policy and there are no exceptions. If you are in the United States on a J-1 Scholar visa, there are extra steps needed to make the University and Federal government happy. If you have a J-1 visa and have not heard from us about it already, please email us immediately so that we can help.","title":"Tasks To Do Now"},{"location":"logistics/visas/#documents-to-bring-to-the-school","text":"When you come to Madison, you must bring: Passport U.S. visa U.S. Customs and Border Protection form I-94 If you entered the U.S. before 30 April 2013, the I-94 should be stapled into your passport \u2014 do not remove it! If you entered the U.S. after 30 April 2013, the I-94 is stored electronically; you can request a copy to print from CBP If you are Canadian, you may use a second form of picture ID instead of the I-94 if you did not obtain an I-94. Additional forms specified in the table below: If you have this visa We will also need F-1 (Student) Form I-20 (original document, not a copy) J-1 (Visitor) Form DS-2019 (original document, not a copy) Visa Waiver Program Paper copy of ESTA Authorization Please bring all required information and documents to the School, especially on Tuesday, August 6. School staff will make copies of the documents and return them to you as quickly as possible. We will announce further details in class.","title":"Documents to Bring to the School"},{"location":"materials/","text":"OSG School Materials \u00b6 School Overview and Intro \u00b6 View the slides: [Slides coming soon] Intro to HTC and HTCondor Job Execution \u00b6 Intro to HTC Slides \u00b6 Intro to HTC: [Slides coming soon] Worksheet: [Slides coming soon] Intro to HTCondor Slides \u00b6 View the slides: pdf Intro Exercises 1: Running and Viewing Simple Jobs (Strongly Recommended) \u00b6 Exercise 1.1: Log in to the local submit machine and look around Exercise 1.2: Experiment with HTCondor commands Exercise 1.3: Run jobs! 
Exercise 1.4: Read and interpret log files Exercise 1.5: Determining Resource Needs Exercise 1.6: Remove jobs from the queue Bonus Exercises: Job Attributes and Handling \u00b6 Bonus Exercise 1.7: Compile and run some C code Bonus Exercise 1.8: Explore condor_q Bonus Exercise 1.9: Explore condor_status Intro to HTCondor Multiple Job Execution \u00b6 View the Slides: [Slides coming soon] Intro Exercises 2: Running Many HTC Jobs (Strongly Recommended) \u00b6 Exercise 2.1: Work with input and output files Exercise 2.2: Use queue N , $(Cluster) , and $(Process) Exercise 2.3: Use queue from with custom variables Bonus Exercise 2.4: Use queue matching with a custom variable OSG \u00b6 View the slides: [Slides coming soon] OSG Exercises: Comparing PATh and OSG (Strongly Recommended) \u00b6 Exercise 1.1: Log in to the OSPool Access Point Exercise 1.2: Running jobs in the OSPool Exercise 1.3: Hardware differences between PATh and OSG Exercise 1.4: Software differences in OSPool Troubleshooting \u00b6 Slides: [Slides coming soon] Troubleshooting Exercises: \u00b6 Exercise 1.1: Troubleshooting Jobs Exercise 1.2: Job Retry Software \u00b6 Slides: [Slides coming soon] Software Exercises 1: Exploring Containers \u00b6 Exercise 1.1: Run and Explore Apptainer Containers Exercise 1.2: Use Apptainer Containers in OSPool Jobs Exercise 1.3: Use Docker Containers in OSPool Jobs Exercise 1.4: Build, Test, and Deploy an Apptainer Container Exercise 1.5: Choose Software Options Software Exercises 2: Preparing Scripts \u00b6 Exercise 2.1: Build an HTC-Friendly Executable Software Exercises 3: Container Examples (Optional) \u00b6 Exercise 3.1: Create an Apptainer Definition Files Exercise 3.2: Build Your Own Docker Container Software Exercises 4: Exploring Compiled Software (Optional) \u00b6 Exercise 4.1: Download and Use Compiled Software Exercise 4.2: Use a Wrapper Script To Run Software Exercise 4.3: Using Arguments With Wrapper Scripts Software Exercises 5: Compiled Software Examples (Optional) \u00b6 Exercise 5.1: Compiling a Research Software Exercise 5.2: Compiling Python and Running Jobs Exercise 5.3: Using Conda Environments Exercise 5.4: Compiling and Running a Simple Code Data \u00b6 View the slides: [Slides coming soon] Data Exercises 1: HTCondor File Transfer (Strongly Recommended) \u00b6 Exercise 1.1: Understanding a job's data needs Exercise 1.2: transfer_input_files, transfer_output_files, and remaps Exercise 1.3: Splitting input Data Exercises 2: Using OSDF (Strongly Recommended) \u00b6 Exercise 2.1: OSDF for inputs Exercise 2.2: OSDF for outputs Scaling Up \u00b6 View the slides: [Slides coming soon] Scaling Up Exercises \u00b6 Exercise 1.1: Organizing HTC workloads Exercise 1.2: Investigating Job Attributes Exercise 1.3: Getting Job Information from Log Files Workflows with DAGMan \u00b6 View the slides: [Slides coming soon] DAGMan Exercises 1 \u00b6 Exercise 1.1: Coordinating set of jobs: A simple DAG Exercise 1.2: A brief detour through the Mandelbrot set Exercise 1.3: A more complex DAG Exercise 1.4: Handling jobs that fail with DAGMan Exercise 1.5: Workflow Challenges Extra Topics \u00b6 Self-checkpointing for long-running jobs \u00b6 View the slides: [Slides coming soon] Exercise 1.1: Trying out self-checkpointing Special Environments \u00b6 View the slides: [Slides coming soon] Special Environments Exercises 1 \u00b6 Exercise 1.1: GPUs Introduction to Research Computing Facilitation \u00b6 View the slides: [Slides coming soon] Final Talks \u00b6 Philosophy: [Slides coming soon] Final 
thoughts: [Slides coming soon]","title":"Overview"},{"location":"materials/#osg-school-materials","text":"","title":"OSG School Materials"},{"location":"materials/#school-overview-and-intro","text":"View the slides: [Slides coming soon]","title":"School Overview and Intro"},{"location":"materials/#intro-to-htc-and-htcondor-job-execution","text":"","title":"Intro to HTC and HTCondor Job Execution"},{"location":"materials/#intro-to-htc-slides","text":"Intro to HTC: [Slides coming soon] Worksheet: [Slides coming soon]","title":"Intro to HTC Slides"},{"location":"materials/#intro-to-htcondor-slides","text":"View the slides: pdf","title":"Intro to HTCondor Slides"},{"location":"materials/#intro-exercises-1-running-and-viewing-simple-jobs-strongly-recommended","text":"Exercise 1.1: Log in to the local submit machine and look around Exercise 1.2: Experiment with HTCondor commands Exercise 1.3: Run jobs! Exercise 1.4: Read and interpret log files Exercise 1.5: Determining Resource Needs Exercise 1.6: Remove jobs from the queue","title":"Intro Exercises 1: Running and Viewing Simple Jobs (Strongly Recommended)"},{"location":"materials/#bonus-exercises-job-attributes-and-handling","text":"Bonus Exercise 1.7: Compile and run some C code Bonus Exercise 1.8: Explore condor_q Bonus Exercise 1.9: Explore condor_status","title":"Bonus Exercises: Job Attributes and Handling"},{"location":"materials/#intro-to-htcondor-multiple-job-execution","text":"View the Slides: [Slides coming soon]","title":"Intro to HTCondor Multiple Job Execution"},{"location":"materials/#intro-exercises-2-running-many-htc-jobs-strongly-recommended","text":"Exercise 2.1: Work with input and output files Exercise 2.2: Use queue N , $(Cluster) , and $(Process) Exercise 2.3: Use queue from with custom variables Bonus Exercise 2.4: Use queue matching with a custom variable","title":"Intro Exercises 2: Running Many HTC Jobs (Strongly Recommended)"},{"location":"materials/#osg","text":"View the slides: [Slides coming soon]","title":"OSG"},{"location":"materials/#osg-exercises-comparing-path-and-osg-strongly-recommended","text":"Exercise 1.1: Log in to the OSPool Access Point Exercise 1.2: Running jobs in the OSPool Exercise 1.3: Hardware differences between PATh and OSG Exercise 1.4: Software differences in OSPool","title":"OSG Exercises: Comparing PATh and OSG (Strongly Recommended)"},{"location":"materials/#troubleshooting","text":"Slides: [Slides coming soon]","title":"Troubleshooting"},{"location":"materials/#troubleshooting-exercises","text":"Exercise 1.1: Troubleshooting Jobs Exercise 1.2: Job Retry","title":"Troubleshooting Exercises:"},{"location":"materials/#software","text":"Slides: [Slides coming soon]","title":"Software"},{"location":"materials/#software-exercises-1-exploring-containers","text":"Exercise 1.1: Run and Explore Apptainer Containers Exercise 1.2: Use Apptainer Containers in OSPool Jobs Exercise 1.3: Use Docker Containers in OSPool Jobs Exercise 1.4: Build, Test, and Deploy an Apptainer Container Exercise 1.5: Choose Software Options","title":"Software Exercises 1: Exploring Containers"},{"location":"materials/#software-exercises-2-preparing-scripts","text":"Exercise 2.1: Build an HTC-Friendly Executable","title":"Software Exercises 2: Preparing Scripts"},{"location":"materials/#software-exercises-3-container-examples-optional","text":"Exercise 3.1: Create an Apptainer Definition Files Exercise 3.2: Build Your Own Docker Container","title":"Software Exercises 3: Container Examples 
(Optional)"},{"location":"materials/#software-exercises-4-exploring-compiled-software-optional","text":"Exercise 4.1: Download and Use Compiled Software Exercise 4.2: Use a Wrapper Script To Run Software Exercise 4.3: Using Arguments With Wrapper Scripts","title":"Software Exercises 4: Exploring Compiled Software (Optional)"},{"location":"materials/#software-exercises-5-compiled-software-examples-optional","text":"Exercise 5.1: Compiling a Research Software Exercise 5.2: Compiling Python and Running Jobs Exercise 5.3: Using Conda Environments Exercise 5.4: Compiling and Running a Simple Code","title":"Software Exercises 5: Compiled Software Examples (Optional)"},{"location":"materials/#data","text":"View the slides: [Slides coming soon]","title":"Data"},{"location":"materials/#data-exercises-1-htcondor-file-transfer-strongly-recommended","text":"Exercise 1.1: Understanding a job's data needs Exercise 1.2: transfer_input_files, transfer_output_files, and remaps Exercise 1.3: Splitting input","title":"Data Exercises 1: HTCondor File Transfer (Strongly Recommended)"},{"location":"materials/#data-exercises-2-using-osdf-strongly-recommended","text":"Exercise 2.1: OSDF for inputs Exercise 2.2: OSDF for outputs","title":"Data Exercises 2: Using OSDF (Strongly Recommended)"},{"location":"materials/#scaling-up","text":"View the slides: [Slides coming soon]","title":"Scaling Up"},{"location":"materials/#scaling-up-exercises","text":"Exercise 1.1: Organizing HTC workloads Exercise 1.2: Investigating Job Attributes Exercise 1.3: Getting Job Information from Log Files","title":"Scaling Up Exercises"},{"location":"materials/#workflows-with-dagman","text":"View the slides: [Slides coming soon]","title":"Workflows with DAGMan"},{"location":"materials/#dagman-exercises-1","text":"Exercise 1.1: Coordinating set of jobs: A simple DAG Exercise 1.2: A brief detour through the Mandelbrot set Exercise 1.3: A more complex DAG Exercise 1.4: Handling jobs that fail with DAGMan Exercise 1.5: Workflow Challenges","title":"DAGMan Exercises 1"},{"location":"materials/#extra-topics","text":"","title":"Extra Topics"},{"location":"materials/#self-checkpointing-for-long-running-jobs","text":"View the slides: [Slides coming soon] Exercise 1.1: Trying out self-checkpointing","title":"Self-checkpointing for long-running jobs"},{"location":"materials/#special-environments","text":"View the slides: [Slides coming soon]","title":"Special Environments"},{"location":"materials/#special-environments-exercises-1","text":"Exercise 1.1: GPUs","title":"Special Environments Exercises 1"},{"location":"materials/#introduction-to-research-computing-facilitation","text":"View the slides: [Slides coming soon]","title":"Introduction to Research Computing Facilitation"},{"location":"materials/#final-talks","text":"Philosophy: [Slides coming soon] Final thoughts: [Slides coming soon]","title":"Final Talks"},{"location":"materials/checkpoint/part1-ex1-checkpointing/","text":"Self-Checkpointing Exercise 1.1: Trying It Out \u00b6 The goal of this exercise is to practice writing a submit file for self-checkpointing, and to see the process in action. Calculating Fibonacci numbers \u2026 slowly \u00b6 The sample code for this exercise calculates the Fibonacci number resulting from a given set of iterations. Because this is a trival computation, the code includes a delay in each iteration through the main loop; this simulates a more intensive computation. 
To get set up: Log in to ap40.uw.osg-htc.org ( ap1 is fine, too) Create and change into a new directory for this exercise Download the Python script that is the main executable for this exercise: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/fibonacci.py If you want to run the script directly, make it executable first: user@server $ chmod 0755 fibonacci.py Take a look at the code, if you like. It is not very elegant, but it gets the job done. A few notes: The script takes a single argument, the number of iterations to run. To minimize computing time while leaving time to explore, 10 is a good number of iterations. The script checkpoints every other iteration through the main loop. The exit status code for a checkpoint is 85. It prints some output to standard out along the way, to let you know what is going on. The final result is written to a separate file named fibonacci.result . This file does not exist until the very end of the complete run. It is safe to run from the command line on an access point: user@server $ ./fibonacci.py 10 If you run it, what happens? (Due to the 30-second delay, be patient.) Can you explain its behavior? What happens if you run it again, without changing any files in between? Why? Preparing to run \u00b6 Now you have an executable and you know how to run it. It is time to prepare it for submission to HTCondor! Using what you know about the script (above), and using information in the slides from today, try writing a submit file that runs this software and implements exit-driven self-checkpointing. The Python code itself is ready and should not need any changes. Just use a plain queue statement, one job is enough to experiment on. Before you submit, read the next section first! Running and monitoring \u00b6 With the 30-second delay per iteration in the code and the suggested 10 iterations, once the script starts running you have about 5 minutes of runtime in which to see what is going on. So it may help to read through this section and then return here and submit your job. If your job has problems or finishes before you have the chance to do all the steps below, just remove the extra files (besides the Python script and your submit file) and try again! Submission and first checkpoint \u00b6 Submit the job Look at the contents of the submit directory \u2014 what changed? Start watching the log file: tail -n 100 -f YOUR-LOG-FILENAME.log Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically. Thus, nothing much will happen until HTCondor starts running your job. When it does, you will see three sets of messages in the log file quickly: Started transferring input files Finished transferring input files Job executing on host: (Of course, each message will contain a lot of other characters!) Now wait about 1 minute, and you should see two more messages appear: Started transferring output files Finished transferring output files That is the first checkpoint happening! Forcing your job to stop running \u00b6 Now, assuming that your job is still running (check condor_q again), you can force HTCondor to remove ( evict ) your job before it finishes: Run condor_q to get the job ID of the running job Run condor_vacate_job JOB_ID , where you replace JOB_ID with your job ID from above Monitor the action again by running tail -n 100 -f YOUR-LOG-FILENAME.log Finishing the job and wrap-up \u00b6 Be patient again! You removed your running job, and so HTCondor put it back in the queue as idle. 
If you wait a minute or two, you should see that HTCondor starts running the job again. In the log file, look carefully for the two Job executing on host: messages. Does it seem like you ran on the same computer again or on a different one? Both are possible! Let your job finish running this time. There should be a Job terminated of its own accord message near the end. Did you get results? Go through all the files and see what they contain. The log and output files are probably the most interesting. But did you get a result file, too? Did the output file \u2014 that is, whatever file you named in the output line of your submit file \u2014 contain everything that you expected it to? Conclusion \u00b6 This has been a brief and simple tour of self-checkpointing. If you would like to learn more, please read the Self-Checkpointing Applications section of the HTCondor Manual. Or talk to School staff about it. Or contact support@osg-htc.org for further help at any time.","title":"1.1 - Trying out self-checkpointing"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#self-checkpointing-exercise-11-trying-it-out","text":"The goal of this exercise is to practice writing a submit file for self-checkpointing, and to see the process in action.","title":"Self-Checkpointing Exercise 1.1: Trying It Out"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#calculating-fibonacci-numbers-slowly","text":"The sample code for this exercise calculates the Fibonacci number resulting from a given set of iterations. Because this is a trival computation, the code includes a delay in each iteration through the main loop; this simulates a more intensive computation. To get set up: Log in to ap40.uw.osg-htc.org ( ap1 is fine, too) Create and change into a new directory for this exercise Download the Python script that is the main executable for this exercise: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/fibonacci.py If you want to run the script directly, make it executable first: user@server $ chmod 0755 fibonacci.py Take a look at the code, if you like. It is not very elegant, but it gets the job done. A few notes: The script takes a single argument, the number of iterations to run. To minimize computing time while leaving time to explore, 10 is a good number of iterations. The script checkpoints every other iteration through the main loop. The exit status code for a checkpoint is 85. It prints some output to standard out along the way, to let you know what is going on. The final result is written to a separate file named fibonacci.result . This file does not exist until the very end of the complete run. It is safe to run from the command line on an access point: user@server $ ./fibonacci.py 10 If you run it, what happens? (Due to the 30-second delay, be patient.) Can you explain its behavior? What happens if you run it again, without changing any files in between? Why?","title":"Calculating Fibonacci numbers … slowly"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#preparing-to-run","text":"Now you have an executable and you know how to run it. It is time to prepare it for submission to HTCondor! Using what you know about the script (above), and using information in the slides from today, try writing a submit file that runs this software and implements exit-driven self-checkpointing. The Python code itself is ready and should not need any changes. Just use a plain queue statement, one job is enough to experiment on. 
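For orientation, here is a minimal sketch of what such a submit file could look like. This is not the official solution: the output, error, and log filenames and the resource requests are illustrative assumptions, and the one line that matters for self-checkpointing is checkpoint_exit_code, set to match the script's checkpoint exit status of 85.

# Sketch of an exit-driven self-checkpointing submit file (filenames and requests are assumptions)
executable = fibonacci.py
arguments  = 10

# Exit code 85 tells HTCondor "a checkpoint was just written; save my files and restart me"
checkpoint_exit_code = 85

output = fibonacci.out
error  = fibonacci.err
log    = fibonacci.log

request_cpus   = 1
request_memory = 1GB
request_disk   = 1GB

queue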
Before you submit, read the next section first!","title":"Preparing to run"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#running-and-monitoring","text":"With the 30-second delay per iteration in the code and the suggested 10 iterations, once the script starts running you have about 5 minutes of runtime in which to see what is going on. So it may help to read through this section and then return here and submit your job. If your job has problems or finishes before you have the chance to do all the steps below, just remove the extra files (besides the Python script and your submit file) and try again!","title":"Running and monitoring"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#submission-and-first-checkpoint","text":"Submit the job Look at the contents of the submit directory \u2014 what changed? Start watching the log file: tail -n 100 -f YOUR-LOG-FILENAME.log Be patient! As HTCondor adds more lines to the end of your log file, they will appear automatically. Thus, nothing much will happen until HTCondor starts running your job. When it does, you will see three sets of messages in the log file quickly: Started transferring input files Finished transferring input files Job executing on host: (Of course, each message will contain a lot of other characters!) Now wait about 1 minute, and you should see two more messages appear: Started transferring output files Finished transferring output files That is the first checkpoint happening!","title":"Submission and first checkpoint"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#forcing-your-job-to-stop-running","text":"Now, assuming that your job is still running (check condor_q again), you can force HTCondor to remove ( evict ) your job before it finishes: Run condor_q to get the job ID of the running job Run condor_vacate_job JOB_ID , where you replace JOB_ID with your job ID from above Monitor the action again by running tail -n 100 -f YOUR-LOG-FILENAME.log","title":"Forcing your job to stop running"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#finishing-the-job-and-wrap-up","text":"Be patient again! You removed your running job, and so HTCondor put it back in the queue as idle. If you wait a minute or two, you should see that HTCondor starts running the job again. In the log file, look carefully for the two Job executing on host: messages. Does it seem like you ran on the same computer again or on a different one? Both are possible! Let your job finish running this time. There should be a Job terminated of its own accord message near the end. Did you get results? Go through all the files and see what they contain. The log and output files are probably the most interesting. But did you get a result file, too? Did the output file \u2014 that is, whatever file you named in the output line of your submit file \u2014 contain everything that you expected it to?","title":"Finishing the job and wrap-up"},{"location":"materials/checkpoint/part1-ex1-checkpointing/#conclusion","text":"This has been a brief and simple tour of self-checkpointing. If you would like to learn more, please read the Self-Checkpointing Applications section of the HTCondor Manual. Or talk to School staff about it. 
Or contact support@osg-htc.org for further help at any time.","title":"Conclusion"},{"location":"materials/data/part1-ex1-data-needs/","text":"Data Exercise 1.1: Understanding Data Requirements \u00b6 Exercise Goal \u00b6 This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools for delivering large data to jobs. In this exercise we will attempt to understand the input and output of the bioinformatics application BLAST . Setup \u00b6 For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Create a directory for this exercise named blast-data and change into it Copy the Input Files \u00b6 To run BLAST, we need the executable, input file, and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Copy the BLAST executables: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ncbi-blast-2.12.0+-x64-linux.tar.gz user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz Download these files to your current directory: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: user@ap40 $ tar -xzvf pdbaa.tar.gz Understanding BLAST \u00b6 Remember that blastx is executed in a command like the following: user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db -query -out In the above, the is the name of a file containing a number of genetic sequences (e.g. mouse.fa ), and the database that these are compared against is made up of several files that begin with the same , (e.g. pdbaa/pdbaa ). The output from this analysis will be printed to that is also indicated in the command. Calculating Data Needs \u00b6 Using the files that you prepared in blast-data , we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script, where the job: Transfers all of its input files (including the executable) as tarballs Untars the input files tarballs on the execute host Runs blastx using the untarred input files Here are some commands that will be useful for calculating your job's storage needs: List the size of a specific file: user@ap40 $ ls -lh List the sizes of all files in the current directory: user@ap40 $ ls -lh Sum the size of all files in a specific directory: user@ap40 $ du -sh Input requirements \u00b6 Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number. Also take note of how much total data is in the pdbaa directory. Compressed Files Remember, blastx reads the un-compressed pdbaa files. Output requirements \u00b6 The output that we care about from blastx is saved in the file whose name is indicated after the -out argument to blastx . Also, remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together, as well. Up next! \u00b6 Next you will create a HTCondor submit script to transfer the Blast input files in order to run Blast on a worker nodes. 
Next Exercise","title":"1.1 - Understanding a job's data needs"},{"location":"materials/data/part1-ex1-data-needs/#data-exercise-11-understanding-data-requirements","text":"","title":"Data Exercise 1.1: Understanding Data Requirements"},{"location":"materials/data/part1-ex1-data-needs/#exercise-goal","text":"This exercise's goal is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools for delivering large data to jobs. In this exercise we will attempt to understand the input and output of the bioinformatics application BLAST .","title":"Exercise Goal"},{"location":"materials/data/part1-ex1-data-needs/#setup","text":"For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Create a directory for this exercise named blast-data and change into it","title":"Setup"},{"location":"materials/data/part1-ex1-data-needs/#copy-the-input-files","text":"To run BLAST, we need the executable, input file, and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Copy the BLAST executables: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ncbi-blast-2.12.0+-x64-linux.tar.gz user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz Download these files to your current directory: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: user@ap40 $ tar -xzvf pdbaa.tar.gz","title":"Copy the Input Files"},{"location":"materials/data/part1-ex1-data-needs/#understanding-blast","text":"Remember that blastx is executed in a command like the following: user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db -query -out In the above, the is the name of a file containing a number of genetic sequences (e.g. mouse.fa ), and the database that these are compared against is made up of several files that begin with the same , (e.g. pdbaa/pdbaa ). The output from this analysis will be printed to that is also indicated in the command.","title":"Understanding BLAST"},{"location":"materials/data/part1-ex1-data-needs/#calculating-data-needs","text":"Using the files that you prepared in blast-data , we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script, where the job: Transfers all of its input files (including the executable) as tarballs Untars the input files tarballs on the execute host Runs blastx using the untarred input files Here are some commands that will be useful for calculating your job's storage needs: List the size of a specific file: user@ap40 $ ls -lh List the sizes of all files in the current directory: user@ap40 $ ls -lh Sum the size of all files in a specific directory: user@ap40 $ du -sh ","title":"Calculating Data Needs"},{"location":"materials/data/part1-ex1-data-needs/#input-requirements","text":"Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number. Also take note of how much total data is in the pdbaa directory. 
Compressed Files Remember, blastx reads the un-compressed pdbaa files.","title":"Input requirements"},{"location":"materials/data/part1-ex1-data-needs/#output-requirements","text":"The output that we care about from blastx is saved in the file whose name is indicated after the -out argument to blastx . Also, remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together, as well.","title":"Output requirements"},{"location":"materials/data/part1-ex1-data-needs/#up-next","text":"Next you will create a HTCondor submit script to transfer the Blast input files in order to run Blast on a worker nodes. Next Exercise","title":"Up next!"},{"location":"materials/data/part1-ex2-file-transfer/","text":"Data Exercise 1.2: transfer_input_files, transfer_output_files, and remaps \u00b6 Exercise Goal \u00b6 The objective of this exercise is to refresh yourself on HTCondor file transfer, to implement file compression, and to begin examining the memory and disk space used by your jobs in order to plan larger batches. We will also explore ways to deal with output data. Setup \u00b6 The executable we'll use in this exercise and later today is the same blastx executable from previous exercises. Log in to ap40: $ ssh @ap40.uw.osg-htc.org Then change into the blast-data folder that you created in the previous exercise. Review: HTCondor File Transfer \u00b6 Recall that OSG does NOT have a shared filesystem! Instead, HTCondor transfers your executable and input files (specified with the executable and transfer_input_files submit file directives, respectively) to a working directory on the execute node, regardless of how these files were arranged on the submit node. In this exercise we'll use the same blastx example job that we used previously, but modify the submit file and test how much memory and disk space it uses on the execute node. Start with a test submit file \u00b6 We've started a submit file for you, below, which you'll add to in the remaining steps. executable = transfer_input_files = output = test.out error = test.err log = test.log request_memory = request_disk = request_cpus = 1 requirements = (OSGVO_OS_STRING == \"RHEL 9\") queue Implement file compression \u00b6 In our first blast job from the Software exercises ( 1.1 ), the database files in the pdbaa directory were all transferred, as is, but we could instead transfer them as a single, compressed file using tar . For this version of the job, let's compress our blast database files to send them to the submit node as a single tar.gz file (otherwise known as a tarball), by following the below steps: Change into the pdbaa directory and compress the database files into a single file called pdbaa_files.tar.gz using the tar command. Note that this file will be different from the pdbaa.tar.gz file that you used earlier, because it will only contain the pdbaa files, and not the pdbaa directory, itself.) Remember, a typical command for creating a tar file is: user@ap40 $ tar -cvzf Replacing with the name of the tarball that you would like to create and with a space-separated list of files and/or directories that you want inside pdbaa_files.tar.gz. Move the resulting tarball to the blast-data directory. Create a wrapper script that will first decompress the pdbaa_files.tar.gz file, and then run blast. Because this file will now be our executable in the submit file, we'll also end up transferring the blastx executable with transfer_input_files . 
In the blast-data directory, create a new file, called blast_wrapper.sh , with the following contents: #!/bin/bash tar -xzvf pdbaa_files.tar.gz ./blastx -db pdbaa -query mouse.fa -out mouse.fa.result rm pdbaa.* Also remember to make the script executable: chmod +x blast_wrapper.sh Extra Files! The last line removes the resulting database files that came from pdbaa_files.tar.gz , as these files would otherwise be copied back to the submit server as perceived output since they're \"new\" files that HTCondor didn't transfer over as input. List the executable and input files \u00b6 Make sure to update the submit file with the following: Add the new executable (the wrapper script you created above) In transfer_input_files , list the blastx binary, the pdbaa_files.tar.gz file, and the input query file. Commas, commas everywhere! Remember that transfer_input_files accepts a comma separated list of files, and that you need to list the full location of the blastx executable ( blastx ). There will be no arguments, since the arguments to the blastx command are now captured in the wrapper script. Predict memory and disk requests from your data \u00b6 Also, think about how much memory and disk to request for this job. It's good to start with values that are a little higher than you think a test job will need, but think about: How much memory blastx would use if it loaded all of the database files and the query input file into memory. How much disk space will be necessary on the execute server for the executable, all input files, and all output files (hint: the log file only exists on the submit node). Whether you'd like to request some extra memory or disk space, just in case Look at the log file for your blastx job from Software exercise ( 1.1 ), and compare the memory and disk \"Usage\" to what you predicted from the files. Make sure to update the submit file with more accurate memory and disk requests. You may still want to request slightly more than the job actually used. Run the test job \u00b6 Once you have finished editing the submit file, go ahead and submit the job. It should take a few minutes to complete, and then you can check to make sure that no unwanted files (especially the pdbaa database files) were copied back at the end of the job. Run a du -sh on the directory with this job's input. How does it compare to the directory from Software exercise ( 1.1 ), and why? transfer_output_files \u00b6 So far, we have used HTCondor's new file detection to transfer back the newly created files. An alternative is to be explicit, using the transfer_output_files attribute in the submit file. The upside to this approach is that you can pick to only transfer back a subset of the created files. The downside is that you have to know which files are created. The first exercise is to modify the submit file from the previous example, and add a line like (remember, before the queue ): transfer_output_files = mouse.fa.result You may also remove the last line in the blast_wrapper.sh , the rm pdbaa.* as extra files are no longer an issue - those files will be ignored because we used transfer_output_files . Submit the job, and make sure everything works. Did you get any pdbaa.* files back? The next thing we should try is to see what happens if the file we specify does not exist. Modify your submit file, and change the transfer_output_files to: transfer_output_files = elephant.fa.result Submit the job and see how it behaves. Did it finish successfully? 
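Pulling these steps together, a filled-in version of the test submit file might look roughly like the sketch below, using the working transfer_output_files line. The memory and disk requests are placeholder guesses to replace with your own predictions from the log file, and the path to the blastx binary is an assumption — adjust it to wherever the binary lives in your blast-data directory (for example, ncbi-blast-2.12.0+/bin/blastx):

# Sketch only: the blastx path and the resource requests are assumptions
executable = blast_wrapper.sh
transfer_input_files = blastx, pdbaa_files.tar.gz, mouse.fa
transfer_output_files = mouse.fa.result

output = test.out
error  = test.err
log    = test.log

request_cpus   = 1
request_memory = 1GB
request_disk   = 3GB

requirements = (OSGVO_OS_STRING == \"RHEL 9\")

queue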
transfer_output_remaps \u00b6 Related to transfer_output_files is transfer_output_remaps , which allows us to rename outputs, or map the outputs to a different storage system (will be explored in the next module). The format of the transfer_output_remaps attribute is a list of remaps, each remap taking the form of src=dst . The destination can be a local path, or a URL. For example: transfer_output_remaps = \"myresults.dat = s3://destination-server.com/myresults.dat\" If you have more than one remap, you can separate them with ; By now, your blast-data directory is probably starting to look messy with a mix of submit files, input data, log file and output data all intermingled. One improvement could be to map our outputs to a separate directory. Create a new directory named science-results . Add a transfer_output_remaps line to the submit file. It is common to place this line right after the transfer_output_files line. Change the transfer_output_files back to mouse.fa.result . Example: transfer_output_files = mouse.fa.result transfer_output_remaps = Fill out the remap line, mapping mouse.fa.result to the destination science-results/mouse.fa.result . Remember that the transfer_output_remaps value requires double quotes around it. Submit the job, and wait for it to complete. Are there any errors? Can you find mouse.fa.result? Conclusions \u00b6 In this exercise, you: Used your data requirements knowledge from the previous exercise to write a job. Executed the job on a remote worker node and took note of the data usage. Used transfer_input_files to transfer inputs Used transfer_output_files to transfer outputs Used transfer_output_remaps to map outputs to a different destination When you've completed the above, continue with the next exercise .","title":"1.2 - transfer_input_files, transfer_output_files, and remaps"},{"location":"materials/data/part1-ex2-file-transfer/#data-exercise-12-transfer_input_files-transfer_output_files-and-remaps","text":"","title":"Data Exercise 1.2: transfer_input_files, transfer_output_files, and remaps"},{"location":"materials/data/part1-ex2-file-transfer/#exercise-goal","text":"The objective of this exercise is to refresh yourself on HTCondor file transfer, to implement file compression, and to begin examining the memory and disk space used by your jobs in order to plan larger batches. We will also explore ways to deal with output data.","title":"Exercise Goal"},{"location":"materials/data/part1-ex2-file-transfer/#setup","text":"The executable we'll use in this exercise and later today is the same blastx executable from previous exercises. Log in to ap40: $ ssh @ap40.uw.osg-htc.org Then change into the blast-data folder that you created in the previous exercise.","title":"Setup"},{"location":"materials/data/part1-ex2-file-transfer/#review-htcondor-file-transfer","text":"Recall that OSG does NOT have a shared filesystem! Instead, HTCondor transfers your executable and input files (specified with the executable and transfer_input_files submit file directives, respectively) to a working directory on the execute node, regardless of how these files were arranged on the submit node. 
In this exercise we'll use the same blastx example job that we used previously, but modify the submit file and test how much memory and disk space it uses on the execute node.","title":"Review: HTCondor File Transfer"},{"location":"materials/data/part1-ex2-file-transfer/#start-with-a-test-submit-file","text":"We've started a submit file for you, below, which you'll add to in the remaining steps. executable = transfer_input_files = output = test.out error = test.err log = test.log request_memory = request_disk = request_cpus = 1 requirements = (OSGVO_OS_STRING == \"RHEL 9\") queue","title":"Start with a test submit file"},{"location":"materials/data/part1-ex2-file-transfer/#implement-file-compression","text":"In our first blast job from the Software exercises ( 1.1 ), the database files in the pdbaa directory were all transferred, as is, but we could instead transfer them as a single, compressed file using tar . For this version of the job, let's compress our blast database files to send them to the submit node as a single tar.gz file (otherwise known as a tarball), by following the below steps: Change into the pdbaa directory and compress the database files into a single file called pdbaa_files.tar.gz using the tar command. Note that this file will be different from the pdbaa.tar.gz file that you used earlier, because it will only contain the pdbaa files, and not the pdbaa directory, itself.) Remember, a typical command for creating a tar file is: user@ap40 $ tar -cvzf Replacing with the name of the tarball that you would like to create and with a space-separated list of files and/or directories that you want inside pdbaa_files.tar.gz. Move the resulting tarball to the blast-data directory. Create a wrapper script that will first decompress the pdbaa_files.tar.gz file, and then run blast. Because this file will now be our executable in the submit file, we'll also end up transferring the blastx executable with transfer_input_files . In the blast-data directory, create a new file, called blast_wrapper.sh , with the following contents: #!/bin/bash tar -xzvf pdbaa_files.tar.gz ./blastx -db pdbaa -query mouse.fa -out mouse.fa.result rm pdbaa.* Also remember to make the script executable: chmod +x blast_wrapper.sh Extra Files! The last line removes the resulting database files that came from pdbaa_files.tar.gz , as these files would otherwise be copied back to the submit server as perceived output since they're \"new\" files that HTCondor didn't transfer over as input.","title":"Implement file compression"},{"location":"materials/data/part1-ex2-file-transfer/#list-the-executable-and-input-files","text":"Make sure to update the submit file with the following: Add the new executable (the wrapper script you created above) In transfer_input_files , list the blastx binary, the pdbaa_files.tar.gz file, and the input query file. Commas, commas everywhere! Remember that transfer_input_files accepts a comma separated list of files, and that you need to list the full location of the blastx executable ( blastx ). There will be no arguments, since the arguments to the blastx command are now captured in the wrapper script.","title":"List the executable and input files"},{"location":"materials/data/part1-ex2-file-transfer/#predict-memory-and-disk-requests-from-your-data","text":"Also, think about how much memory and disk to request for this job. 
It's good to start with values that are a little higher than you think a test job will need, but think about: How much memory blastx would use if it loaded all of the database files and the query input file into memory. How much disk space will be necessary on the execute server for the executable, all input files, and all output files (hint: the log file only exists on the submit node). Whether you'd like to request some extra memory or disk space, just in case Look at the log file for your blastx job from Software exercise ( 1.1 ), and compare the memory and disk \"Usage\" to what you predicted from the files. Make sure to update the submit file with more accurate memory and disk requests. You may still want to request slightly more than the job actually used.","title":"Predict memory and disk requests from your data"},{"location":"materials/data/part1-ex2-file-transfer/#run-the-test-job","text":"Once you have finished editing the submit file, go ahead and submit the job. It should take a few minutes to complete, and then you can check to make sure that no unwanted files (especially the pdbaa database files) were copied back at the end of the job. Run a du -sh on the directory with this job's input. How does it compare to the directory from Software exercise ( 1.1 ), and why?","title":"Run the test job"},{"location":"materials/data/part1-ex2-file-transfer/#transfer_output_files","text":"So far, we have used HTCondor's new file detection to transfer back the newly created files. An alternative is to be explicit, using the transfer_output_files attribute in the submit file. The upside to this approach is that you can pick to only transfer back a subset of the created files. The downside is that you have to know which files are created. The first exercise is to modify the submit file from the previous example, and add a line like (remember, before the queue ): transfer_output_files = mouse.fa.result You may also remove the last line in the blast_wrapper.sh , the rm pdbaa.* as extra files are no longer an issue - those files will be ignored because we used transfer_output_files . Submit the job, and make sure everything works. Did you get any pdbaa.* files back? The next thing we should try is to see what happens if the file we specify does not exist. Modify your submit file, and change the transfer_output_files to: transfer_output_files = elephant.fa.result Submit the job and see how it behaves. Did it finish successfully?","title":"transfer_output_files"},{"location":"materials/data/part1-ex2-file-transfer/#transfer_output_remaps","text":"Related to transfer_output_files is transfer_output_remaps , which allows us to rename outputs, or map the outputs to a different storage system (will be explored in the next module). The format of the transfer_output_remaps attribute is a list of remaps, each remap taking the form of src=dst . The destination can be a local path, or a URL. For example: transfer_output_remaps = \"myresults.dat = s3://destination-server.com/myresults.dat\" If you have more than one remap, you can separate them with ; By now, your blast-data directory is probably starting to look messy with a mix of submit files, input data, log file and output data all intermingled. One improvement could be to map our outputs to a separate directory. Create a new directory named science-results . Add a transfer_output_remaps line to the submit file. It is common to place this line right after the transfer_output_files line. Change the transfer_output_files back to mouse.fa.result . 
Example: transfer_output_files = mouse.fa.result transfer_output_remaps = Fill out the remap line, mapping mouse.fa.result to the destination science-results/mouse.fa.result . Remember that the transfer_output_remaps value requires double quotes around it. Submit the job, and wait for it to complete. Are there any errors? Can you find mouse.fa.result?","title":"transfer_output_remaps"},{"location":"materials/data/part1-ex2-file-transfer/#conclusions","text":"In this exercise, you: Used your data requirements knowledge from the previous exercise to write a job. Executed the job on a remote worker node and took note of the data usage. Used transfer_input_files to transfer inputs Used transfer_output_files to transfer outputs Used transfer_output_remaps to map outputs to a different destination When you've completed the above, continue with the next exercise .","title":"Conclusions"},{"location":"materials/data/part1-ex3-blast-split/","text":"Data Exercise 1.3: Splitting Large Input for Better Throughput \u00b6 The objective of this exercise is to prepare for blasting a much larger input query file by splitting the input for greater throughput and lower memory and disk requirements. Splitting the input will also mean that we don't have to rely on additional large-data measures for the input query files. Setup \u00b6 Log in to ap40.uw.osg-htc.org Create a directory for this exercise named blast-split and change into it. Copy over the following files from the previous exercise : Your submit file blastx pdbaa_files.tar.gz blast_wrapper.sh Remember to modify the submit file for the new locations of the above files. Obtain the large input \u00b6 We've previously used blastx to analyze a relatively small input file of test data, mouse.fa , but let's imagine that you now need to blast a much larger dataset for your research. This dataset can be downloaded with the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse_rna.tar.gz After un-tar'ing ( tar xzf mouse_rna.tar.gz ) the file, you should be able to confirm that it's size is roughly 100 MB. Not only is this near the size cutoff for HTCondor file transfer, it would take hours to complete a single blastx analysis for it and the resulting output file would be huge. Split the input file \u00b6 For blast , it's scientifically valid to split up the input query file, analyze the pieces, and then put the results back together at the end! On the other hand, BLAST databases should not be split, because the blast output includes a score value for each sequence that is calculated relative to the entire length of the database. Because genetic sequence data is used heavily across the life sciences, there are also tools for splitting up the data into smaller files. One of these is called genome tools , and you can download a package of precompiled binaries (just like BLAST) using the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/gt-1.5.10-Linux_x86_64-64bit-complete.tar.gz Un-tar the gt package ( tar -xzvf ... ), then run its sequence file splitter as follows, with the target file size of 1MB: user@ap40 $ ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 1 mouse_rna.fa You'll notice that the result is a set of 100 files, all about the size of 1 MB, and numbered 1 through 100. Run a Jobs on Split Input \u00b6 Now, you'll submit jobs on the split input files, where each job will use a different piece of the large original input file. 
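Before modifying the submit file, it can help to sanity-check the split from the previous step by counting the pieces; for example: user@ap40 $ ls mouse_rna.fa.* | wc -l This should report 100, matching the number of files produced by gt splitfasta (as long as no similarly named output files have been created yet).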
Modify the submit file \u00b6 First, you'll create a new submit file that passes the input filename as an argument and use a list of applicable filenames. Follow the below steps: Copy the submit file from the previous exercise to a new file called blast_split.sub and modify the \"queue\" line of the submit file to the following: queue inputfile matching mouse_rna.fa.* Replace the mouse.fa instances in the submit file with $(inputfile) , and rename the output, log, and error files to use the same inputfile variable: output = $(inputfile).out error = $(inputfile).err log = $(inputfile).log Add an arguments line to the submit file so it will pass the name of the input file to the wrapper script arguments = $(inputfile) Add the $(inputfile) to the end of your list of transfer_input_files : transfer_input_files = ... , $(inputfile) Remove or comment out transfer_output_files and transfer_output_remaps . Update the memory and disk requests, since the new input file is larger and will also produce larger output. It may be best to overestimate to something like 1 GB for each. Modify the wrapper file \u00b6 Replace instances of the input file name in the blast_wrapper.sh script so that it will insert the first argument in place of the input filename, like so: ./blastx -db pdbaa -query $1 -out $1.result Note Bash shell scripts will use the first argument in place of $1 , the second argument as $2 , etc. Submit the jobs \u00b6 This job will take a bit longer than the job in the last exercise, since the input file is larger (by about 3-fold). Again, make sure that only the desired output , error , and result files come back at the end of the job. In our tests, the jobs ran for ~15 minutes. Jobs on jobs! Be careful to not submit the job again. Why? Our queue statement says ... matching mouse_rna.fa.* , and look at the current directory. There are new files named mouse_rna.fa.X.log and other files. Submitting again, the queue statement would see these new files, and try to run blast on them! If you want to remove all of the extra files, you can try: user@ap40 $ rm *.err *.log *.out *.result Update the resource requests \u00b6 After the job finishes successfully, examine the log file for memory and disk usage, and update the requests in the submit file.","title":"1.3- Splitting input"},{"location":"materials/data/part1-ex3-blast-split/#data-exercise-13-splitting-large-input-for-better-throughput","text":"The objective of this exercise is to prepare for blasting a much larger input query file by splitting the input for greater throughput and lower memory and disk requirements. Splitting the input will also mean that we don't have to rely on additional large-data measures for the input query files.","title":"Data Exercise 1.3: Splitting Large Input for Better Throughput"},{"location":"materials/data/part1-ex3-blast-split/#setup","text":"Log in to ap40.uw.osg-htc.org Create a directory for this exercise named blast-split and change into it. Copy over the following files from the previous exercise : Your submit file blastx pdbaa_files.tar.gz blast_wrapper.sh Remember to modify the submit file for the new locations of the above files.","title":"Setup"},{"location":"materials/data/part1-ex3-blast-split/#obtain-the-large-input","text":"We've previously used blastx to analyze a relatively small input file of test data, mouse.fa , but let's imagine that you now need to blast a much larger dataset for your research. 
This dataset can be downloaded with the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse_rna.tar.gz After un-tar'ing ( tar xzf mouse_rna.tar.gz ) the file, you should be able to confirm that it's size is roughly 100 MB. Not only is this near the size cutoff for HTCondor file transfer, it would take hours to complete a single blastx analysis for it and the resulting output file would be huge.","title":"Obtain the large input"},{"location":"materials/data/part1-ex3-blast-split/#split-the-input-file","text":"For blast , it's scientifically valid to split up the input query file, analyze the pieces, and then put the results back together at the end! On the other hand, BLAST databases should not be split, because the blast output includes a score value for each sequence that is calculated relative to the entire length of the database. Because genetic sequence data is used heavily across the life sciences, there are also tools for splitting up the data into smaller files. One of these is called genome tools , and you can download a package of precompiled binaries (just like BLAST) using the following command: user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/gt-1.5.10-Linux_x86_64-64bit-complete.tar.gz Un-tar the gt package ( tar -xzvf ... ), then run its sequence file splitter as follows, with the target file size of 1MB: user@ap40 $ ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 1 mouse_rna.fa You'll notice that the result is a set of 100 files, all about the size of 1 MB, and numbered 1 through 100.","title":"Split the input file"},{"location":"materials/data/part1-ex3-blast-split/#run-a-jobs-on-split-input","text":"Now, you'll submit jobs on the split input files, where each job will use a different piece of the large original input file.","title":"Run a Jobs on Split Input"},{"location":"materials/data/part1-ex3-blast-split/#modify-the-submit-file","text":"First, you'll create a new submit file that passes the input filename as an argument and use a list of applicable filenames. Follow the below steps: Copy the submit file from the previous exercise to a new file called blast_split.sub and modify the \"queue\" line of the submit file to the following: queue inputfile matching mouse_rna.fa.* Replace the mouse.fa instances in the submit file with $(inputfile) , and rename the output, log, and error files to use the same inputfile variable: output = $(inputfile).out error = $(inputfile).err log = $(inputfile).log Add an arguments line to the submit file so it will pass the name of the input file to the wrapper script arguments = $(inputfile) Add the $(inputfile) to the end of your list of transfer_input_files : transfer_input_files = ... , $(inputfile) Remove or comment out transfer_output_files and transfer_output_remaps . Update the memory and disk requests, since the new input file is larger and will also produce larger output. 
It may be best to overestimate to something like 1 GB for each.","title":"Modify the submit file"},{"location":"materials/data/part1-ex3-blast-split/#modify-the-wrapper-file","text":"Replace instances of the input file name in the blast_wrapper.sh script so that it will insert the first argument in place of the input filename, like so: ./blastx -db pdbaa -query $1 -out $1.result Note Bash shell scripts will use the first argument in place of $1 , the second argument as $2 , etc.","title":"Modify the wrapper file"},{"location":"materials/data/part1-ex3-blast-split/#submit-the-jobs","text":"This job will take a bit longer than the job in the last exercise, since the input file is larger (by about 3-fold). Again, make sure that only the desired output , error , and result files come back at the end of the job. In our tests, the jobs ran for ~15 minutes. Jobs on jobs! Be careful to not submit the job again. Why? Our queue statement says ... matching mouse_rna.fa.* , and look at the current directory. There are new files named mouse_rna.fa.X.log and other files. Submitting again, the queue statement would see these new files, and try to run blast on them! If you want to remove all of the extra files, you can try: user@ap40 $ rm *.err *.log *.out *.result","title":"Submit the jobs"},{"location":"materials/data/part1-ex3-blast-split/#update-the-resource-requests","text":"After the job finishes successfully, examine the log file for memory and disk usage, and update the requests in the submit file.","title":"Update the resource requests"},{"location":"materials/data/part2-ex1-osdf-inputs/","text":"Data Exercise 2.1: Using OSDF for Large Shared Data \u00b6 This exercise will use a BLAST workflow to demonstrate the functionality of OSDF for transferring input files to jobs on OSG. Because our individual blast jobs from previous exercises would take a bit longer with a larger database (too long for an workable exercise), we'll imagine for this exercise that our pdbaa_files.tar.gz file is too large for transfer_input_files (larger than ~1 GB). For this exercise, we will use the same inputs, but instead of using transfer_input_files for the pdbaa database, we will place it in OSDF and have the jobs download from there. OSDF is connected to a distributed set of caches spread across the U.S. They are connected with high bandwidth connections to each other, and to the data origin servers, where your data is originally placed. Setup \u00b6 Make sure you're logged in to ap40.uw.osg-htc.org Copy the following files from the previous Blast exercises to a new directory in /home/ called osdf-shared : blast_wrapper.sh blastx mouse_rna.fa.1 mouse_rna.fa.2 mouse_rna.fa.3 Your most recent submit file (probably named blast_split.sub ) Place the Database in OSDF \u00b6 Copy to your data to the OSDF space \u00b6 OSDF provides a directory for you to store data which can be accessed through the caching servers. First, you need to move your BLAST database ( pdbaa_files.tar.gz ) into this directory. For ap40.uw.osg-htc.org , the directory to use is /ospool/ap40/data/[USERNAME]/ Note that files placed in the /ospool/ap40/data/[USERNAME]/ directory will only be accessible by your own jobs. Modify the Submit File and Wrapper \u00b6 You will have to modify the wrapper and submit file to use OSDF: HTCondor knows how to do OSDF transfers, so you just have to provide the correct URL in transfer_input_files . 
Note there is no servername (3 slashes in :///); instead, the path is based only on the namespace ( /ospool/ap40 in this case): transfer_input_files = blastx, $(inputfile), osdf:///ospool/ap40/data/[USERNAME]/pdbaa_files.tar.gz Confirm that your queue statement is correct for the current directory. It should be something like: queue inputfile matching mouse_rna.fa.* Also confirm that mouse_rna.fa.* files exist in the current directory (you should have copied a few of them from the previous exercise directory). Submit the Job \u00b6 Now submit and monitor the job! If your 100 jobs from the previous exercise haven't started running yet, this job will not yet start. However, after it has been running for ~2 minutes, you're safe to continue to the next exercise! Considerations \u00b6 Why did we not place all files in OSDF (for example, blastx and mouse_rna.fa.* )? What do you think will happen if you make changes to pdbaa_files.tar.gz ? Will the caches be updated automatically, or is there a possibility that the old version of pdbaa_files.tar.gz will be served up to jobs? What is the solution to this problem? (Hint: OSDF only considers the filename when caching data) Note: Keeping OSDF 'Clean' \u00b6 Just as for any data directory, it is VERY important to remove old files from OSDF when you no longer need them, especially so that you'll have plenty of space for such files in the future. For example, you would delete ( rm ) files from /ospool/ap40/data/[USERNAME]/ when you don't need them there anymore, but only after all jobs have finished. The next time you use OSDF after the school, remember to first check for old files that you can delete. Next exercise \u00b6 Once completed, move on to the next exercise: Using OSDF for outputs","title":"2.1 - OSDF for inputs"},{"location":"materials/data/part2-ex1-osdf-inputs/#data-exercise-21-using-osdf-for-large-shared-data","text":"This exercise will use a BLAST workflow to demonstrate the functionality of OSDF for transferring input files to jobs on OSG. Because our individual blast jobs from previous exercises would take a bit longer with a larger database (too long for a workable exercise), we'll imagine for this exercise that our pdbaa_files.tar.gz file is too large for transfer_input_files (larger than ~1 GB). For this exercise, we will use the same inputs, but instead of using transfer_input_files for the pdbaa database, we will place it in OSDF and have the jobs download from there. OSDF is connected to a distributed set of caches spread across the U.S. They are connected with high bandwidth connections to each other, and to the data origin servers, where your data is originally placed.","title":"Data Exercise 2.1: Using OSDF for Large Shared Data"},{"location":"materials/data/part2-ex1-osdf-inputs/#setup","text":"Make sure you're logged in to ap40.uw.osg-htc.org Copy the following files from the previous Blast exercises to a new directory in /home/ called osdf-shared : blast_wrapper.sh blastx mouse_rna.fa.1 mouse_rna.fa.2 mouse_rna.fa.3 Your most recent submit file (probably named blast_split.sub )","title":"Setup"},{"location":"materials/data/part2-ex1-osdf-inputs/#place-the-database-in-osdf","text":"","title":"Place the Database in OSDF"},{"location":"materials/data/part2-ex1-osdf-inputs/#copy-to-your-data-to-the-osdf-space","text":"OSDF provides a directory for you to store data which can be accessed through the caching servers. First, you need to move your BLAST database ( pdbaa_files.tar.gz ) into this directory.
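The copy itself is an ordinary cp into that directory; a minimal sketch, assuming pdbaa_files.tar.gz is still in your blast-split directory from the earlier exercise and using the data path spelled out next: user@ap40 $ cp ~/blast-split/pdbaa_files.tar.gz /ospool/ap40/data/[USERNAME]/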
For ap40.uw.osg-htc.org , the directory to use is /ospool/ap40/data/[USERNAME]/ Note that files placed in the /ospool/ap40/data/[USERNAME]/ directory will only be accessible by your own jobs.","title":"Copy your data to the OSDF space"},{"location":"materials/data/part2-ex1-osdf-inputs/#modify-the-submit-file-and-wrapper","text":"You will have to modify the wrapper and submit file to use OSDF: HTCondor knows how to do OSDF transfers, so you just have to provide the correct URL in transfer_input_files . Note there is no servername (3 slashes in :///); instead, the path is based only on the namespace ( /ospool/ap40 in this case): transfer_input_files = blastx, $(inputfile), osdf:///ospool/ap40/data/[USERNAME]/pdbaa_files.tar.gz Confirm that your queue statement is correct for the current directory. It should be something like: queue inputfile matching mouse_rna.fa.* Also confirm that mouse_rna.fa.* files exist in the current directory (you should have copied a few of them from the previous exercise directory).","title":"Modify the Submit File and Wrapper"},{"location":"materials/data/part2-ex1-osdf-inputs/#submit-the-job","text":"Now submit and monitor the job! If your 100 jobs from the previous exercise haven't started running yet, this job will not yet start. However, after it has been running for ~2 minutes, you're safe to continue to the next exercise!","title":"Submit the Job"},{"location":"materials/data/part2-ex1-osdf-inputs/#considerations","text":"Why did we not place all files in OSDF (for example, blastx and mouse_rna.fa.* )? What do you think will happen if you make changes to pdbaa_files.tar.gz ? Will the caches be updated automatically, or is there a possibility that the old version of pdbaa_files.tar.gz will be served up to jobs? What is the solution to this problem? (Hint: OSDF only considers the filename when caching data)","title":"Considerations"},{"location":"materials/data/part2-ex1-osdf-inputs/#note-keeping-osdf-clean","text":"Just as for any data directory, it is VERY important to remove old files from OSDF when you no longer need them, especially so that you'll have plenty of space for such files in the future. For example, you would delete ( rm ) files from /ospool/ap40/data/[USERNAME]/ when you don't need them there anymore, but only after all jobs have finished. The next time you use OSDF after the school, remember to first check for old files that you can delete.","title":"Note: Keeping OSDF 'Clean'"},{"location":"materials/data/part2-ex1-osdf-inputs/#next-exercise","text":"Once completed, move on to the next exercise: Using OSDF for outputs","title":"Next exercise"},{"location":"materials/data/part2-ex2-osdf-outputs/","text":"Data Exercise 2.2: Using OSDF for outputs \u00b6 In this exercise, we will run a multimedia program that converts and manipulates video files. In particular, we want to convert large .mov files to smaller (10-100s of MB) mp4 files. Just like the Blast database in the previous exercise , these video files are potentially too large to send to jobs using HTCondor's default file transfer for inputs/outputs, so we will use OSDF. Data \u00b6 To get the exercise set up: Log into ap40.uw.osg-htc.org Create a directory for this exercise named osdf-outputs and change into it.
Download the input data and store it under the OSDF directory ( cd to that directory first): user@ap40 $ cd /ospool/ap40/data/ [ USERNAME ] / user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ducks.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/teaching.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/test_open_terminal.mov We're going to need a list of these files later. Below is the final list of movie files. cd back to your osdf-outputs directory and create a file named movie_list.txt , with the following content: ducks.mov teaching.mov test_open_terminal.mov Software \u00b6 We'll be using a multi-purpose media tool called ffmpeg to convert video formats. The basic command to convert a file looks like this: user@ap40 $ ./ffmpeg -i input.mov output.mp4 In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting file is smaller. user@ap40 $ ./ffmpeg -i input.mp4 -b:v 400k -s 640x360 output.mp4 To get the ffmpeg binary do the following: We'll be downloading the ffmpeg pre-built static binary originally from this page: http://johnvansickle.com/ffmpeg/ . user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ffmpeg-release-64bit-static.tar.xz Once the binary is downloaded, un-tar it, and then copy the main ffmpeg program into your current directory: user@ap40 $ tar -xf ffmpeg-release-64bit-static.tar.xz user@ap40 $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./ Script \u00b6 We want to write a script that runs on the worker node that uses ffmpeg to convert a .mov file to a smaller format. Our script will need to run the proper executable. Create a file called run_ffmpeg.sh , that does the steps described above. Use the name of the smallest .mov file in the ffmpeg command. An example of that script is below: #!/bin/bash ./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4 Ultimately we'll want to submit several jobs (one for each .mov file), but to start with, we'll run one job to make sure that everything works. Remember to chmod +x run_ffmpeg.sh to make the script executable. Submit File \u00b6 Create a submit file for this job, based on other submit files from the school. Things to consider: We'll be copying the video file into the job's working directory from OSDF, so make sure to request enough disk space for the input mov file and the output mp4 file. If you're aren't sure how much to request, ask a helper. Add the same requirements as the previous exercise: requirements = (OSGVO_OS_STRING == \"RHEL 9\") We need to transfer the ffmpeg program that we downloaded above, and the movie from OSDF: transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mov Transfer outputs via OSDF. This requires a transfer remap: transfer_output_files = test_open_terminal.mp4 transfer_output_remaps = \"test_open_terminal.mp4 = osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mp4\" Initial Job \u00b6 With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected: Check the OSDF directory. Did the output .mp4 file return? Check file sizes. How big is the returned .mp4 file? How does that compare to the original .mov input? If your job successfully returned the converted .mp4 file and did not transfer the .mov file to the submit server, and the .mp4 file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded to OSDF. 
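One quick way to make both of those checks is to list the relevant files in your OSDF directory with human-readable sizes; for example: user@ap40 $ ls -lh /ospool/ap40/data/[USERNAME]/test_open_terminal.* This should show the original .mov next to a noticeably smaller .mp4.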
Multiple jobs \u00b6 We wrote the name of the .mov file into our run_ffmpeg.sh executable script. To submit a set of jobs for all of our .mov files, what will we need to change in: The script? The submit file? Once you've thought about it, check your reasoning against the instructions below. Add an argument to your script \u00b6 Look at your run_ffmpeg.sh script. What values will change for every job? The input file will change with every job - and don't forget that the output file will too! Let's make them both into arguments. To add arguments to a bash script, we use the notation $1 for the first argument (our input file) and $2 for the second argument (our output file name). The final script should look like this: #!/bin/bash ./ffmpeg -i $1 -b:v 400k -s 640x360 $2 Modify your submit file \u00b6 We now need to tell each job what arguments to use. We will do this by adding an arguments line to our submit file. Because we'll only have the input file name, the \"output\" file name will be the input file name with the mp4 extension. That should look like this: arguments = $(mov) $(mov).mp4 Update the transfer_input_files to have $(mov) : transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/$(mov) Similarly, update the output/remap with $(mov).mp4 : transfer_output_files = $(mov).mp4 transfer_output_remaps = \"$(mov).mp4 = osdf:///ospool/ap40/data/[USERNAME]/$(mov).mp4\" To set these arguments, we will use the queue .. from syntax. In our submit file, we can then change our queue statement to: queue mov from movie_list.txt Once you've made these changes, try submitting all the jobs! Bonus \u00b6 If you wanted to set a different output file name, bitrate and/or size for each original movie, how could you modify: movie_list.txt Your submit file run_ffmpeg.sh to do so? Show hint Here's the changes you can make to the various files: movie_list.txt ducks.mov ducks.mp4 500k 1280x720 teaching.mov teaching.mp4 400k 320x180 test_open_terminal.mov terminal.mp4 600k 640x360 Submit file arguments = $(mov) $(mp4) $(bitrate) $(size) queue mov,mp4,bitrate,size from movie_list.txt run_ffmpeg.sh 1 2 #!/bin/bash ./ffmpeg -i $1 -b:v $3 -s $4 $2","title":"2.2 - OSDF for outputs"},{"location":"materials/data/part2-ex2-osdf-outputs/#data-exercise-22-using-osdf-for-outputs","text":"In this exercise, we will run a multimedia program that converts and manipulates video files. In particular, we want to convert large .mov files to smaller (10-100s of MB) mp4 files. Just like the Blast database in the previous exercise , these video files are potentially too large to send to jobs using HTCondor's default file transfer for inputs/outputs, so we will use OSDF.","title":"Data Exercise 2.2: Using OSDF for outputs"},{"location":"materials/data/part2-ex2-osdf-outputs/#data","text":"To get the exercise set up: Log into ap40.uw.osg-htc.org Create a directory for this exercise named osdf-outputs and change into it. Download the input data and store it under the OSDF directory ( cd to that directory first): user@ap40 $ cd /ospool/ap40/data/ [ USERNAME ] / user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ducks.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/teaching.mov user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/test_open_terminal.mov We're going to need a list of these files later. Below is the final list of movie files. 
cd back to your osdf-outputs directory and create a file named movie_list.txt , with the following content: ducks.mov teaching.mov test_open_terminal.mov","title":"Data"},{"location":"materials/data/part2-ex2-osdf-outputs/#software","text":"We'll be using a multi-purpose media tool called ffmpeg to convert video formats. The basic command to convert a file looks like this: user@ap40 $ ./ffmpeg -i input.mov output.mp4 In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting file is smaller. user@ap40 $ ./ffmpeg -i input.mp4 -b:v 400k -s 640x360 output.mp4 To get the ffmpeg binary do the following: We'll be downloading the ffmpeg pre-built static binary originally from this page: http://johnvansickle.com/ffmpeg/ . user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ffmpeg-release-64bit-static.tar.xz Once the binary is downloaded, un-tar it, and then copy the main ffmpeg program into your current directory: user@ap40 $ tar -xf ffmpeg-release-64bit-static.tar.xz user@ap40 $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./","title":"Software"},{"location":"materials/data/part2-ex2-osdf-outputs/#script","text":"We want to write a script that runs on the worker node that uses ffmpeg to convert a .mov file to a smaller format. Our script will need to run the proper executable. Create a file called run_ffmpeg.sh , that does the steps described above. Use the name of the smallest .mov file in the ffmpeg command. An example of that script is below: #!/bin/bash ./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4 Ultimately we'll want to submit several jobs (one for each .mov file), but to start with, we'll run one job to make sure that everything works. Remember to chmod +x run_ffmpeg.sh to make the script executable.","title":"Script"},{"location":"materials/data/part2-ex2-osdf-outputs/#submit-file","text":"Create a submit file for this job, based on other submit files from the school. Things to consider: We'll be copying the video file into the job's working directory from OSDF, so make sure to request enough disk space for the input mov file and the output mp4 file. If you're aren't sure how much to request, ask a helper. Add the same requirements as the previous exercise: requirements = (OSGVO_OS_STRING == \"RHEL 9\") We need to transfer the ffmpeg program that we downloaded above, and the movie from OSDF: transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mov Transfer outputs via OSDF. This requires a transfer remap: transfer_output_files = test_open_terminal.mp4 transfer_output_remaps = \"test_open_terminal.mp4 = osdf:///ospool/ap40/data/[USERNAME]/test_open_terminal.mp4\"","title":"Submit File"},{"location":"materials/data/part2-ex2-osdf-outputs/#initial-job","text":"With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected: Check the OSDF directory. Did the output .mp4 file return? Check file sizes. How big is the returned .mp4 file? How does that compare to the original .mov input? If your job successfully returned the converted .mp4 file and did not transfer the .mov file to the submit server, and the .mp4 file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded to OSDF.","title":"Initial Job"},{"location":"materials/data/part2-ex2-osdf-outputs/#multiple-jobs","text":"We wrote the name of the .mov file into our run_ffmpeg.sh executable script. 
To submit a set of jobs for all of our .mov files, what will we need to change in: The script? The submit file? Once you've thought about it, check your reasoning against the instructions below.","title":"Multiple jobs"},{"location":"materials/data/part2-ex2-osdf-outputs/#add-an-argument-to-your-script","text":"Look at your run_ffmpeg.sh script. What values will change for every job? The input file will change with every job - and don't forget that the output file will too! Let's make them both into arguments. To add arguments to a bash script, we use the notation $1 for the first argument (our input file) and $2 for the second argument (our output file name). The final script should look like this: #!/bin/bash ./ffmpeg -i $1 -b:v 400k -s 640x360 $2","title":"Add an argument to your script"},{"location":"materials/data/part2-ex2-osdf-outputs/#modify-your-submit-file","text":"We now need to tell each job what arguments to use. We will do this by adding an arguments line to our submit file. Because we'll only have the input file name, the \"output\" file name will be the input file name with the mp4 extension. That should look like this: arguments = $(mov) $(mov).mp4 Update the transfer_input_files to have $(mov) : transfer_input_files = ffmpeg, osdf:///ospool/ap40/data/[USERNAME]/$(mov) Similarly, update the output/remap with $(mov).mp4 : transfer_output_files = $(mov).mp4 transfer_output_remaps = \"$(mov).mp4 = osdf:///ospool/ap40/data/[USERNAME]/$(mov).mp4\" To set these arguments, we will use the queue .. from syntax. In our submit file, we can then change our queue statement to: queue mov from movie_list.txt Once you've made these changes, try submitting all the jobs!","title":"Modify your submit file"},{"location":"materials/data/part2-ex2-osdf-outputs/#bonus","text":"If you wanted to set a different output file name, bitrate and/or size for each original movie, how could you modify: movie_list.txt Your submit file run_ffmpeg.sh to do so? Show hint Here's the changes you can make to the various files: movie_list.txt ducks.mov ducks.mp4 500k 1280x720 teaching.mov teaching.mp4 400k 320x180 test_open_terminal.mov terminal.mp4 600k 640x360 Submit file arguments = $(mov) $(mp4) $(bitrate) $(size) queue mov,mp4,bitrate,size from movie_list.txt run_ffmpeg.sh 1 2 #!/bin/bash ./ffmpeg -i $1 -b:v $3 -s $4 $2","title":"Bonus"},{"location":"materials/htcondor/part1-ex1-login/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.1: Log In and Look Around \u00b6 Background \u00b6 There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. Durring the OSG School, you will practice on two different HTC systems: the \" PATh Facility \" and \" OSG's Open Science Pool (OSPool) \". This will help prepare you for working on a variety of different HTC systems. PATh Facility: The PATh Facility provides researchers with dedicated HTC resources and the ability to run larger and longer jobs . 
The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. OSG's Open Science Pool: The OSPool provides researchers with opportunistic resources and the ability to run many smaller and shorter jobs simultaneously . The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs. Exercise Goal \u00b6 The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises. Logging In \u00b6 Today, you will use a High Throughput Computing system known as the \" PATh Facility \". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. You will log in to the access point of the PATh Facility, which is called ap1.facility.path-cc.io using the username you previously created. To log in, use a Secure Shell (SSH) client. From a Mac or Linux computer, start the Terminal app and run the below ssh command, replacing with your username: $ ssh @ap1.facility.path-cc.io On Windows, we recommend a free client called PuTTY , but any SSH client should be fine. If you need help finding or using an SSH client, ask the instructors for help right away ! Running Commands \u00b6 In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: username@ap1 $ hostname path-ap2001 Note In the first line of the example above, the username@ap1 $ part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters hostname . The second line of the example, without the prompt, shows the output of the command; you do not type this part, either. Here are a few other commands that you can try (the examples below do not show the output from each command): username@ap1 $ whoami username@ap1 $ date username@ap1 $ uname -a A suggestion for the day: try typing into the command line as many of the commands as you can. Copy-and-paste is fine, of course, but you WILL learn more if you take the time to type each command yourself. Organizing Your Workspace \u00b6 You will be doing many different exercises over the next few days, many of them on this access point. Each exercise may use many files, once finished. To avoid confusion, it may be useful to create a separate directory for each exercise. For instance, for the rest of this exercise, you may wish to create and use a directory named intro-1.1-login , or something like that. username@ap1 $ mkdir intro-1.1-login username@ap1 $ cd intro-1.1-login Showing the Version of HTCondor \u00b6 HTCondor is installed on this server. But what version? You can ask HTCondor itself: username@ap1 $ condor_version $ CondorVersion: 23.9.0 2024-06-27 BuildID: 742143 PackageID: 23.9.0-0.742143 GitSHA: 68fde429 RC $ $ CondorPlatform: x86_64_AlmaLinux8 $ As you can see from the output, we are using HTCondor 23.9.0. Reference Materials \u00b6 Here are a few links to reference materials that might be interesting after the school (or perhaps during). HTCondor manuals ; it is probably best to read the manual corresponding to the version of HTCondor that you use.
That link points to the latest version of the manual, but you can switch versions using the toggle in the lower left corner of that page.","title":"1.1 - Log in and look around"},{"location":"materials/htcondor/part1-ex1-login/#htc-exercise-11-log-in-and-look-around","text":"","title":"HTC Exercise 1.1: Log In and Look Around"},{"location":"materials/htcondor/part1-ex1-login/#background","text":"There are different High Throughput Computing (HTC) systems at universities, government facilities, and other institutions around the world, and they may have different user experiences. For example, some systems have dedicated resources (which means your job will be guaranteed a certain amount of resources/time to complete), while other systems have opportunistic, backfill resources (which means your job can take advantage of some resources, but those resources could be removed at any time). Other systems have a mix of dedicated and opportunistic resources. During the OSG School, you will practice on two different HTC systems: the \" PATh Facility \" and \" OSG's Open Science Pool (OSPool) \". This will help prepare you for working on a variety of different HTC systems. PATh Facility: The PATh Facility provides researchers with dedicated HTC resources and the ability to run larger and longer jobs . The HTC execute pool is composed of approximately 30,000 cores and 36 A100 GPUs. OSG's Open Science Pool: The OSPool provides researchers with opportunistic resources and the ability to run many smaller and shorter jobs simultaneously . The OSPool is composed of approximately 60,000+ cores and dozens of different GPUs.","title":"Background"},{"location":"materials/htcondor/part1-ex1-login/#exercise-goal","text":"The goal of this first exercise is to log in to the PATh Facility access point and look around a little bit, which will take only a few minutes. If you have trouble getting SSH access to the submit server, ask the instructors right away! Gaining access is critical for all remaining exercises.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex1-login/#logging-in","text":"Today, you will use a High Throughput Computing system known as the \" PATh Facility \". The PATh Facility provides users with dedicated resources and longer runtimes than OSG's Open Science Pool. You will log in to the access point of the PATh Facility, which is called ap1.facility.path-cc.io using the username you previously created. To log in, use a Secure Shell (SSH) client. From a Mac or Linux computer, start the Terminal app and run the below ssh command, replacing with your username: $ ssh @ap1.facility.path-cc.io On Windows, we recommend a free client called PuTTY , but any SSH client should be fine. If you need help finding or using an SSH client, ask the instructors for help right away !","title":"Logging In"},{"location":"materials/htcondor/part1-ex1-login/#running-commands","text":"In the exercises, we will show commands that you are supposed to type or copy into the command line, like this: username@ap1 $ hostname path-ap2001 Note In the first line of the example above, the username@ap1 $ part is meant to show the Linux command-line prompt. You do not type this part! Further, your actual prompt probably is a bit different, and that is expected. So in the example above, the command that you type at your own prompt is just the eight characters hostname . The second line of the example, without the prompt, shows the output of the command; you do not type this part, either.
Here are a few other commands that you can try (the examples below do not show the output from each command): username@ap1 $ whoami username@ap1 $ date username@ap1 $ uname -a A suggestion for the day: try typing into the command line as many of the commands as you can. Copy-and-paste is fine, of course, but you WILL learn more if you take the time to type each command yourself.","title":"Running Commands"},{"location":"materials/htcondor/part1-ex1-login/#organizing-your-workspace","text":"You will be doing many different exercises over the next few days, many of them on this access point. Each exercise may use many files, once finished. To avoid confusion, it may be useful to create a separate directory for each exercise. For instance, for the rest of this exercise, you may wish to create and use a directory named intro-1.1-login , or something like that. username@ap1 $ mkdir intro-1.1-login username@ap1 $ cd intro-1.1-login","title":"Organizing Your Workspace"},{"location":"materials/htcondor/part1-ex1-login/#showing-the-version-of-htcondor","text":"HTCondor is installed on this server. But what version? You can ask HTCondor itself: username@ap1 $ condor_version $ CondorVersion: 23.9.0 2024-06-27 BuildID: 742143 PackageID: 23.9.0-0.742143 GitSHA: 68fde429 RC $ $ CondorPlatform: x86_64_AlmaLinux8 $ As you can see from the output, we are using HTCondor 23.9.0.","title":"Showing the Version of HTCondor"},{"location":"materials/htcondor/part1-ex1-login/#reference-materials","text":"Here are a few links to reference materials that might be interesting after the school (or perhaps during). HTCondor manuals ; it is probably best to read the manual corresponding to the version of HTCondor that you use. That link points to the latest version of the manual, but you can switch versions using the toggle in the lower left corner of that page.","title":"Reference Materials"},{"location":"materials/htcondor/part1-ex2-commands/","text":"HTC Exercise 1.2: Experiment With HTCondor Commands \u00b6 Exercise Goal \u00b6 The goal of this exercise is to learn about two very important HTCondor commands, condor_q and condor_status . They will be useful for monitoring your jobs and available execute point slots (respectively) throughout the week. This exercise should take only a few minutes. Viewing Slots \u00b6 As discussed in the lecture, the condor_status command is used to view the current state of slots in an HTCondor pool. At its most basic, the command is: username@ap1 $ condor_status When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. TIP: You can widen your terminal window, which may help you to see all details of the output better.
Here is some example output (what you see will be longer): slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 This output consists of 8 columns: Col Example Meaning Name slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n Full slot name (including the hostname) OpSys LINUX Operating system Arch X86_64 Slot architecture (e.g., Intel 64 bit) State Claimed State of the slot ( Unclaimed is available, Owner is being used by the machine owner, Claimed is matched to a job) Activity Busy Is there activity on the slot? LoadAv 0.930 Load average, a measure of CPU activity on the slot Mem 1024 Memory available to the slot, in MB ActvtyTime 0+02:42:08 Amount of time spent in current activity (days + hours:minutes:seconds) At the end of the slot listing, there is a summary. Here is an example: Machines Owner Claimed Unclaimed Matched Preempting Drain X86_64/LINUX 10831 0 10194 631 0 0 6 X86_64/WINDOWS 2 2 0 0 0 0 0 Total 10833 2 10194 631 0 0 6 There is one row of summary for each machine (i.e. \"slot\") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool. Questions: \u00b6 When you run condor_status , how many 64-bit Linux slots are available? (Hint: Unclaimed = available.) What percent of the total slots are currently claimed by a job? (Note: there is a rapid turnover of slots, which is what allows users with new submission to have jobs start quickly.) How have these numbers changed (if at all) when you run the condor_status command again? Viewing Whole Machines, Only \u00b6 Also try out the -compact for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. username@ap1 $ condor_status -compact How has the column information changed? Viewing Jobs \u00b6 The condor_q command lists jobs that are on this access point machine and that are running or waiting to run. The _q part of the name is meant to suggest the word \u201cqueue\u201d, or list of job sets waiting to finish. Viewing Your Own Jobs \u00b6 The default behavior of the command lists only your jobs: username@ap1 $ condor_q The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set (\"batch\") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 09:59:31 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 This output consists of 8 (or 9) columns: Col Example Meaning OWNER alice The user ID of the user who submitted the job BATCH_NAME run_ffmpeg.sh The executable or \"jobbatchname\" specified within the submit file(s) SUBMITTED 7/12 09:58 The date and time when the job was submitted DONE _ Number of jobs in this batch that have completed RUN _ Number of jobs in this batch that are currently running IDLE 1 Number of jobs in this batch that are idle, waiting for a match HOLD _ Column will show up if there are jobs on \"hold\" because something about the submission/setup needs to be corrected by the user TOTAL 1 Total number of jobs in this batch JOB_IDS 18801.0 Job ID or range of Job IDs in this batch At the end of the job listing, there is a summary. Here is a sample: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended It shows total counts of jobs in the different possible states. Questions: For the sample above, when was the job submitted? For the sample above, was the job running or not yet? How can you tell? Viewing Everyone\u2019s Jobs \u00b6 By default, the condor_q command shows your jobs only. To see everyone\u2019s jobs that are queued on the machine, add the -all option: username@ap1 $ condor_q -all How many jobs are queued in total (i.e., running or waiting to run)? How many jobs from this submit machine are running right now? Viewing Jobs without the Default \"batch\" Mode \u00b6 The condor_q output, by default, groups \"batches\" of jobs together (if they were submitted with the same submit file or \"jobbatchname\"). To see more information for EVERY job on a separate line of output, use the -nobatch option to condor_q : username@ap1 $ condor_q -all -nobatch How has the column information changed? (Below is an example of the top of the output.) -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 11:58:44 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18801.0 alice 7/12 09:58 0+00:00:00 I 0 0.0 run_ffmpeg.sh 18997.0 s16_martincum 7/12 10:59 0+00:00:32 I 0 733.0 runR.pl 1_0 run_perm.R 1 0 10 19027.5 s16_martincum 7/12 11:06 0+00:09:20 I 0 2198.0 runR.pl 1_5 run_perm.R 1 5 1000 The -nobatch output shows a line for every job and consists of 8 columns: Col Example Meaning ID 18801.0 Job ID, which is the cluster , a dot character ( . ), and the process OWNER alice The user ID of the user who submitted the job SUBMITTED 7/12 09:58 The date and time when the job was submitted RUN_TIME 0+00:00:00 Total time spent running so far (days + hours:minutes:seconds) ST I Status of job: I is Idle (waiting to run), R is Running, H is Held, etc. PRI 0 Job priority (see next lecture) SIZE 0.0 Current run-time memory usage, in MB CMD run_ffmpeg.sh The executable command (with arguments) to be run In future exercises, you'll want to switch between condor_q and condor_q -nobatch to see different types of information about YOUR jobs. Extra Information \u00b6 Both condor_status and condor_q have many command-line options, some of which significantly change their output. You will explore a few of the most useful options in future exercises, but if you want to experiment now, go ahead! 
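For instance, condor_q -all -hold lists only jobs in the hold state along with the reason each one was put on hold, and condor_status -total prints just the summary table from the bottom of the normal output (both are standard options, mentioned here simply as things to try).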
There are a few ways to learn more about the commands: Use the (brief) built-in help for the commands, e.g.: condor_q -h Read the installed man(ual) pages for the commands, e.g.: man condor_q Find the command in the online manual ; note: the text online is the same as the man text, only formatted for the web","title":"1.2 - Experiment with HTCondor commands"},{"location":"materials/htcondor/part1-ex2-commands/#htc-exercise-12-experiment-with-htcondor-commands","text":"","title":"HTC Exercise 1.2: Experiment With HTCondor Commands"},{"location":"materials/htcondor/part1-ex2-commands/#exercise-goal","text":"The goal of this exercise is to learn about two very important HTCondor commands, condor_q and condor_status . They will be useful for monitoring your jobs and available execute point slots (respectively) throughout the week. This exercise should take only a few minutes.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-slots","text":"As discussed in the lecture, the condor_status command is used to view the current state of slots in an HTCondor pool. At its most basic, the command is: username@ap1 $ condor_status When running this command, there is typically a lot of output printed to the screen. Looking at your terminal output, there is one line per execute point slot. TIP: You can widen your terminal window, which may help you to see all details of the output better. Here is some example output (what you see will be longer): slot1@FIU-PATH-EP.osgvo-docker-pilot-55c74f5b7c-kbs77 LINUX X86_64 Unclaimed Idle 0.000 8053 0+01:14:34 slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n LINUX X86_64 Claimed Busy 0.930 1024 0+02:42:08 slot1@WISC-PATH-EP.osgvo-docker-pilot-7b46dbdbb7-xqkkg LINUX X86_64 Claimed Busy 3.530 1024 0+02:40:24 slot1@SYRA-PATH-EP.osgvo-docker-pilot-gpu-7f6c64d459 LINUX X86_64 Owner Idle 0.300 250 7+03:22:21 This output consists of 8 columns: Col Example Meaning Name slot1@UNL-PATH-EP.osgvo-docker-pilot-9489b6b4-9rf4n Full slot name (including the hostname) OpSys LINUX Operating system Arch X86_64 Slot architecture (e.g., Intel 64 bit) State Claimed State of the slot ( Unclaimed is available, Owner is being used by the machine owner, Claimed is matched to a job) Activity Busy Is there activity on the slot? LoadAv 0.930 Load average, a measure of CPU activity on the slot Mem 1024 Memory available to the slot, in MB ActvtyTime 0+02:42:08 Amount of time spent in current activity (days + hours:minutes:seconds) At the end of the slot listing, there is a summary. Here is an example: Machines Owner Claimed Unclaimed Matched Preempting Drain X86_64/LINUX 10831 0 10194 631 0 0 6 X86_64/WINDOWS 2 2 0 0 0 0 0 Total 10833 2 10194 631 0 0 6 There is one row of summary for each machine (i.e. \"slot\") architecture/operating system combination with columns for the number of slots in each state. The final row gives a summary of slot states for the whole pool.","title":"Viewing Slots"},{"location":"materials/htcondor/part1-ex2-commands/#questions","text":"When you run condor_status , how many 64-bit Linux slots are available? (Hint: Unclaimed = available.) What percent of the total slots are currently claimed by a job? (Note: there is a rapid turnover of slots, which is what allows users with new submission to have jobs start quickly.) 
How have these numbers changed (if at all) when you run the condor_status command again?","title":"Questions:"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-whole-machines-only","text":"Also try out the -compact for a slightly different view of whole machines (i.e. server hostnames), without the individual slots shown. username@ap1 $ condor_status -compact How has the column information changed?","title":"Viewing Whole Machines, Only"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-jobs","text":"The condor_q command lists jobs that are on this access point machine and that are running or waiting to run. The _q part of the name is meant to suggest the word \u201cqueue\u201d, or list of job sets waiting to finish.","title":"Viewing Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-your-own-jobs","text":"The default behavior of the command lists only your jobs: username@ap1 $ condor_q The main part of the output (which will be empty, because you haven't submitted jobs yet) shows one set (\"batch\") of submitted jobs per line. If you had a single job in the queue, it would look something like the below: -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... @ 07/12/23 09:59:31 OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS alice CMD: run_ffmpeg.sh 7/12 09:58 _ _ 1 1 18801.0 This output consists of 8 (or 9) columns: Col Example Meaning OWNER alice The user ID of the user who submitted the job BATCH_NAME run_ffmpeg.sh The executable or \"jobbatchname\" specified within the submit file(s) SUBMITTED 7/12 09:58 The date and time when the job was submitted DONE _ Number of jobs in this batch that have completed RUN _ Number of jobs in this batch that are currently running IDLE 1 Number of jobs in this batch that are idle, waiting for a match HOLD _ Column will show up if there are jobs on \"hold\" because something about the submission/setup needs to be corrected by the user TOTAL 1 Total number of jobs in this batch JOB_IDS 18801.0 Job ID or range of Job IDs in this batch At the end of the job listing, there is a summary. Here is a sample: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended It shows total counts of jobs in the different possible states. Questions: For the sample above, when was the job submitted? For the sample above, was the job running or not yet? How can you tell?","title":"Viewing Your Own Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-everyones-jobs","text":"By default, the condor_q command shows your jobs only. To see everyone\u2019s jobs that are queued on the machine, add the -all option: username@ap1 $ condor_q -all How many jobs are queued in total (i.e., running or waiting to run)? How many jobs from this submit machine are running right now?","title":"Viewing Everyone\u2019s Jobs"},{"location":"materials/htcondor/part1-ex2-commands/#viewing-jobs-without-the-default-batch-mode","text":"The condor_q output, by default, groups \"batches\" of jobs together (if they were submitted with the same submit file or \"jobbatchname\"). To see more information for EVERY job on a separate line of output, use the -nobatch option to condor_q : username@ap1 $ condor_q -all -nobatch How has the column information changed? (Below is an example of the top of the output.) -- Schedd: ap1.facility.path-cc.io : <128.104.100.43:9618?... 
@ 07/12/23 11:58:44 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 18203.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18204.0 s16_alirezakho 7/11 09:51 0+00:00:00 I 0 0.7 pascal 18801.0 alice 7/12 09:58 0+00:00:00 I 0 0.0 run_ffmpeg.sh 18997.0 s16_martincum 7/12 10:59 0+00:00:32 I 0 733.0 runR.pl 1_0 run_perm.R 1 0 10 19027.5 s16_martincum 7/12 11:06 0+00:09:20 I 0 2198.0 runR.pl 1_5 run_perm.R 1 5 1000 The -nobatch output shows a line for every job and consists of 8 columns: Col Example Meaning ID 18801.0 Job ID, which is the cluster , a dot character ( . ), and the process OWNER alice The user ID of the user who submitted the job SUBMITTED 7/12 09:58 The date and time when the job was submitted RUN_TIME 0+00:00:00 Total time spent running so far (days + hours:minutes:seconds) ST I Status of job: I is Idle (waiting to run), R is Running, H is Held, etc. PRI 0 Job priority (see next lecture) SIZE 0.0 Current run-time memory usage, in MB CMD run_ffmpeg.sh The executable command (with arguments) to be run In future exercises, you'll want to switch between condor_q and condor_q -nobatch to see different types of information about YOUR jobs.","title":"Viewing Jobs without the Default \"batch\" Mode"},{"location":"materials/htcondor/part1-ex2-commands/#extra-information","text":"Both condor_status and condor_q have many command-line options, some of which significantly change their output. You will explore a few of the most useful options in future exercises, but if you want to experiment now, go ahead! There are a few ways to learn more about the commands: Use the (brief) built-in help for the commands, e.g.: condor_q -h Read the installed man(ual) pages for the commands, e.g.: man condor_q Find the command in the online manual ; note: the text online is the same as the man text, only formatted for the web","title":"Extra Information"},{"location":"materials/htcondor/part1-ex3-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.3: Run Jobs! \u00b6 Exercise Goal \u00b6 The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs. Running Your First Job \u00b6 Nearly all of the time, when you want to run an HTCondor job, you first write an HTCondor submit file for it. In this section, you will run the same hostname command as in Exercise 1.1, but where this command will run within a job on one of the 'execute' servers on the PATh Facility's HTCondor pool. First, create an example submit file called hostname.sub using your favorite text editor (e.g., nano , vim ) and then transfer the following information to that file: executable = /bin/hostname output = hostname.out error = hostname.err log = hostname.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue Save your submit file using the name hostname.sub . Note You can name the HTCondor submit file using any filename. It's a good practice to always include the .sub extension, but it is not required. This is because the submit file is a simple text file that we are using to pass information to HTCondor. 
The lines of the submit file have the following meanings: Submit Command Explanation executable The name of the program to run (relative to the directory from which you submit). output The filename where HTCondor will write the standard output from your job. error The filename where HTCondor will write the standard error from your job. This particular job is not likely to have any, but it is best to include this line for every job. log The filename where HTCondor will write information about your job run. While not required, it is a really good idea to have a log file for every job. request_* Tells HTCondor how many cpus and how much memory and disk we want, which is not much, because the 'hostname' executable is very small. queue Tells HTCondor to run your job with the settings above. Note that we are not using the arguments or transfer_input_files lines that were mentioned during lecture because the hostname program is all that needs to be transferred from the access point server, and we want to run it without any additional options. Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: username@ap1 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. The actual cluster number will be shown instead of NNNN . If, instead of the text above, there are error messages, read them carefully and then try to correct your submit file or ask for help. Notice that condor_submit returns back to the shell prompt right away. It does not wait for your job to run. Instead, as soon as it has finished submitting your job into the queue, the submit command finishes. View your job in the queue \u00b6 Now, use condor_q and condor_q -nobatch to watch for your job in the queue! You may not even catch the job in the R running state, because the hostname command runs very quickly. When the job itself is finished, it will 'leave' the queue and no longer be listed in the condor_q output. After the job finishes, check for the hostname output in hostname.out , which is where job information printed to the terminal screen will be printed for the job. username@ap1 $ cat hostname.out e171.chtc.wisc.edu The hostname.err file should be empty, unless there were issues running the hostname executable after it was transferred to the slot. The hostname.log is more complex and will be the focus of a later exercise. Running a Job With Arguments \u00b6 Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: username@ap1 $ sleep 60 In an HTCondor submit file, the program (or 'executable') name goes in the executable statement and all remaining arguments go into an arguments statement. For example, if the full command is: username@ap1 $ sleep 60 Then in the submit file, we would put the location of the \"sleep\" program (you can find it with which sleep ) as the job executable , and 60 as the job arguments : executable = /bin/sleep arguments = 60 Let\u2019s try a job submission with arguments. We will use the sleep command shown above, which does nothing (i.e., puts the job to sleep) for the specified number of seconds, then exits normally. It is convenient for simulating a job that takes a while to run. Create a new submit file and save the following text in it. 
executable = /bin/sleep arguments = 60 output = sleep.out error = sleep.err log = sleep.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue You can save the file using any name, but as a reminder, we recommend it uses the .sub file extension. Except for changing a few filenames, this submit file is nearly identical to the last one, except for the addition of the arguments line. Submit this new job to HTCondor. Again, watch for it to run using condor_q and condor_q -nobatch ; check once every 15 seconds or so. Once the job starts running, it will take about 1 minute to run (reminder: the sleep command is telling the job to do nothing for 60 seconds), so you should be able to see it running for a bit. When the job finishes, it will disappear from the queue, but there will be no output in the output or error files, because sleep does not produce any output. Running a Script Job From the Submit Directory \u00b6 So far, we have been running programs (executables) that come with the standard Linux system. More frequently, you will want to run a program that exists within your directory or perhaps a shell script of commands that you'd like to run within a job. In this example, you will write a shell script and a submit file that runs the shell script within a job: Put the following contents into a file named test-script.sh : #!/bin/sh # START echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'System: ' ` uname -spo ` echo \"Program: $0 \" echo \"Args: $* \" echo 'ls: ' ` ls ` # END Add executable permissions to the file (so that it can be run as a program): username@ap1 $ chmod +x test-script.sh Test your script from the command line: username@ap1 $ ./test-script.sh hello 42 Date: Mon Jul 1 14:03:56 CDT 2024 Host: path-ap2001 System: Linux x86_64 GNU/Linux Program: ./test-script.sh Args: hello 42 ls: hostname.err hostname.log hostname.out hostname.sub sleep.log sleep.sub test-script.sh This step is really important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. Further, debugging problems like this one is surprisingly difficult. So, if possible, test your executable and arguments as a command at the command-line first. Write the submit file (this should be getting easier by now): executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue In this example, the executable that was named in the submit file did not start with a / , so the location of the file is relative to the submit directory itself. In other words, in this format the executable must be in the same directory as the submit file. Note Blank lines between commands and spaces around the = do not matter to HTCondor. For example, this submit file is equivalent to the one above: executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus=1 request_memory=1GB request_disk=1GB queue Use whitespace to make things clear to you , the user. Submit the job, wait for it to finish, and check the standard output file (and standard error file, which should be empty). What do you notice about the lines returned for \"Program\" and \"ls\"? Remember that only files pertaining to this job will be in the job working directory on the execute point server. 
You're also seeing the effects of HTCondor's need to standardize some filenames when running your job, though they are named as you expect in the submission directory (per the submit file contents). Extra Challenge \u00b6 Note There are Extra Challenges throughout the school curriculum. You may be better off coming back to these after you've completed all other exercises for your current working session. Below is a Python script that does something similar to the shell script above. Run this Python script using HTCondor. #!/usr/bin/env python3 \"\"\"Extra Challenge for OSG School Written by Tim Cartwright Submitted to CHTC by #YOUR_NAME# \"\"\" import getpass import os import platform import socket import sys import time arguments = None if len ( sys . argv ) > 1 : arguments = '\"' + ' ' . join ( sys . argv [ 1 :]) + '\"' print ( __doc__ , file = sys . stderr ) print ( 'Time :' , time . strftime ( '%Y-%m- %d ( %a ) %H:%M:%S %Z' )) print ( 'Host :' , getpass . getuser (), '@' , socket . gethostname ()) uname = platform . uname () print ( \"System :\" , uname [ 0 ], uname [ 2 ], uname [ 4 ]) print ( \"Version :\" , platform . python_version ()) print ( \"Program :\" , sys . executable ) print ( 'Script :' , os . path . abspath ( __file__ )) print ( 'Args :' , arguments )","title":"1.3 - Run jobs!"},{"location":"materials/htcondor/part1-ex3-jobs/#htc-exercise-13-run-jobs","text":"","title":"HTC Exercise 1.3: Run Jobs!"},{"location":"materials/htcondor/part1-ex3-jobs/#exercise-goal","text":"The goal of this exercise is to submit jobs to HTCondor and have them run on the PATh Facility. This is a huge step in learning to use an HTC system! This exercise will take longer than the first two, short ones. If you are having any problems getting the jobs to run, please ask the instructors! It is very important that you know how to run jobs.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex3-jobs/#running-your-first-job","text":"Nearly all of the time, when you want to run an HTCondor job, you first write an HTCondor submit file for it. In this section, you will run the same hostname command as in Exercise 1.1, but where this command will run within a job on one of the 'execute' servers on the PATh Facility's HTCondor pool. First, create an example submit file called hostname.sub using your favorite text editor (e.g., nano , vim ) and then transfer the following information to that file: executable = /bin/hostname output = hostname.out error = hostname.err log = hostname.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue Save your submit file using the name hostname.sub . Note You can name the HTCondor submit file using any filename. It's a good practice to always include the .sub extension, but it is not required. This is because the submit file is a simple text file that we are using to pass information to HTCondor. The lines of the submit file have the following meanings: Submit Command Explanation executable The name of the program to run (relative to the directory from which you submit). output The filename where HTCondor will write the standard output from your job. error The filename where HTCondor will write the standard error from your job. This particular job is not likely to have any, but it is best to include this line for every job. log The filename where HTCondor will write information about your job run. While not required, it is a really good idea to have a log file for every job. 
request_* Tells HTCondor how many cpus and how much memory and disk we want, which is not much, because the 'hostname' executable is very small. queue Tells HTCondor to run your job with the settings above. Note that we are not using the arguments or transfer_input_files lines that were mentioned during lecture because the hostname program is all that needs to be transferred from the access point server, and we want to run it without any additional options. Double-check your submit file, so that it matches the text above. Then, tell HTCondor to run your job: username@ap1 $ condor_submit hostname.sub Submitting job(s). 1 job(s) submitted to cluster NNNN. The actual cluster number will be shown instead of NNNN . If, instead of the text above, there are error messages, read them carefully and then try to correct your submit file or ask for help. Notice that condor_submit returns back to the shell prompt right away. It does not wait for your job to run. Instead, as soon as it has finished submitting your job into the queue, the submit command finishes.","title":"Running Your First Job"},{"location":"materials/htcondor/part1-ex3-jobs/#view-your-job-in-the-queue","text":"Now, use condor_q and condor_q -nobatch to watch for your job in the queue! You may not even catch the job in the R running state, because the hostname command runs very quickly. When the job itself is finished, it will 'leave' the queue and no longer be listed in the condor_q output. After the job finishes, check for the hostname output in hostname.out , which is where job information printed to the terminal screen will be printed for the job. username@ap1 $ cat hostname.out e171.chtc.wisc.edu The hostname.err file should be empty, unless there were issues running the hostname executable after it was transferred to the slot. The hostname.log is more complex and will be the focus of a later exercise.","title":"View your job in the queue"},{"location":"materials/htcondor/part1-ex3-jobs/#running-a-job-with-arguments","text":"Very often, when you run a command on the command line, it includes arguments (i.e. options) after the program name, as in the below examples: username@ap1 $ sleep 60 In an HTCondor submit file, the program (or 'executable') name goes in the executable statement and all remaining arguments go into an arguments statement. For example, if the full command is: username@ap1 $ sleep 60 Then in the submit file, we would put the location of the \"sleep\" program (you can find it with which sleep ) as the job executable , and 60 as the job arguments : executable = /bin/sleep arguments = 60 Let\u2019s try a job submission with arguments. We will use the sleep command shown above, which does nothing (i.e., puts the job to sleep) for the specified number of seconds, then exits normally. It is convenient for simulating a job that takes a while to run. Create a new submit file and save the following text in it. executable = /bin/sleep arguments = 60 output = sleep.out error = sleep.err log = sleep.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue You can save the file using any name, but as a reminder, we recommend it uses the .sub file extension. Except for changing a few filenames, this submit file is nearly identical to the last one, except for the addition of the arguments line. Submit this new job to HTCondor. Again, watch for it to run using condor_q and condor_q -nobatch ; check once every 15 seconds or so. 
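Instead of re-typing condor_q every 15 seconds, you can let the standard Linux watch utility repeat it for you (a convenience tip added here, not part of the original exercise); press Ctrl-C to stop watching:
username@ap1 $ watch -n 15 condor_q
Depending on the HTCondor version installed on the access point, a condor_watch_q command may also be available, which keeps updating the display on its own.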
Once the job starts running, it will take about 1 minute to run (reminder: the sleep command is telling the job to do nothing for 60 seconds), so you should be able to see it running for a bit. When the job finishes, it will disappear from the queue, but there will be no output in the output or error files, because sleep does not produce any output.","title":"Running a Job With Arguments"},{"location":"materials/htcondor/part1-ex3-jobs/#running-a-script-job-from-the-submit-directory","text":"So far, we have been running programs (executables) that come with the standard Linux system. More frequently, you will want to run a program that exists within your directory or perhaps a shell script of commands that you'd like to run within a job. In this example, you will write a shell script and a submit file that runs the shell script within a job: Put the following contents into a file named test-script.sh : #!/bin/sh # START echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'System: ' ` uname -spo ` echo \"Program: $0 \" echo \"Args: $* \" echo 'ls: ' ` ls ` # END Add executable permissions to the file (so that it can be run as a program): username@ap1 $ chmod +x test-script.sh Test your script from the command line: username@ap1 $ ./test-script.sh hello 42 Date: Mon Jul 1 14:03:56 CDT 2024 Host: path-ap2001 System: Linux x86_64 GNU/Linux Program: ./test-script.sh Args: hello 42 ls: hostname.err hostname.log hostname.out hostname.sub sleep.log sleep.sub test-script.sh This step is really important! If you cannot run your executable from the command-line, HTCondor probably cannot run it on another machine, either. Further, debugging problems like this one is surprisingly difficult. So, if possible, test your executable and arguments as a command at the command-line first. Write the submit file (this should be getting easier by now): executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus = 1 request_memory = 1GB request_disk = 1GB queue In this example, the executable that was named in the submit file did not start with a / , so the location of the file is relative to the submit directory itself. In other words, in this format the executable must be in the same directory as the submit file. Note Blank lines between commands and spaces around the = do not matter to HTCondor. For example, this submit file is equivalent to the one above: executable = test-script.sh arguments = foo bar baz output = script.out error = script.err log = script.log request_cpus=1 request_memory=1GB request_disk=1GB queue Use whitespace to make things clear to you , the user. Submit the job, wait for it to finish, and check the standard output file (and standard error file, which should be empty). What do you notice about the lines returned for \"Program\" and \"ls\"? Remember that only files pertaining to this job will be in the job working directory on the execute point server. You're also seeing the effects of HTCondor's need to standardize some filenames when running your job, though they are named as you expect in the submission directory (per the submit file contents).","title":"Running a Script Job From the Submit Directory"},{"location":"materials/htcondor/part1-ex3-jobs/#extra-challenge","text":"Note There are Extra Challenges throughout the school curriculum. You may be better off coming back to these after you've completed all other exercises for your current working session. 
Below is a Python script that does something similar to the shell script above. Run this Python script using HTCondor. #!/usr/bin/env python3 \"\"\"Extra Challenge for OSG School Written by Tim Cartwright Submitted to CHTC by #YOUR_NAME# \"\"\" import getpass import os import platform import socket import sys import time arguments = None if len ( sys . argv ) > 1 : arguments = '\"' + ' ' . join ( sys . argv [ 1 :]) + '\"' print ( __doc__ , file = sys . stderr ) print ( 'Time :' , time . strftime ( '%Y-%m- %d ( %a ) %H:%M:%S %Z' )) print ( 'Host :' , getpass . getuser (), '@' , socket . gethostname ()) uname = platform . uname () print ( \"System :\" , uname [ 0 ], uname [ 2 ], uname [ 4 ]) print ( \"Version :\" , platform . python_version ()) print ( \"Program :\" , sys . executable ) print ( 'Script :' , os . path . abspath ( __file__ )) print ( 'Args :' , arguments )","title":"Extra Challenge"},{"location":"materials/htcondor/part1-ex4-logs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.4: Read and Interpret Log Files \u00b6 Exercise Goal \u00b6 The goal of this exercise is to learn how to understand the contents of a job's log file, which is essentially a \"history\" of the steps HTCondor took to run your job. If you suspect something has gone wrong with your job, the log is the a great place to start looking for indications of whether things might have gone wrong (in addition to the .err file). This exercise is short, but you'll want to at least read over it before moving on. Reading a Log File \u00b6 For this exercise, we can examine a log file for any previous job that you have run. The example output below is based on the sleep 60 job. A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are all of the event headings from the sleep job log (detailed output in between headings has been omitted here): 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> 040 (5739.000.000) 2024-07-10 10:45:10 Started transferring input files 040 (5739.000.000) 2024-07-10 10:45:10 Finished transferring input files 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 040 (5739.000.000) 2024-07-10 10:45:20 Started transferring output files 040 (5739.000.000) 2024-07-10 10:45:20 Finished transferring output files 006 (5739.000.000) 2024-07-10 10:46:11 Image size of job updated: 4072 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. There is a lot of extra information in those lines, but you can see: The job ID: cluster 5739, process 0 (written 000 ) The date and local time of each event A brief description of the event: submission, execution, some information updates, and termination Some events provide no information in addition to the heading. For example: 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> ... Note Each event ends with a line that contains only 3 dots: ... However, some lines have additional information to help you quickly understand where and how your job is running. 
For example: 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w CondorScratchDir = \"/pilot/osgvo-pilot-2q71K9/execute/dir_9316\" Cpus = 1 Disk = 174321444 GLIDEIN_ResourceName = \"WISC-PATH-IDPL-EP\" GPUs = 0 Memory = 8192 ... The SlotName is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in GLIDEIN_ResourceName The CondorScratchDir is the name of the scratch directory that was created by HTCondor for your job to run inside The Cpu , GPUs , Disk , and Memory values provide the maximum amount of each resource your job has used while running Another example of is the periodic update: 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 1 - MemoryUsage of job (MB) 72 - ResidentSetSize of job (KB) ... These updates record the amount of memory that the job is using on the execute machine. This can be helpful information, so that in future runs of the job, you can tell HTCondor how much memory you will need. The job termination event includes a lot of very useful information: 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 27848 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 27848 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 40 30 4203309 Memory (MB) : 1 1 1 Job terminated of its own accord at 2024-07-10 10:46:11 with exit-code 0. ... Probably the most interesting information is: The return value or exit code ( 0 here, means the executable completed and didn't indicate any internal errors; non-zero usually means failure) The total number of bytes transferred each way, which could be useful if your network is slow The Partitionable Resources table, especially disk and memory usage, which will inform larger submissions. There are many other kinds of events, but the ones above will occur in almost every job log. Understanding When Job Log Events Are Written \u00b6 When are events written to the job log file? Let\u2019s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive. Change the sleep job submit file, so that the job sleeps for 2 minutes (= 120 seconds) Submit the updated sleep job As soon as the condor_submit command finishes, hit the return key a few times, to create some blank lines Right away, run a command to show the log file and keep showing updates as they occur: username@ap1 $ tail -f sleep.log Watch the output carefully. When do events appear in the log file? After the termination event appears, press Control-C to end the tail command and return to the shell prompt. Understanding How HTCondor Writes Files \u00b6 When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let\u2019s find out! For this exercise, we can use the hostname job from earlier. Edit the hostname submit file so that it uses new and unique filenames for output, error, and log files. Alternatively, delete any existing output, error, and log files from previous runs of the hostname job. 
Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture) Wait for all three jobs to finish Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines? Examine the log file\u2026 carefully: What happened there? Pay close attention to the times and job IDs of the events. For further clarification about how HTCondor handles these files, reach out to your mentor or one of the other school staff.","title":"1.4 - Read and interpret log files"},{"location":"materials/htcondor/part1-ex4-logs/#htc-exercise-14-read-and-interpret-log-files","text":"","title":"HTC Exercise 1.4: Read and Interpret Log Files"},{"location":"materials/htcondor/part1-ex4-logs/#exercise-goal","text":"The goal of this exercise is to learn how to understand the contents of a job's log file, which is essentially a \"history\" of the steps HTCondor took to run your job. If you suspect something has gone wrong with your job, the log is the a great place to start looking for indications of whether things might have gone wrong (in addition to the .err file). This exercise is short, but you'll want to at least read over it before moving on.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex4-logs/#reading-a-log-file","text":"For this exercise, we can examine a log file for any previous job that you have run. The example output below is based on the sleep 60 job. A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are all of the event headings from the sleep job log (detailed output in between headings has been omitted here): 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> 040 (5739.000.000) 2024-07-10 10:45:10 Started transferring input files 040 (5739.000.000) 2024-07-10 10:45:10 Finished transferring input files 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 040 (5739.000.000) 2024-07-10 10:45:20 Started transferring output files 040 (5739.000.000) 2024-07-10 10:45:20 Finished transferring output files 006 (5739.000.000) 2024-07-10 10:46:11 Image size of job updated: 4072 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. There is a lot of extra information in those lines, but you can see: The job ID: cluster 5739, process 0 (written 000 ) The date and local time of each event A brief description of the event: submission, execution, some information updates, and termination Some events provide no information in addition to the heading. For example: 000 (5739.000.000) 2024-07-10 10:44:20 Job submitted from host: <128.104.100.43:9618?addrs=...> ... Note Each event ends with a line that contains only 3 dots: ... However, some lines have additional information to help you quickly understand where and how your job is running. For example: 001 (5739.000.000) 2024-07-10 10:45:11 Job executing on host: <128.104.55.42:9618?addrs=...> SlotName: slot1@WISC-PATH-IDPL-EP.osgvo-docker-pilot-idpl-7c6575d494-2sj5w CondorScratchDir = \"/pilot/osgvo-pilot-2q71K9/execute/dir_9316\" Cpus = 1 Disk = 174321444 GLIDEIN_ResourceName = \"WISC-PATH-IDPL-EP\" GPUs = 0 Memory = 8192 ... 
The SlotName is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in GLIDEIN_ResourceName The CondorScratchDir is the name of the scratch directory that was created by HTCondor for your job to run inside The Cpu , GPUs , Disk , and Memory values provide the maximum amount of each resource your job has used while running Another example of is the periodic update: 006 (5739.000.000) 2024-07-10 10:45:20 Image size of job updated: 72 1 - MemoryUsage of job (MB) 72 - ResidentSetSize of job (KB) ... These updates record the amount of memory that the job is using on the execute machine. This can be helpful information, so that in future runs of the job, you can tell HTCondor how much memory you will need. The job termination event includes a lot of very useful information: 005 (5739.000.000) 2024-07-10 10:46:11 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 27848 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 27848 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 40 30 4203309 Memory (MB) : 1 1 1 Job terminated of its own accord at 2024-07-10 10:46:11 with exit-code 0. ... Probably the most interesting information is: The return value or exit code ( 0 here, means the executable completed and didn't indicate any internal errors; non-zero usually means failure) The total number of bytes transferred each way, which could be useful if your network is slow The Partitionable Resources table, especially disk and memory usage, which will inform larger submissions. There are many other kinds of events, but the ones above will occur in almost every job log.","title":"Reading a Log File"},{"location":"materials/htcondor/part1-ex4-logs/#understanding-when-job-log-events-are-written","text":"When are events written to the job log file? Let\u2019s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive. Change the sleep job submit file, so that the job sleeps for 2 minutes (= 120 seconds) Submit the updated sleep job As soon as the condor_submit command finishes, hit the return key a few times, to create some blank lines Right away, run a command to show the log file and keep showing updates as they occur: username@ap1 $ tail -f sleep.log Watch the output carefully. When do events appear in the log file? After the termination event appears, press Control-C to end the tail command and return to the shell prompt.","title":"Understanding When Job Log Events Are Written"},{"location":"materials/htcondor/part1-ex4-logs/#understanding-how-htcondor-writes-files","text":"When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let\u2019s find out! For this exercise, we can use the hostname job from earlier. Edit the hostname submit file so that it uses new and unique filenames for output, error, and log files. Alternatively, delete any existing output, error, and log files from previous runs of the hostname job. 
Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture) Wait for all three jobs to finish Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines? Examine the log file\u2026 carefully: What happened there? Pay close attention to the times and job IDs of the events. For further clarification about how HTCondor handles these files, reach out to your mentor or one of the other school staff.","title":"Understanding How HTCondor Writes Files"},{"location":"materials/htcondor/part1-ex5-request/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.5: Declare Resource Needs \u00b6 The goal of this exercise is to demonstrate how to test and tune the request_X statements in a submit file for when you don't know what resources your job needs. There are three special resource request statements that you can use (optionally) in an HTCondor submit file: request_cpus for the number of CPUs your job will use. A value of \"1\" is always a great starting point, but some software can use more than \"1\" (however, most softwares will use an argument to control this number). request_memory for the maximum amount of run-time memory your job may use. request_disk for the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job). HTCondor defaults to certain reasonable values for these request settings, so you do not need to use them to get small jobs to run. However, it is in YOUR best interest to always estimate resource requests before submitting any job, and to definitely tune your requests before submitting multiple jobs. In many HTCondor pools: If your job goes over the request values, it may be removed from the execute machine and held (status 'H' in the condor_q output, awaiting action on your part) without saving any partial job output files. So it is a disadvantage to not declare your resource needs or if you underestimate them. Conversely, if you overestimate them by too much, your jobs will match to fewer slots and take longer to match to a slot to begin running. Additionally, by hogging up resources that you don't need, other users may be deprived of the resources they require. In the long run, it works better for all users of the pool if you declare what you really need. But how do you know what to request? In particular, we are concerned with memory and disk here; requesting multiple CPUs and using them is covered a bit in later school materials, but true HTC splits work up into jobs that each use as few CPU cores as possible (one CPU core is always best to have the most jobs running). Determining Resource Needs Before Running Any Jobs \u00b6 Note If you are running short on time, you can skip to \"Determining Resource Needs By Running Test Jobs\", below, but try to come back and read over this part at some point. It can be very difficult to predict the memory needs of your running program without running tests. Typically, the memory size of a job changes over time, making the task even trickier. If you have knowledge ahead of time about your job\u2019s maximum memory needs, use that, or maybe a number that's just a bit higher, to ensure your job has enough memory to complete. 
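For example (the numbers here are purely illustrative, not taken from the exercise): if earlier runs showed your program peaking at roughly 1.5 GB of memory, a request with a little headroom, such as request_memory = 2GB, is a reasonable choice. The next paragraphs cover what to do when you have no prior knowledge at all.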
If this is your first time running your job, you can request a fairly large amount of memory (as high as what's on your laptop or other server, if you know your program can run without crashing) for a first test job, OR you can run the program locally and \"watch\" it: Examining a Running Program on a Local Computer \u00b6 When working on a shared access point, you should not run computationally-intensive work because it can use resources needed by HTCondor to manage the queue for all uses. However, you may have access to other computers (your laptop, for example, or another server) where you can observe the memory usage of a program. The downside is that you'll have to watch a program run for essentially the entire time, to make sure you catch the maximum memory usage. For Memory: \u00b6 On Mac and Windows, for example, the \"Activity Monitor\" and \"Task Manager\" applications may be useful. On a Mac or Linux system, you can use the ps command or the top command in the Terminal to watch a running program and see (roughly) how much memory it is using. Full coverage of these tools is beyond the scope of this exercise, but here are two quick examples: Using ps : username@ap1 $ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash alice 25864 0.0 0.0 65624 996 pts/0 R+ 13:52 0:00 ps ux alice 30052 0.0 0.0 90720 2456 ? S Jun22 0:00 sshd: alice@pts/2 alice 30053 0.0 0.0 66096 1624 pts/2 Ss+ Jun22 0:00 -bash The Resident Set Size ( RSS ) column, highlighted above, gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value. Using top : username@ap1 $ top -u top - 13:55:31 up 11 days, 20:59, 5 users, load average: 0.12, 0.12, 0.09 Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 98.5%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 4001440k total, 3558028k used, 443412k free, 258568k buffers Swap: 4194296k total, 148k used, 4194148k free, 2960760k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24342 alice 15 0 90224 1864 1096 S 0.0 0.0 0:00.26 sshd 24343 alice 15 0 66096 1580 1232 S 0.0 0.0 0:00.07 bash 25927 alice 15 0 12760 1196 836 R 0.0 0.0 0:00.01 top 30052 alice 16 0 90720 2456 1112 S 0.0 0.1 0:00.69 sshd 30053 alice 18 0 66096 1624 1236 S 0.0 0.0 0:00.37 bash The top command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter q to quit the interactive display. Again, the highlighted RES column shows an approximation of memory usage. For Disk: \u00b6 Determining disk needs may be a bit easier, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts everything in your job sandbox toward your job\u2019s disk usage: The executable itself All \"input\" files (anything else that gets transferred TO the job, even if you don't think of it as \"input\") All files created during the job (broadly defined as \"output\"), including the captured standard output and error files that you list in the submit file. All temporary files created in the sandbox, even if they get deleted by the executable before it's done. 
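A rough way to add this up ahead of time is to gather everything the job will use into a single directory on a local machine and check its size (a sketch added here; my_job_dir is just a placeholder name):
username@ap1 $ du -sh my_job_dir/
username@ap1 $ ls -lh my_job_dir/
du -sh reports the combined size of everything in the directory, and ls -lh shows individual file sizes; remember to leave room for output and temporary files that will only appear while the job runs.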
If you can run your program within a single directory on a local computer (not on the access point), you should be able to view files and their sizes with the ls and du commands. Determining Resource Needs By Running Test Jobs (BEST) \u00b6 Despite the techniques mentioned above, by far the easiest approach to measuring your job\u2019s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs. For example, here is a strange Python script that does not do anything useful, but consumes some real resources while running: #!/usr/bin/env python3 import time import os size = 1000000 numbers = [] for i in range ( size ): numbers . append ( str ( i )) with open ( 'numbers.txt' , 'w' ) as tempfile : tempfile . write ( ' ' . join ( numbers )) time . sleep ( 60 ) Without trying to figure out what this code does or how many resources it uses, create a submit file for it, and run it once with HTCondor, starting with somewhat high memory requests (\"1GB\" for memory and disk is a good starting point, unless you think the job will use far more). When it is done, examine the log file. In particular, we care about these lines: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 6739 1048576 8022934 Memory (MB) : 3 1024 1024 So, now we know that HTCondor saw that the job used 6,739 KB of disk (= about 6.5 MB) and 3 MB of memory! This is a great technique for determining the real resource needs of your job. If you think resource needs vary from run to run, submit a few sample jobs and look at all the results. You should round up your resource requests a little, just in case your job occasionally uses more resources. Setting Resource Requirements \u00b6 Once you know your job\u2019s resource requirements, it is easy to declare them in your submit file. For example, taking our results above as an example, we might slightly increase our requests above what was used, just to be safe: # rounded up from 3 MB request_memory = 4MB # rounded up from 6.5 MB request_disk = 7MB Pay close attention to units: Without explicit units, request_memory is in MB (megabytes) Without explicit units, request_disk is in KB (kilobytes) Allowable units are KB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes) HTCondor translates these requirements into attributes that become part of the job's requirements expression. However, do not put your CPU, memory, and disk requirements directly into the requirements expression; use the request_XXX statements instead. If you still have time in this working session, Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used. After changing the requirements in your submit file, did your job run successfully? If not, why? (Hint: HTCondor polls a job's resource use on a timer. How long are these jobs running for?)","title":"1.5 - Determining resource needs"},{"location":"materials/htcondor/part1-ex5-request/#htc-exercise-15-declare-resource-needs","text":"The goal of this exercise is to demonstrate how to test and tune the request_X statements in a submit file for when you don't know what resources your job needs. There are three special resource request statements that you can use (optionally) in an HTCondor submit file: request_cpus for the number of CPUs your job will use. 
A value of \"1\" is always a great starting point, but some software can use more than \"1\" (however, most softwares will use an argument to control this number). request_memory for the maximum amount of run-time memory your job may use. request_disk for the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job). HTCondor defaults to certain reasonable values for these request settings, so you do not need to use them to get small jobs to run. However, it is in YOUR best interest to always estimate resource requests before submitting any job, and to definitely tune your requests before submitting multiple jobs. In many HTCondor pools: If your job goes over the request values, it may be removed from the execute machine and held (status 'H' in the condor_q output, awaiting action on your part) without saving any partial job output files. So it is a disadvantage to not declare your resource needs or if you underestimate them. Conversely, if you overestimate them by too much, your jobs will match to fewer slots and take longer to match to a slot to begin running. Additionally, by hogging up resources that you don't need, other users may be deprived of the resources they require. In the long run, it works better for all users of the pool if you declare what you really need. But how do you know what to request? In particular, we are concerned with memory and disk here; requesting multiple CPUs and using them is covered a bit in later school materials, but true HTC splits work up into jobs that each use as few CPU cores as possible (one CPU core is always best to have the most jobs running).","title":"HTC Exercise 1.5: Declare Resource Needs"},{"location":"materials/htcondor/part1-ex5-request/#determining-resource-needs-before-running-any-jobs","text":"Note If you are running short on time, you can skip to \"Determining Resource Needs By Running Test Jobs\", below, but try to come back and read over this part at some point. It can be very difficult to predict the memory needs of your running program without running tests. Typically, the memory size of a job changes over time, making the task even trickier. If you have knowledge ahead of time about your job\u2019s maximum memory needs, use that, or maybe a number that's just a bit higher, to ensure your job has enough memory to complete. If this is your first time running your job, you can request a fairly large amount of memory (as high as what's on your laptop or other server, if you know your program can run without crashing) for a first test job, OR you can run the program locally and \"watch\" it:","title":"Determining Resource Needs Before Running Any Jobs"},{"location":"materials/htcondor/part1-ex5-request/#examining-a-running-program-on-a-local-computer","text":"When working on a shared access point, you should not run computationally-intensive work because it can use resources needed by HTCondor to manage the queue for all uses. However, you may have access to other computers (your laptop, for example, or another server) where you can observe the memory usage of a program. The downside is that you'll have to watch a program run for essentially the entire time, to make sure you catch the maximum memory usage.","title":"Examining a Running Program on a Local Computer"},{"location":"materials/htcondor/part1-ex5-request/#for-memory","text":"On Mac and Windows, for example, the \"Activity Monitor\" and \"Task Manager\" applications may be useful. 
On a Mac or Linux system, you can use the ps command or the top command in the Terminal to watch a running program and see (roughly) how much memory it is using. Full coverage of these tools is beyond the scope of this exercise, but here are two quick examples: Using ps : username@ap1 $ ps ux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND alice 24342 0.0 0.0 90224 1864 ? S 13:39 0:00 sshd: alice@pts/0 alice 24343 0.0 0.0 66096 1580 pts/0 Ss 13:39 0:00 -bash alice 25864 0.0 0.0 65624 996 pts/0 R+ 13:52 0:00 ps ux alice 30052 0.0 0.0 90720 2456 ? S Jun22 0:00 sshd: alice@pts/2 alice 30053 0.0 0.0 66096 1624 pts/2 Ss+ Jun22 0:00 -bash The Resident Set Size ( RSS ) column, highlighted above, gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value. Using top : username@ap1 $ top -u top - 13:55:31 up 11 days, 20:59, 5 users, load average: 0.12, 0.12, 0.09 Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie Cpu(s): 1.2%us, 0.1%sy, 0.0%ni, 98.5%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 4001440k total, 3558028k used, 443412k free, 258568k buffers Swap: 4194296k total, 148k used, 4194148k free, 2960760k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24342 alice 15 0 90224 1864 1096 S 0.0 0.0 0:00.26 sshd 24343 alice 15 0 66096 1580 1232 S 0.0 0.0 0:00.07 bash 25927 alice 15 0 12760 1196 836 R 0.0 0.0 0:00.01 top 30052 alice 16 0 90720 2456 1112 S 0.0 0.1 0:00.69 sshd 30053 alice 18 0 66096 1624 1236 S 0.0 0.0 0:00.37 bash The top command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter q to quit the interactive display. Again, the highlighted RES column shows an approximation of memory usage.","title":"For Memory:"},{"location":"materials/htcondor/part1-ex5-request/#for-disk","text":"Determining disk needs may be a bit easier, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts everything in your job sandbox toward your job\u2019s disk usage: The executable itself All \"input\" files (anything else that gets transferred TO the job, even if you don't think of it as \"input\") All files created during the job (broadly defined as \"output\"), including the captured standard output and error files that you list in the submit file. All temporary files created in the sandbox, even if they get deleted by the executable before it's done. If you can run your program within a single directory on a local computer (not on the access point), you should be able to view files and their sizes with the ls and du commands.","title":"For Disk:"},{"location":"materials/htcondor/part1-ex5-request/#determining-resource-needs-by-running-test-jobs-best","text":"Despite the techniques mentioned above, by far the easiest approach to measuring your job\u2019s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs. For example, here is a strange Python script that does not do anything useful, but consumes some real resources while running: #!/usr/bin/env python3 import time import os size = 1000000 numbers = [] for i in range ( size ): numbers . append ( str ( i )) with open ( 'numbers.txt' , 'w' ) as tempfile : tempfile . write ( ' ' . 
join ( numbers )) time . sleep ( 60 ) Without trying to figure out what this code does or how many resources it uses, create a submit file for it, and run it once with HTCondor, starting with somewhat high memory requests (\"1GB\" for memory and disk is a good starting point, unless you think the job will use far more). When it is done, examine the log file. In particular, we care about these lines: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 6739 1048576 8022934 Memory (MB) : 3 1024 1024 So, now we know that HTCondor saw that the job used 6,739 KB of disk (= about 6.5 MB) and 3 MB of memory! This is a great technique for determining the real resource needs of your job. If you think resource needs vary from run to run, submit a few sample jobs and look at all the results. You should round up your resource requests a little, just in case your job occasionally uses more resources.","title":"Determining Resource Needs By Running Test Jobs (BEST)"},{"location":"materials/htcondor/part1-ex5-request/#setting-resource-requirements","text":"Once you know your job\u2019s resource requirements, it is easy to declare them in your submit file. For example, taking our results above as an example, we might slightly increase our requests above what was used, just to be safe: # rounded up from 3 MB request_memory = 4MB # rounded up from 6.5 MB request_disk = 7MB Pay close attention to units: Without explicit units, request_memory is in MB (megabytes) Without explicit units, request_disk is in KB (kilobytes) Allowable units are KB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes) HTCondor translates these requirements into attributes that become part of the job's requirements expression. However, do not put your CPU, memory, and disk requirements directly into the requirements expression; use the request_XXX statements instead. If you still have time in this working session, Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used. After changing the requirements in your submit file, did your job run successfully? If not, why? (Hint: HTCondor polls a job's resource use on a timer. How long are these jobs running for?)","title":"Setting Resource Requirements"},{"location":"materials/htcondor/part1-ex6-remove/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 1.6: Remove Jobs From the Queue \u00b6 Exercise Goal \u00b6 The goal of this exercise is to show you how to remove jobs from the queue. This is helpful if you make a mistake, do not want to wait for a job to complete, or otherwise need to fix things. For example, if some test jobs go on hold for using too much memory or disk, you may want to just remove them, edit the submit files, and then submit again. Skip this exercise and come back to it if you are short on time, or until you need to remove jobs for other exercises Note Please remember to remove any jobs from the queue that you are no longer interested in. Otherwise, the queue will start to get very long with jobs that will waste resources (and decrease your priority), or that may never run (if they're on hold, or have other issues keeping them from matching). This exercise is very short, but if you are out of time, you can come back to it later. Removing a Job or Cluster From the Queue \u00b6 To practice removing jobs from the queue, you need a job in the queue! 
Submit a job from an earlier exercise Determine the job ID ( cluster.process ) from the condor_submit output or from condor_q Remove the job: username@ap1 $ condor_rm Use the full job ID this time, e.g. 5759.0 . Did the job leave the queue immediately? If not, about how long did it take? So far, we have created job clusters that contain only one job process (the .0 part of the job ID). That will change soon, so it is good to know how to remove a specific job ID. However, it is possible to remove all jobs that are part of a cluster at once. Simply omit the job process (the .0 part of the job ID) in the condor_rm command: username@ap1 $ condor_rm Finally, you can include many job clusters and full job IDs in a single condor_rm command. For example: username@ap1 $ condor_rm 5768 5769 5770 .0 5771 .2 Removing All of Your Jobs \u00b6 If you really want to remove all of your jobs at once, you can do that with: username@ap1 $ condor_rm If you want to test it: (optional, though you'll likely need this in the future) Quickly submit several jobs from past exercises View the jobs in the queue with condor_q Remove them all with the above command Use condor_q to track progress In case you are wondering, you can remove only your own jobs. HTCondor administrators can remove anyone\u2019s jobs, so be nice to them.","title":"1.6 - Remove jobs from the queue"},{"location":"materials/htcondor/part1-ex6-remove/#htc-exercise-16-remove-jobs-from-the-queue","text":"","title":"HTC Exercise 1.6: Remove Jobs From the Queue"},{"location":"materials/htcondor/part1-ex6-remove/#exercise-goal","text":"The goal of this exercise is to show you how to remove jobs from the queue. This is helpful if you make a mistake, do not want to wait for a job to complete, or otherwise need to fix things. For example, if some test jobs go on hold for using too much memory or disk, you may want to just remove them, edit the submit files, and then submit again. Skip this exercise and come back to it if you are short on time, or until you need to remove jobs for other exercises Note Please remember to remove any jobs from the queue that you are no longer interested in. Otherwise, the queue will start to get very long with jobs that will waste resources (and decrease your priority), or that may never run (if they're on hold, or have other issues keeping them from matching). This exercise is very short, but if you are out of time, you can come back to it later.","title":"Exercise Goal"},{"location":"materials/htcondor/part1-ex6-remove/#removing-a-job-or-cluster-from-the-queue","text":"To practice removing jobs from the queue, you need a job in the queue! Submit a job from an earlier exercise Determine the job ID ( cluster.process ) from the condor_submit output or from condor_q Remove the job: username@ap1 $ condor_rm Use the full job ID this time, e.g. 5759.0 . Did the job leave the queue immediately? If not, about how long did it take? So far, we have created job clusters that contain only one job process (the .0 part of the job ID). That will change soon, so it is good to know how to remove a specific job ID. However, it is possible to remove all jobs that are part of a cluster at once. Simply omit the job process (the .0 part of the job ID) in the condor_rm command: username@ap1 $ condor_rm Finally, you can include many job clusters and full job IDs in a single condor_rm command. 
For example: username@ap1 $ condor_rm 5768 5769 5770 .0 5771 .2","title":"Removing a Job or Cluster From the Queue"},{"location":"materials/htcondor/part1-ex6-remove/#removing-all-of-your-jobs","text":"If you really want to remove all of your jobs at once, you can do that with: username@ap1 $ condor_rm If you want to test it: (optional, though you'll likely need this in the future) Quickly submit several jobs from past exercises View the jobs in the queue with condor_q Remove them all with the above command Use condor_q to track progress In case you are wondering, you can remove only your own jobs. HTCondor administrators can remove anyone\u2019s jobs, so be nice to them.","title":"Removing All of Your Jobs"},{"location":"materials/htcondor/part1-ex7-compile/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Bonus Exercise 1.7: Compile and Run Some C Code \u00b6 The goal of this exercise is to show that compiled code works just fine in HTCondor. It is mainly of interest to people who have their own C code to run (or C++, or really any compiled code, although Java would be handled a bit differently). Preparing a C Executable \u00b6 When preparing a C program for HTCondor, it is best to compile and link the executable statically, so that it does not depend on external libraries and their particular versions. Why is this important? When your compiled C program is sent to another machine for execution, that machine may not have the same libraries that you have on your submit machine (or wherever you compile the program). If the libraries are not available or are the wrong versions, your program may fail or, perhaps worse, silently produce the wrong results. Here is a simple C program to try using (thanks, Alain Roy): #include #include #include int main ( int argc , char ** argv ) { int sleep_time ; int input ; int failure ; if ( argc != 3 ) { printf ( \"Usage: simple \\n \" ); failure = 1 ; } else { sleep_time = atoi ( argv [ 1 ]); input = atoi ( argv [ 2 ]); printf ( \"Thinking really hard for %d seconds... \\n \" , sleep_time ); sleep ( sleep_time ); printf ( \"We calculated: %d \\n \" , input * 2 ); failure = 0 ; } return failure ; } Save that code to a file, for example, simple.c . Compile the program with static linking: username@ap1 $ gcc -static -o simple simple.c As always, test that you can run your command from the command line first. First, without arguments to make sure it fails correctly: username@ap1 $ ./simple and then with valid arguments: username@ap1 $ ./simple 5 21 Running a Compiled C Program \u00b6 Running the compiled program is no different than running any other program. Here is a submit file for the C program (call it simple.sub): executable = simple arguments = \"60 64\" output = c-program.out error = c-program.err log = c-program.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1MB queue Then submit the job as usual! In summary, it is easy to work with statically linked compiled code. It is possible to handle dynamically linked compiled code, but it is trickier. We will only mention this topic briefly during the lecture on Software.","title":"Bonus Exercise 1.7 - Compile and run some C code"},{"location":"materials/htcondor/part1-ex7-compile/#htc-bonus-exercise-17-compile-and-run-some-c-code","text":"The goal of this exercise is to show that compiled code works just fine in HTCondor. 
It is mainly of interest to people who have their own C code to run (or C++, or really any compiled code, although Java would be handled a bit differently).","title":"HTC Bonus Exercise 1.7: Compile and Run Some C Code"},{"location":"materials/htcondor/part1-ex7-compile/#preparing-a-c-executable","text":"When preparing a C program for HTCondor, it is best to compile and link the executable statically, so that it does not depend on external libraries and their particular versions. Why is this important? When your compiled C program is sent to another machine for execution, that machine may not have the same libraries that you have on your submit machine (or wherever you compile the program). If the libraries are not available or are the wrong versions, your program may fail or, perhaps worse, silently produce the wrong results. Here is a simple C program to try using (thanks, Alain Roy): #include #include #include int main ( int argc , char ** argv ) { int sleep_time ; int input ; int failure ; if ( argc != 3 ) { printf ( \"Usage: simple \\n \" ); failure = 1 ; } else { sleep_time = atoi ( argv [ 1 ]); input = atoi ( argv [ 2 ]); printf ( \"Thinking really hard for %d seconds... \\n \" , sleep_time ); sleep ( sleep_time ); printf ( \"We calculated: %d \\n \" , input * 2 ); failure = 0 ; } return failure ; } Save that code to a file, for example, simple.c . Compile the program with static linking: username@ap1 $ gcc -static -o simple simple.c As always, test that you can run your command from the command line first. First, without arguments to make sure it fails correctly: username@ap1 $ ./simple and then with valid arguments: username@ap1 $ ./simple 5 21","title":"Preparing a C Executable"},{"location":"materials/htcondor/part1-ex7-compile/#running-a-compiled-c-program","text":"Running the compiled program is no different than running any other program. Here is a submit file for the C program (call it simple.sub): executable = simple arguments = \"60 64\" output = c-program.out error = c-program.err log = c-program.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1MB queue Then submit the job as usual! In summary, it is easy to work with statically linked compiled code. It is possible to handle dynamically linked compiled code, but it is trickier. We will only mention this topic briefly during the lecture on Software.","title":"Running a Compiled C Program"},{"location":"materials/htcondor/part1-ex8-queue/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 1.8: Explore condor_q \u00b6 The goal of this exercise is try out some of the most common options to the condor_q command, so that you can view jobs effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_q expert! Selecting Jobs \u00b6 The condor_q program has many options for selecting which jobs are listed. 
You have already seen that the default mode is to show only your jobs in \"batch\" mode: username@ap1 $ condor_q You've seen that you can view all jobs (all users) in the submit node's queue by using the -all argument: username@ap1 $ condor_q -all And you've seen that you can view more details about queued jobs, with each separate job on a single line using the -nobatch option: username@ap1 $ condor_q -nobatch username@ap1 $ condor_q -all -nobatch Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? username@ap1 $ condor_q To list just the jobs associated with a single cluster number: username@ap1 $ condor_q For example, if you want to see the jobs in cluster 5678 (i.e., 5678.0 , 5678.1 , etc.), you use condor_q 5678 . To list a specific job (i.e., cluster.process, as in 5678.0): username@ap1 $ condor_q For example, to see job ID 5678.1, you use condor_q 5678.1 . Note You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for all of the named clusters and/or job IDs are listed. Let\u2019s get some practice using condor_q selections! Using a previous exercise, submit several sleep jobs. List all jobs in the queue \u2014 are there others besides your own? Practice using all forms of condor_q that you have learned: List just your jobs, with and without batching. List a specific cluster. List a specific job ID. Try listing several users at once. Try listing several clusters and job IDs at once. When there are a variety of jobs in the queue, try combining a username and a different user's cluster or job ID in the same command \u2014 what happens? Viewing a Job ClassAd \u00b6 You may have wondered why it is useful to be able to list a single job ID using condor_q . By itself, it may not be that useful. But, in combination with another option, it is very useful! If you add the -long option to condor_q (or its short form, -l ), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80\u201390 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: username@ap1 $ condor_q -long The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from condor_q output (except with some line breaks added to the Requirements attribute): MyType = \"Job\" Err = \"sleep.err\" UserLog = \"/home/cat/intro-2.1-queue/sleep.log\" Requirements = ( IsOSGSchoolSlot =?= true ) && ( TARGET.Arch == \"X86_64\" ) && ( TARGET.OpSys == \"LINUX\" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer ) ClusterId = 2420 WhenToTransferOutput = \"ON_EXIT\" Owner = \"cat\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Out = \"sleep.out\" Cmd = \"/bin/sleep\" Arguments = \"120\" Note Attributes are listed in no particular order and may change from time to time. Do not assume anything about the order of attributes in condor_q output. See what you can find in a job ClassAd from your own job. Using a previous exercise, submit a sleep job that sleeps for at least 3 minutes (180 seconds). 
Before the job executes, capture its ClassAd and save to a file: condor_q -l > classad-1.txt After the job starts execution but before it finishes, capture its ClassAd again and save to a file condor_q -l > classad-2.txt Now examine each saved ClassAd file. Here are a few things to look for: Can you find attributes that came from your submit file? (E.g., Cmd, Arguments, Out, Err, UserLog, and so forth) Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements) How many of the following attributes can you guess the meaning of? DiskUsage ImageSize BytesSent JobStatus Why Is My Job Not Running? \u00b6 Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help. To ask HTCondor why your job is not running, add the -better-analyze option to condor_q for the specific job. For example, for job ID 2423.0, the command is: username@ap1 $ condor_q -better-analyze 2423 .0 Of course, replace the job ID with your own. Let\u2019s submit a job that will never run and see what happens. Here is the submit file to use: executable = /bin/hostname output = norun.out error = norun.err log = norun.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_disk = 10MB request_memory = 8TB queue (Do you see what I did?) Save and submit this file. Run condor_q -better-analyze on the job ID. There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with some lines omitted): -- Schedd: ap1.facility.path-cc.io : <128.105.68.66:9618?... ... Job 98096.000 defines the following attributes: RequestDisk = 10240 RequestMemory = 8388608 The Requirements expression for job 98096.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [1] 11227 Target.OpSysMajorVer == 7 [9] 13098 TARGET.Disk >= RequestDisk [11] 0 TARGET.Memory >= RequestMemory No successful match recorded. Last failed match: Fri Jul 12 15:36:30 2019 Reason for last match failure: no match found 98096.000: Run analysis summary ignoring user priority. Of 710 machines, 710 are rejected by your job's requirements 0 reject your job because of their own requirements 0 match and are already running your jobs 0 match but are serving other users 0 are able to run your job ... At the end of the summary, condor_q provides a breakdown of how machines and their own requirements match against my own job's requirements. 710 total machines were considered above, and all of them were rejected based on my job's requirements . In other words, I am asking for something that is not available. But what? Further up in the output, there is an analysis of the job's requirements, along with how many slots within the pool match each of those requirements. The example above reports that 13098 slots match our small disk request request, but none of the slots matched the TARGET.Memory >= RequestMemory condition. The output also reports the value used for the RequestMemory attribute: my job asked for 8 terabytes of memory (8,388,608 MB) -- of course no machines matched that part of the expression! That's a lot of memory on today's machines. The output from condor_q -analyze (and condor_q -better-analyze ) may be helpful or it may not be, depending on your exact case. The example above was constructed so that it would be obvious what the problem was. 
But in many cases, this is a good place to start looking if you are having problems matching. Bonus: Automatic Formatting Output \u00b6 Do this exercise only if you have time, though it's pretty awesome! There is a way to select the specific job attributes you want condor_q to tell you about with the -autoformat or -af option. In this case, HTCondor decides for you how to format the data you ask for from job ClassAd(s). (To tell HTCondor how to specially format this information, yourself, you could use the -format option, which we're not covering.) To use autoformatting, use the -af option followed by the attribute name, for each attribute that you want to output: username@ap1 $ condor_q -all -af Owner ClusterId Cmd moate 2418 /share/test.sh cat 2421 /bin/sleep cat 2422 /bin/sleep Bonus Question : If you wanted to print out the Requirements expression of a job, how would you do that with -af ? Is the output what you expected? (HINT: for ClassAd attributes like \"Requirements\" that are long expressions, instead of plain values, you can use -af:r to view the expressions, instead of what it's current evaluation.) References \u00b6 As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_q man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the HTCondor Manual","title":"Bonus Exercise 1.8 - Explore condor_q"},{"location":"materials/htcondor/part1-ex8-queue/#bonus-htc-exercise-18-explore-condor_q","text":"The goal of this exercise is try out some of the most common options to the condor_q command, so that you can view jobs effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_q expert!","title":"Bonus HTC Exercise 1.8: Explore condor_q"},{"location":"materials/htcondor/part1-ex8-queue/#selecting-jobs","text":"The condor_q program has many options for selecting which jobs are listed. You have already seen that the default mode is to show only your jobs in \"batch\" mode: username@ap1 $ condor_q You've seen that you can view all jobs (all users) in the submit node's queue by using the -all argument: username@ap1 $ condor_q -all And you've seen that you can view more details about queued jobs, with each separate job on a single line using the -nobatch option: username@ap1 $ condor_q -nobatch username@ap1 $ condor_q -all -nobatch Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once? username@ap1 $ condor_q To list just the jobs associated with a single cluster number: username@ap1 $ condor_q For example, if you want to see the jobs in cluster 5678 (i.e., 5678.0 , 5678.1 , etc.), you use condor_q 5678 . To list a specific job (i.e., cluster.process, as in 5678.0): username@ap1 $ condor_q For example, to see job ID 5678.1, you use condor_q 5678.1 . Note You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for all of the named clusters and/or job IDs are listed. Let\u2019s get some practice using condor_q selections! Using a previous exercise, submit several sleep jobs. List all jobs in the queue \u2014 are there others besides your own? Practice using all forms of condor_q that you have learned: List just your jobs, with and without batching. List a specific cluster. List a specific job ID. Try listing several users at once. 
Try listing several clusters and job IDs at once. When there are a variety of jobs in the queue, try combining a username and a different user's cluster or job ID in the same command \u2014 what happens?","title":"Selecting Jobs"},{"location":"materials/htcondor/part1-ex8-queue/#viewing-a-job-classad","text":"You may have wondered why it is useful to be able to list a single job ID using condor_q . By itself, it may not be that useful. But, in combination with another option, it is very useful! If you add the -long option to condor_q (or its short form, -l ), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80\u201390 attributes (or more), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like: username@ap1 $ condor_q -long The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from condor_q output (except with some line breaks added to the Requirements attribute): MyType = \"Job\" Err = \"sleep.err\" UserLog = \"/home/cat/intro-2.1-queue/sleep.log\" Requirements = ( IsOSGSchoolSlot =?= true ) && ( TARGET.Arch == \"X86_64\" ) && ( TARGET.OpSys == \"LINUX\" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer ) ClusterId = 2420 WhenToTransferOutput = \"ON_EXIT\" Owner = \"cat\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Out = \"sleep.out\" Cmd = \"/bin/sleep\" Arguments = \"120\" Note Attributes are listed in no particular order and may change from time to time. Do not assume anything about the order of attributes in condor_q output. See what you can find in a job ClassAd from your own job. Using a previous exercise, submit a sleep job that sleeps for at least 3 minutes (180 seconds). Before the job executes, capture its ClassAd and save to a file: condor_q -l > classad-1.txt After the job starts execution but before it finishes, capture its ClassAd again and save to a file condor_q -l > classad-2.txt Now examine each saved ClassAd file. Here are a few things to look for: Can you find attributes that came from your submit file? (E.g., Cmd, Arguments, Out, Err, UserLog, and so forth) Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements) How many of the following attributes can you guess the meaning of? DiskUsage ImageSize BytesSent JobStatus","title":"Viewing a Job ClassAd"},{"location":"materials/htcondor/part1-ex8-queue/#why-is-my-job-not-running","text":"Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help. To ask HTCondor why your job is not running, add the -better-analyze option to condor_q for the specific job. For example, for job ID 2423.0, the command is: username@ap1 $ condor_q -better-analyze 2423 .0 Of course, replace the job ID with your own. Let\u2019s submit a job that will never run and see what happens. 
Here is the submit file to use: executable = /bin/hostname output = norun.out error = norun.err log = norun.log should_transfer_files = YES when_to_transfer_output = ON_EXIT request_disk = 10MB request_memory = 8TB queue (Do you see what I did?) Save and submit this file. Run condor_q -better-analyze on the job ID. There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with some lines omitted): -- Schedd: ap1.facility.path-cc.io : <128.105.68.66:9618?... ... Job 98096.000 defines the following attributes: RequestDisk = 10240 RequestMemory = 8388608 The Requirements expression for job 98096.000 reduces to these conditions: Slots Step Matched Condition ----- -------- --------- [1] 11227 Target.OpSysMajorVer == 7 [9] 13098 TARGET.Disk >= RequestDisk [11] 0 TARGET.Memory >= RequestMemory No successful match recorded. Last failed match: Fri Jul 12 15:36:30 2019 Reason for last match failure: no match found 98096.000: Run analysis summary ignoring user priority. Of 710 machines, 710 are rejected by your job's requirements 0 reject your job because of their own requirements 0 match and are already running your jobs 0 match but are serving other users 0 are able to run your job ... At the end of the summary, condor_q provides a breakdown of how machines and their own requirements match against my own job's requirements. 710 total machines were considered above, and all of them were rejected based on my job's requirements . In other words, I am asking for something that is not available. But what? Further up in the output, there is an analysis of the job's requirements, along with how many slots within the pool match each of those requirements. The example above reports that 13098 slots match our small disk request request, but none of the slots matched the TARGET.Memory >= RequestMemory condition. The output also reports the value used for the RequestMemory attribute: my job asked for 8 terabytes of memory (8,388,608 MB) -- of course no machines matched that part of the expression! That's a lot of memory on today's machines. The output from condor_q -analyze (and condor_q -better-analyze ) may be helpful or it may not be, depending on your exact case. The example above was constructed so that it would be obvious what the problem was. But in many cases, this is a good place to start looking if you are having problems matching.","title":"Why Is My Job Not Running?"},{"location":"materials/htcondor/part1-ex8-queue/#bonus-automatic-formatting-output","text":"Do this exercise only if you have time, though it's pretty awesome! There is a way to select the specific job attributes you want condor_q to tell you about with the -autoformat or -af option. In this case, HTCondor decides for you how to format the data you ask for from job ClassAd(s). (To tell HTCondor how to specially format this information, yourself, you could use the -format option, which we're not covering.) To use autoformatting, use the -af option followed by the attribute name, for each attribute that you want to output: username@ap1 $ condor_q -all -af Owner ClusterId Cmd moate 2418 /share/test.sh cat 2421 /bin/sleep cat 2422 /bin/sleep Bonus Question : If you wanted to print out the Requirements expression of a job, how would you do that with -af ? Is the output what you expected? 
(HINT: for ClassAd attributes like \"Requirements\" that are long expressions, instead of plain values, you can use -af:r to view the expressions, instead of what it's current evaluation.)","title":"Bonus: Automatic Formatting Output"},{"location":"materials/htcondor/part1-ex8-queue/#references","text":"As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_q man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the HTCondor Manual","title":"References"},{"location":"materials/htcondor/part1-ex9-status/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 1.9: Explore condor_status \u00b6 The goal of this exercise is try out some of the most common options to the condor_status command, so that you can view slots effectively. The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_status expert! Selecting Slots \u00b6 The condor_status program has many options for selecting which slots are listed. You've already learned the basic condor_status and the condor_status -compact variation (which you may wish to retry now, before proceeding). Another convenient option is to list only those slots that are available now: username@ap1 $ condor_status -avail Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all condor_status output, not just with the -avail option. Similar to condor_q , you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: username@ap1 $ condor_status For example, if you want to see the slots on e2337.chtc.wisc.edu (in the CHTC pool): username@ap1 $ condor_status e2337.chtc.wisc.edu To list a specific slot on a machine: username@ap1 $ condor_status @ For example, to see the \u201cfirst\u201d slot on the machine above: username@ap1 $ condor_status slot1@e2337.chtc.wisc.edu Note You can name more than one hostname, slot, or combination thereof on the command line, in which case slots for all of the named hostnames and/or slots are listed. Let\u2019s get some practice using condor_status selections! List all slots in the pool \u2014 how many are there total? Practice using all forms of condor_status that you have learned: List the available slots. List the slots on a specific machine (e.g., e2337.chtc.wisc.edu ). List a specific slot from that machine. Try listing the slots from a few (but not all) machines at once. Try using a mix of hostnames and slot IDs at once. Viewing a Slot ClassAd \u00b6 Just as with condor_q , you can use condor_status to view the complete ClassAd for a given slot (often confusingly called the \u201cmachine\u201d ad): username@ap1 $ condor_status -long @ Because slot ClassAds may have 150\u2013200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. 
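For instance, one convenient way to browse a single slot's ad at your own pace is to save it to a file and then search it; the slot and hostname below are only examples, so substitute ones that actually exist in your pool:

username@ap1 $ condor_status -long slot1@e2337.chtc.wisc.edu > machine-ad.txt
username@ap1 $ grep -i memory machine-ad.txt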
Here are some examples of common, interesting attributes taken directly from condor_status output: OpSys = \"LINUX\" DetectedCpus = 24 OpSysAndVer = \"SL6\" MyType = \"Machine\" LoadAvg = 0.99 TotalDisk = 798098404 OSIssue = \"Scientific Linux release 6.6 (Carbon)\" TotalMemory = 24016 Machine = \"e242.chtc.wisc.edu\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Memory = 1024 As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name \u201cmachine ad\u201d) and about the slot in particular. Go ahead and examine a machine ClassAd now. Viewing Slots by ClassAd Expression \u00b6 Often, it is helpful to view slots that meet some particular criteria. For example, if you know that your job needs a lot of memory to run, you may want to see how many high-memory slots there are and whether they are busy. You can filter the list of slots like this using the -constraint option and a ClassAd expression. For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: username@ap1 $ condor_status -constraint 'OpSysAndVer == \"CentOS7\" && Memory >= 16000' Note Be very careful with using quote characters appropriately in these commands. In the example above, the single quotes ( ' ) are for the shell, so that the entire expression is passed to condor_status untouched, and the double quotes ( \" ) surround a string value within the expression itself. Currently on PATh, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the condor_status -constraint command. Note The condor_q command accepts the -constraint option as well! As you might expect, the option allows you to limit the jobs that are listed based on a ClassAd expression. Bonus: Formatting Output \u00b6 The condor_status command accepts the same -autoformat ( -af ) options that condor_q accepts, and the options have the same meanings in both commands. Of course, the attributes available in machine ads may differ from the ones that are available in job ads. Use the HTCondor Manual or look at individual slot ClassAds to get a better idea of what attributes are available. For example, I was curious about the host name and operating system of the slots with more than 32GB of memory: username@ap1 $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000' If you like, spend a few minutes now or later experimenting with condor_status formatting. References \u00b6 As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_status man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the appendix of the HTCondor Manual Read about ClassAd expressions in section 4.1.4 of the HTCondor Manual","title":"Bonus Exercise 1.9- Explore condor_stataus"},{"location":"materials/htcondor/part1-ex9-status/#bonus-htc-exercise-19-explore-condor_status","text":"The goal of this exercise is try out some of the most common options to the condor_status command, so that you can view slots effectively. 
The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_status expert!","title":"Bonus HTC Exercise 1.9: Explore condor_status"},{"location":"materials/htcondor/part1-ex9-status/#selecting-slots","text":"The condor_status program has many options for selecting which slots are listed. You've already learned the basic condor_status and the condor_status -compact variation (which you may wish to retry now, before proceeding). Another convenient option is to list only those slots that are available now: username@ap1 $ condor_status -avail Of course, the individual execute machines only report their slots to the collector at certain time intervals, so this list will not reflect the up-to-the-second reality of all slots. But this limitation is true of all condor_status output, not just with the -avail option. Similar to condor_q , you can limit the slots that are listed in two easy ways. To list just the slots on a specific machine: username@ap1 $ condor_status For example, if you want to see the slots on e2337.chtc.wisc.edu (in the CHTC pool): username@ap1 $ condor_status e2337.chtc.wisc.edu To list a specific slot on a machine: username@ap1 $ condor_status @ For example, to see the \u201cfirst\u201d slot on the machine above: username@ap1 $ condor_status slot1@e2337.chtc.wisc.edu Note You can name more than one hostname, slot, or combination thereof on the command line, in which case slots for all of the named hostnames and/or slots are listed. Let\u2019s get some practice using condor_status selections! List all slots in the pool \u2014 how many are there total? Practice using all forms of condor_status that you have learned: List the available slots. List the slots on a specific machine (e.g., e2337.chtc.wisc.edu ). List a specific slot from that machine. Try listing the slots from a few (but not all) machines at once. Try using a mix of hostnames and slot IDs at once.","title":"Selecting Slots"},{"location":"materials/htcondor/part1-ex9-status/#viewing-a-slot-classad","text":"Just as with condor_q , you can use condor_status to view the complete ClassAd for a given slot (often confusingly called the \u201cmachine\u201d ad): username@ap1 $ condor_status -long @ Because slot ClassAds may have 150\u2013200 attributes (or more), it probably makes the most sense to show the ClassAd for a single slot at a time, as shown above. Here are some examples of common, interesting attributes taken directly from condor_status output: OpSys = \"LINUX\" DetectedCpus = 24 OpSysAndVer = \"SL6\" MyType = \"Machine\" LoadAvg = 0.99 TotalDisk = 798098404 OSIssue = \"Scientific Linux release 6.6 (Carbon)\" TotalMemory = 24016 Machine = \"e242.chtc.wisc.edu\" CondorVersion = \"$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $\" Memory = 1024 As you may be able to tell, there is a mix of attributes about the machine as a whole (hence the name \u201cmachine ad\u201d) and about the slot in particular. Go ahead and examine a machine ClassAd now.","title":"Viewing a Slot ClassAd"},{"location":"materials/htcondor/part1-ex9-status/#viewing-slots-by-classad-expression","text":"Often, it is helpful to view slots that meet some particular criteria. For example, if you know that your job needs a lot of memory to run, you may want to see how many high-memory slots there are and whether they are busy. You can filter the list of slots like this using the -constraint option and a ClassAd expression. 
For example, suppose we want to list all slots that are running Scientific Linux 7 (operating system) and have at least 16 GB memory available. Note that memory is reported in units of Megabytes. The command is: username@ap1 $ condor_status -constraint 'OpSysAndVer == \"CentOS7\" && Memory >= 16000' Note Be very careful with using quote characters appropriately in these commands. In the example above, the single quotes ( ' ) are for the shell, so that the entire expression is passed to condor_status untouched, and the double quotes ( \" ) surround a string value within the expression itself. Currently on PATh, there are only a few slots that meet these criteria (our high-memory servers, mainly used for metagenomics assemblies). If you are interested in learning more about writing ClassAd expressions, look at section 4.1 and especially 4.1.4 of the HTCondor Manual. This is definitely advanced material, so if you do not want to read it, that is fine. But if you do, take some time to practice writing expressions for the condor_status -constraint command. Note The condor_q command accepts the -constraint option as well! As you might expect, the option allows you to limit the jobs that are listed based on a ClassAd expression.","title":"Viewing Slots by ClassAd Expression"},{"location":"materials/htcondor/part1-ex9-status/#bonus-formatting-output","text":"The condor_status command accepts the same -autoformat ( -af ) options that condor_q accepts, and the options have the same meanings in both commands. Of course, the attributes available in machine ads may differ from the ones that are available in job ads. Use the HTCondor Manual or look at individual slot ClassAds to get a better idea of what attributes are available. For example, I was curious about the host name and operating system of the slots with more than 32GB of memory: username@ap1 $ condor_status -af Machine -af OpSysAndVer -constraint 'Memory >= 32000' If you like, spend a few minutes now or later experimenting with condor_status formatting.","title":"Bonus: Formatting Output"},{"location":"materials/htcondor/part1-ex9-status/#references","text":"As suggested above, if you want to learn more about condor_q , you can do some reading: Read the condor_status man page or HTCondor Manual section (same text) to learn about more options Read about ClassAd attributes in the appendix of the HTCondor Manual Read about ClassAd expressions in section 4.1.4 of the HTCondor Manual","title":"References"},{"location":"materials/htcondor/part2-ex1-files/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.1: Work With Input and Output Files \u00b6 Exercise Goal \u00b6 The goal of this exercise is make input files available to your job on the execute machine and to return output files back created in your job back to you on the access point. This small change significantly adds to the kinds of jobs that you can run. Viewing a Job Sandbox \u00b6 Before you learn to transfer files to and from your job, it is good to understand a bit more about the environment in which your job runs. When the HTCondor starter process prepares to run your job, it creates a new directory for your job and all of its files. We call this directory the job sandbox , because it is your job\u2019s private space to play. Let\u2019s see what is in the job sandbox for a minimal job with no special input or output files. 
Save the script below in a file named sandbox.sh : #!/bin/sh echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'Sandbox: ' ` pwd ` ls -alF # END Create a submit file for this script and submit it. When the job finishes, look at the contents of the output file. In the output file, note the Sandbox: line: That is the full path to your job sandbox for the run. It was created just for your job, and it was removed as soon as your job finished. Next, look at the output that appears after the Sandbox: line; it is the output from the ls command in the script. It shows all of the files in your job sandbox, as they existed at the end of the execution of sandbox.sh . The number of files that you see can change depending on the HTC system you are using, but some of the files you should always see are: .chirp.config Configuration for an advanced feature sandbox.sh Your executable .job.ad The job ClassAd .machine.ad The machine ClassAd _condor_stderr Saved standard error from the job _condor_stdout Saved standard output from the job tmp/ , var/tmp/ Directories in which to put temporary files So, HTCondor wrote copies of the job and machine ads (for use by the job, if desired), transferred your executable ( sandbox.sh ), ran it, and saved its standard output and standard error into files. Notice that your submit file, which was in the same directory on the access point machine as your executable, was not transferred, nor were any other files that happened to be in directory with the submit file. Now that we know something about the sandbox, we can transfer more files to and from it. Running a Job With Input Files \u00b6 Next, you will run a job that requires an input file. Remember, the initial job sandbox will contain only the job executable, unless you tell HTCondor explicitly about every other file that needs to be transferred to the job. Here is a Python script that takes the name of an input file (containing one word per line) from the command line, counts the number of times each (lowercased) word occurs in the text, and prints out the final list of words and their counts. #!/usr/bin/env python3 import os import sys if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' , encoding = 'iso-8859-1' ) as my_file : for line in my_file : word = line . strip () . lower () if word in words : words [ word ] += 1 else : words [ word ] = 1 for word in sorted ( words . keys ()): print ( f ' { words [ word ] : 8d } { word } ' ) Create and save the Python script in a file named freq.py . Download the input file for the script (263K lines, ~1.4 MB) and save it in your submit directory: username@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/intro-2.1-words.txt Create a submit file for the freq.py executable. Add a line called transfer_input_files = to tell HTCondor to transfer the input file to the job: transfer_input_files = intro-2.1-words.txt As with all submit file commands, it does not matter where this line goes, as long as it comes before the word queue . Since we want HTCondor to pass an argument to our Python executable, we need to remember to add an arguments = line in our submit file so that HTCondor knows to pass an argument to the job. Set this arguments = line equal to the argument to the Python script (i.e., the name the input file). Submit the job to HTCondor, wait for it to finish, and check the output! 
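If it helps to check your work, here is a sketch of what such a submit file might look like; the output, error, and log filenames and the resource requests are placeholders, so adjust them for your own setup:

executable = freq.py
arguments = intro-2.1-words.txt
transfer_input_files = intro-2.1-words.txt
output = freq.out
error = freq.err
log = freq.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue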
If things do not work the first time, keep trying! At this point in the exercises, we are telling you less and less explicitly how to do steps that you have done before. If you get stuck, ask for help in the Slack channel. Note If you want to transfer more than one input file, list all of them on a single transfer_input_files command, separated by commas. For example, if there are three input files: transfer_input_files = a.txt, b.txt, c.txt Transferring Output Files \u00b6 So far, we have relied on programs that send their output to the standard output and error streams, which HTCondor captures, saves, and returns back to the submit directory. But what if your program writes one or more files for its output? How do you tell HTCondor to bring them back? Let\u2019s start by exploring what happens to files that a job creates in the sandbox. We will use a very simple method for creating a new file: we will copy an input file to another name. Find or create a small input file (it is fine to use any small file from a previous exercise). Create a submit file that transfers the input file and copies it to another name (as if doing /bin/cp input.txt output.txt on the command line) Make the output filename different than any filenames that are in your submit directory What is the executable line? What is the arguments line? How do you tell HTCondor to transfer the input file? As always, use output , error , and log filenames that are different from previous exercises Submit the job and wait for it to finish. What happened? Can you tell what HTCondor did with the output file that was created (did it end up back on the access point?), after it was created in the job sandbox? Look carefully at the list of files in your submit directory now. Transferring Specific Output Files \u00b6 As you saw in the last exercise, by default HTCondor transfers files that are created in the job sandbox back to the submit directory when the job finishes. In fact, HTCondor will also transfer back changed input files, too. But, this only works for files that are in the top-level sandbox directory, and not for ones contained in subdirectories. What if you want to bring back only some output files, or output files contained in subdirectories? Here is a shell script that creates several files, including a copy of an input file in a new subdirectory: #!/bin/sh if [ $# -ne 1 ] ; then echo \"Usage: $0 INPUT\" ; exit 1 ; fi date > output-timestamp.txt cal > output-calendar.txt mkdir subdirectory cp $1 subdirectory/backup- $1 First, let\u2019s confirm that HTCondor does not bring back the output file (which starts with the prefix backup- ) in the subdirectory: Create a file called output.sh and save the above shell script in this file. Write a submit file that transfers any input file and runs output.sh on it (remember to include an arguments = line and pass the input filename as an argument). Submit the job, wait for it to finish, and examine the contents of your submit directory. Suppose you decide that you want only the timestamp output file and all files in the subdirectory, but not the calendar output file. You can tell HTCondor to only transfer these specific files back to the submission directory using transfer_output_files = : transfer_output_files = output-timestamp.txt, subdirectory/ When using transfer_output_files = , HTCondor will only transfer back the files you name - all other files will be ignored and deleted at the end of a job. Note See the trailing slash ( / ) on the subdirectory? 
That tells HTCondor to transfer back the files contained in the subdirectory, but not the directory itself ; the files will be written directly into the submit directory. If you want HTCondor to transfer back an entire directory, leave off the trailing slash. Remove all output files from the previous run, including output-timestamp.txt and output-calendar.txt . Copy the previous submit file that ran output.sh and add the transfer_output_files line from above. Submit the job, wait for it to finish, and examine the contents of your submit directory. Did it work as you expected? Thinking About Progress So Far \u00b6 At this point, you can do just about everything that you need in order to run jobs on a HTC pool. You can identify the executable, arguments, and input files, and you can get output back from the job. This is a big achievement! References \u00b6 There are many more details about HTCondor\u2019s file transfer mechanism not covered here. For more information, read \"Submitting Jobs Without a Shared Filesystem\" in the HTCondor Manual.","title":"2.1 - Work with input and output files"},{"location":"materials/htcondor/part2-ex1-files/#htc-exercise-21-work-with-input-and-output-files","text":"","title":"HTC Exercise 2.1: Work With Input and Output Files"},{"location":"materials/htcondor/part2-ex1-files/#exercise-goal","text":"The goal of this exercise is make input files available to your job on the execute machine and to return output files back created in your job back to you on the access point. This small change significantly adds to the kinds of jobs that you can run.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex1-files/#viewing-a-job-sandbox","text":"Before you learn to transfer files to and from your job, it is good to understand a bit more about the environment in which your job runs. When the HTCondor starter process prepares to run your job, it creates a new directory for your job and all of its files. We call this directory the job sandbox , because it is your job\u2019s private space to play. Let\u2019s see what is in the job sandbox for a minimal job with no special input or output files. Save the script below in a file named sandbox.sh : #!/bin/sh echo 'Date: ' ` date ` echo 'Host: ' ` hostname ` echo 'Sandbox: ' ` pwd ` ls -alF # END Create a submit file for this script and submit it. When the job finishes, look at the contents of the output file. In the output file, note the Sandbox: line: That is the full path to your job sandbox for the run. It was created just for your job, and it was removed as soon as your job finished. Next, look at the output that appears after the Sandbox: line; it is the output from the ls command in the script. It shows all of the files in your job sandbox, as they existed at the end of the execution of sandbox.sh . The number of files that you see can change depending on the HTC system you are using, but some of the files you should always see are: .chirp.config Configuration for an advanced feature sandbox.sh Your executable .job.ad The job ClassAd .machine.ad The machine ClassAd _condor_stderr Saved standard error from the job _condor_stdout Saved standard output from the job tmp/ , var/tmp/ Directories in which to put temporary files So, HTCondor wrote copies of the job and machine ads (for use by the job, if desired), transferred your executable ( sandbox.sh ), ran it, and saved its standard output and standard error into files. 
Notice that your submit file, which was in the same directory on the access point machine as your executable, was not transferred, nor were any other files that happened to be in directory with the submit file. Now that we know something about the sandbox, we can transfer more files to and from it.","title":"Viewing a Job Sandbox"},{"location":"materials/htcondor/part2-ex1-files/#running-a-job-with-input-files","text":"Next, you will run a job that requires an input file. Remember, the initial job sandbox will contain only the job executable, unless you tell HTCondor explicitly about every other file that needs to be transferred to the job. Here is a Python script that takes the name of an input file (containing one word per line) from the command line, counts the number of times each (lowercased) word occurs in the text, and prints out the final list of words and their counts. #!/usr/bin/env python3 import os import sys if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' , encoding = 'iso-8859-1' ) as my_file : for line in my_file : word = line . strip () . lower () if word in words : words [ word ] += 1 else : words [ word ] = 1 for word in sorted ( words . keys ()): print ( f ' { words [ word ] : 8d } { word } ' ) Create and save the Python script in a file named freq.py . Download the input file for the script (263K lines, ~1.4 MB) and save it in your submit directory: username@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/intro-2.1-words.txt Create a submit file for the freq.py executable. Add a line called transfer_input_files = to tell HTCondor to transfer the input file to the job: transfer_input_files = intro-2.1-words.txt As with all submit file commands, it does not matter where this line goes, as long as it comes before the word queue . Since we want HTCondor to pass an argument to our Python executable, we need to remember to add an arguments = line in our submit file so that HTCondor knows to pass an argument to the job. Set this arguments = line equal to the argument to the Python script (i.e., the name the input file). Submit the job to HTCondor, wait for it to finish, and check the output! If things do not work the first time, keep trying! At this point in the exercises, we are telling you less and less explicitly how to do steps that you have done before. If you get stuck, ask for help in the Slack channel. Note If you want to transfer more than one input file, list all of them on a single transfer_input_files command, separated by commas. For example, if there are three input files: transfer_input_files = a.txt, b.txt, c.txt","title":"Running a Job With Input Files"},{"location":"materials/htcondor/part2-ex1-files/#transferring-output-files","text":"So far, we have relied on programs that send their output to the standard output and error streams, which HTCondor captures, saves, and returns back to the submit directory. But what if your program writes one or more files for its output? How do you tell HTCondor to bring them back? Let\u2019s start by exploring what happens to files that a job creates in the sandbox. We will use a very simple method for creating a new file: we will copy an input file to another name. Find or create a small input file (it is fine to use any small file from a previous exercise). 
Create a submit file that transfers the input file and copies it to another name (as if doing /bin/cp input.txt output.txt on the command line) Make the output filename different than any filenames that are in your submit directory What is the executable line? What is the arguments line? How do you tell HTCondor to transfer the input file? As always, use output , error , and log filenames that are different from previous exercises Submit the job and wait for it to finish. What happened? Can you tell what HTCondor did with the output file that was created (did it end up back on the access point?), after it was created in the job sandbox? Look carefully at the list of files in your submit directory now.","title":"Transferring Output Files"},{"location":"materials/htcondor/part2-ex1-files/#transferring-specific-output-files","text":"As you saw in the last exercise, by default HTCondor transfers files that are created in the job sandbox back to the submit directory when the job finishes. In fact, HTCondor will also transfer back changed input files, too. But, this only works for files that are in the top-level sandbox directory, and not for ones contained in subdirectories. What if you want to bring back only some output files, or output files contained in subdirectories? Here is a shell script that creates several files, including a copy of an input file in a new subdirectory: #!/bin/sh if [ $# -ne 1 ] ; then echo \"Usage: $0 INPUT\" ; exit 1 ; fi date > output-timestamp.txt cal > output-calendar.txt mkdir subdirectory cp $1 subdirectory/backup- $1 First, let\u2019s confirm that HTCondor does not bring back the output file (which starts with the prefix backup- ) in the subdirectory: Create a file called output.sh and save the above shell script in this file. Write a submit file that transfers any input file and runs output.sh on it (remember to include an arguments = line and pass the input filename as an argument). Submit the job, wait for it to finish, and examine the contents of your submit directory. Suppose you decide that you want only the timestamp output file and all files in the subdirectory, but not the calendar output file. You can tell HTCondor to only transfer these specific files back to the submission directory using transfer_output_files = : transfer_output_files = output-timestamp.txt, subdirectory/ When using transfer_output_files = , HTCondor will only transfer back the files you name - all other files will be ignored and deleted at the end of a job. Note See the trailing slash ( / ) on the subdirectory? That tells HTCondor to transfer back the files contained in the subdirectory, but not the directory itself ; the files will be written directly into the submit directory. If you want HTCondor to transfer back an entire directory, leave off the trailing slash. Remove all output files from the previous run, including output-timestamp.txt and output-calendar.txt . Copy the previous submit file that ran output.sh and add the transfer_output_files line from above. Submit the job, wait for it to finish, and examine the contents of your submit directory. Did it work as you expected?","title":"Transferring Specific Output Files"},{"location":"materials/htcondor/part2-ex1-files/#thinking-about-progress-so-far","text":"At this point, you can do just about everything that you need in order to run jobs on a HTC pool. You can identify the executable, arguments, and input files, and you can get output back from the job. 
This is a big achievement!","title":"Thinking About Progress So Far"},{"location":"materials/htcondor/part2-ex1-files/#references","text":"There are many more details about HTCondor\u2019s file transfer mechanism not covered here. For more information, read \"Submitting Jobs Without a Shared Filesystem\" in the HTCondor Manual.","title":"References"},{"location":"materials/htcondor/part2-ex2-queue-n/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.2: Use queue N , $(Cluster), and $(Process) \u00b6 Background \u00b6 Suppose you have a program that you want to run many times with different arguments each time. With what you know so far, you have a couple of choices: Write one submit file; submit one job, change the argument in the submit file, submit another job, change the submit file, \u2026 Write many submit files that are nearly identical except for the program argument Neither of these options seems very satisfying. Fortunately, HTCondor's queue statement is here to help! Exercise Goal \u00b6 The goal of the next several exercises is to learn to submit many jobs from a single HTCondor queue statement, and to control things like filenames and arguments on a per-job basis when doing so. Running Many Jobs With One queue Statement \u00b6 Example Here is a C program that uses a stochastic (random) method to estimate the value of \u03c0. The single argument to the program is the number of samples to take. More samples should result in better estimates! #include #include #include int main ( int argc , char * argv []) { struct timeval my_timeval ; int iterations = 0 ; int inside_circle = 0 ; int i ; double x , y , pi_estimate ; gettimeofday ( & my_timeval , NULL ); srand48 ( my_timeval . tv_sec ^ my_timeval . tv_usec ); if ( argc == 2 ) { iterations = atoi ( argv [ 1 ]); } else { printf ( \"usage: circlepi ITERATIONS \\n \" ); exit ( 1 ); } for ( i = 0 ; i < iterations ; i ++ ) { x = ( drand48 () - 0.5 ) * 2.0 ; y = ( drand48 () - 0.5 ) * 2.0 ; if ((( x * x ) + ( y * y )) <= 1.0 ) { inside_circle ++ ; } } pi_estimate = 4.0 * (( double ) inside_circle / ( double ) iterations ); printf ( \"%d iterations, %d inside; pi = %f \\n \" , iterations , inside_circle , pi_estimate ); return 0 ; } In a new directory for this exercise, create and save the code to a file named circlepi.c Compile the code (we will cover this in more detail during the Software lecture): username@ap1 $ gcc -o circlepi circlepi.c Test the program with just 1000 samples: username@ap1 $ ./circlepi 1000 Now suppose that you want to run the program many times, to produce many estimates. To do so, we can tell HTCondor how many jobs to \"queue up\" via the queue statement we've been putting at the end of each of our submit files. Let\u2019s see how it works: Write a normal submit file for this program Pass 1 million ( 1000000 ) as the command line argument to circlepi Make sure to include log , output , and error (with filenames like circlepi.log ), and request_* lines At the end of the file, write queue 3 instead of just queue (\"queue 3 jobs\" vs. \"queue a job\"). Submit the file. Note the slightly different message from condor_submit : 3 job(s) submitted to cluster *NNNN*. 
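For reference, a minimal sketch of what such a submit file might look like (the resource requests and most filenames here are assumptions, not part of the exercise; adjust them to your own setup):
executable = circlepi
arguments = 1000000
log = circlepi.log
output = circlepi.out
error = circlepi.err
request_cpus = 1
request_memory = 100MB
request_disk = 10MB
queue 3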
Before the jobs execute, look at the job queue to see the multiple jobs Here is some sample condor_q -nobatch output: ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 10228.0 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.1 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.2 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 In this sample, all three jobs are part of cluster 10228 , but the first job was assigned process 0 , the second job was assigned process 1 , and the third one was assigned process 2 . (Programmers like to start counting from 0.) Now we can understand what the first column in the output, the job ID , represents. It is a job\u2019s cluster number , a dot ( . ), and the job\u2019s process number . So in the example above, the job ID of the second job is 10228.1 . Pop Quiz: Do you remember how to ask HTCondor's queue to list the status of all of the jobs from one cluster? How about one specific job ID? Using queue N With Output \u00b6 When all three jobs in your single cluster are finished, examine the resulting files. What is in the output file? What is in the error file? (hopefully it is empty!) What is in the log file? Look carefully at the job IDs in each event. Is this what you expected? Is it what you wanted? If the output is not what you expected, what do you think happened? Using $(Process) to Distinguish Jobs \u00b6 As you saw with the experiment above, each job ended up overwriting the same output and error filenames in the submission directory. After all, we didn't tell it to behave any differently when it ran three jobs. We need a way to separate output (and error) files per job that is queued , not just for the whole cluster of jobs. Fortunately, HTCondor has a way to separate the files easily. When processing a submit file, HTCondor will replace any instance of $(Process) with the process number of the job, for each job that is queued. For example, you can use the $(Process) variable to define a separate output file name for each job: output = my-output-file-$(Process).out queue 10 Even though the output filename is defined only once, HTCondor will create separate output filenames for each job: First job my-output-file-0.out Second job my-output-file-1.out Third job my-output-file-2.out ... ... Last (tenth) job my-output-file-9.out Let\u2019s see how this works for our program that estimates \u03c0. In your submit file, change the definitions of output and error to use $(Process) in the filename, similar to the example above. Delete any standard output, standard error, and log files from previous runs. Submit the updated file. When all three jobs are finished, examine the resulting files again. How many files are there of each type? What are their names? Is this what you expected? Is it what you wanted from the \u03c0 estimation process? Using $(Cluster) to Separate Files Across Runs \u00b6 With $(Process) , you can get separate output (and error) filenames for each job within a run. However, the next time you submit the same file, all of the output and error files are overwritten by new ones created by the new jobs. Maybe this is the behavior that you want. But sometimes, you may want to separate files by run, as well. In addition to $(Process) , there is also a $(Cluster) variable that you can use in your submit files. It works just like $(Process) , except it is replaced with the cluster number of the entire submission. 
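For example, a statement like log = job-$(Cluster).log (a hypothetical filename) would give every job from a single condor_submit the same log file, named after that submission's cluster number.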
Because the cluster number is the same for all jobs within a single submission, it does not separate files by job within a submission. But when used with $(Process) , it can be used to separate files by run. For example, consider this output statement: output = my-output-file-$(Cluster)-$(Process).out For one particular run, it might result in output filenames like my-output-file-2444-0.out , myoutput-file-2444-1.out , myoutput-file-2444-2.out , etc. However, the next run would have different filenames, replacing 2444 with the new Cluster number of that run. Using $(Process) and $(Cluster) in Other Statements \u00b6 The $(Cluster) and $(Process) variables can be used in any submit file statement, although they are useful in some kinds of submit file statements and not really for others. For example, consider using $(Cluster) or $(Process) in each of the below: log transfer_input_files transfer_output_files arguments Unfortunately, HTCondor does not easily let you perform math on the $(Process) number when using it. So, for example, if you use $(Process) as a numeric argument to a command, it will always result in jobs getting the arguments 0, 1, 2, and so on. If you have control over your program and the way in which it uses command-line arguments, then you are fine. Otherwise, you might need a solution like those in the next exercises. (Optional) Defining JobBatchName for Tracking \u00b6 It is possible to define arbitrary attributes in your submit file, and that one purpose of such attributes is to track or report on different jobs separately. In this optional exercise, you will see how this technique can be used. Once again, we will use sleep jobs, so that your jobs remain in the queue long enough to experiment on. Create a submit file that runs sleep 120 . Instead of a single queue statement, write this: jobbatchname = 1 queue 5 Submit the submit file to HTCondor. Now, quickly edit the submit file to instead say: jobbatchname = 2 Submit the file again. Check on the submissions using a normal condor_q and condor_q -nobatch . Of course, your special attribute does not appear in the condor_q -nobatch output, but it is present in the condor_q output and in each job\u2019s ClassAd. You can see the effect of the attribute by limiting your condor_q output to one type of job or another. First, run this command: username@ap1 $ condor_q -constraint 'JobBatchName == \"1\"' Do you get the output that you expected? Using the example command above, how would you list your other five jobs? (There will be more on how to use HTCondor constraints in later exercises.)","title":"2.2 - Use queue N, $(Cluster), and $(Process)"},{"location":"materials/htcondor/part2-ex2-queue-n/#htc-exercise-22-use-queue-n-cluster-and-process","text":"","title":"HTC Exercise 2.2: Use queue N, $(Cluster), and $(Process)"},{"location":"materials/htcondor/part2-ex2-queue-n/#background","text":"Suppose you have a program that you want to run many times with different arguments each time. With what you know so far, you have a couple of choices: Write one submit file; submit one job, change the argument in the submit file, submit another job, change the submit file, \u2026 Write many submit files that are nearly identical except for the program argument Neither of these options seems very satisfying. 
Fortunately, HTCondor's queue statement is here to help!","title":"Background"},{"location":"materials/htcondor/part2-ex2-queue-n/#exercise-goal","text":"The goal of the next several exercises is to learn to submit many jobs from a single HTCondor queue statement, and to control things like filenames and arguments on a per-job basis when doing so.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex2-queue-n/#running-many-jobs-with-one-queue-statement","text":"Example Here is a C program that uses a stochastic (random) method to estimate the value of \u03c0. The single argument to the program is the number of samples to take. More samples should result in better estimates! #include #include #include int main ( int argc , char * argv []) { struct timeval my_timeval ; int iterations = 0 ; int inside_circle = 0 ; int i ; double x , y , pi_estimate ; gettimeofday ( & my_timeval , NULL ); srand48 ( my_timeval . tv_sec ^ my_timeval . tv_usec ); if ( argc == 2 ) { iterations = atoi ( argv [ 1 ]); } else { printf ( \"usage: circlepi ITERATIONS \\n \" ); exit ( 1 ); } for ( i = 0 ; i < iterations ; i ++ ) { x = ( drand48 () - 0.5 ) * 2.0 ; y = ( drand48 () - 0.5 ) * 2.0 ; if ((( x * x ) + ( y * y )) <= 1.0 ) { inside_circle ++ ; } } pi_estimate = 4.0 * (( double ) inside_circle / ( double ) iterations ); printf ( \"%d iterations, %d inside; pi = %f \\n \" , iterations , inside_circle , pi_estimate ); return 0 ; } In a new directory for this exercise, create and save the code to a file named circlepi.c Compile the code (we will cover this in more detail during the Software lecture): username@ap1 $ gcc -o circlepi circlepi.c Test the program with just 1000 samples: username@ap1 $ ./circlepi 1000 Now suppose that you want to run the program many times, to produce many estimates. To do so, we can tell HTCondor how many jobs to \"queue up\" via the queue statement we've been putting at the end of each of our submit files. Let\u2019s see how it works: Write a normal submit file for this program Pass 1 million ( 1000000 ) as the command line argument to circlepi Make sure to include log , output , and error (with filenames like circlepi.log ), and request_* lines At the end of the file, write queue 3 instead of just queue (\"queue 3 jobs\" vs. \"queue a job\"). Submit the file. Note the slightly different message from condor_submit : 3 job(s) submitted to cluster *NNNN*. Before the jobs execute, look at the job queue to see the multiple jobs Here is some sample condor_q -nobatch output: ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 10228.0 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.1 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 10228.2 cat 7/25 11:57 0+00:00:00 I 0 0.7 circlepi 1000000000 In this sample, all three jobs are part of cluster 10228 , but the first job was assigned process 0 , the second job was assigned process 1 , and the third one was assigned process 2 . (Programmers like to start counting from 0.) Now we can understand what the first column in the output, the job ID , represents. It is a job\u2019s cluster number , a dot ( . ), and the job\u2019s process number . So in the example above, the job ID of the second job is 10228.1 . Pop Quiz: Do you remember how to ask HTCondor's queue to list the status of all of the jobs from one cluster? 
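If you need a reminder: condor_q accepts a cluster number as an argument, so condor_q 10228 (the sample cluster number from above) would list only the jobs in that cluster.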
How about one specific job ID?","title":"Running Many Jobs With One queue Statement"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-queue-n-with-output","text":"When all three jobs in your single cluster are finished, examine the resulting files. What is in the output file? What is in the error file? (hopefully it is empty!) What is in the log file? Look carefully at the job IDs in each event. Is this what you expected? Is it what you wanted? If the output is not what you expected, what do you think happened?","title":"Using queue N With Output"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-process-to-distinguish-jobs","text":"As you saw with the experiment above, each job ended up overwriting the same output and error filenames in the submission directory. After all, we didn't tell it to behave any differently when it ran three jobs. We need a way to separate output (and error) files per job that is queued , not just for the whole cluster of jobs. Fortunately, HTCondor has a way to separate the files easily. When processing a submit file, HTCondor will replace any instance of $(Process) with the process number of the job, for each job that is queued. For example, you can use the $(Process) variable to define a separate output file name for each job: output = my-output-file-$(Process).out queue 10 Even though the output filename is defined only once, HTCondor will create separate output filenames for each job: First job my-output-file-0.out Second job my-output-file-1.out Third job my-output-file-2.out ... ... Last (tenth) job my-output-file-9.out Let\u2019s see how this works for our program that estimates \u03c0. In your submit file, change the definitions of output and error to use $(Process) in the filename, similar to the example above. Delete any standard output, standard error, and log files from previous runs. Submit the updated file. When all three jobs are finished, examine the resulting files again. How many files are there of each type? What are their names? Is this what you expected? Is it what you wanted from the \u03c0 estimation process?","title":"Using $(Process) to Distinguish Jobs"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-cluster-to-separate-files-across-runs","text":"With $(Process) , you can get separate output (and error) filenames for each job within a run. However, the next time you submit the same file, all of the output and error files are overwritten by new ones created by the new jobs. Maybe this is the behavior that you want. But sometimes, you may want to separate files by run, as well. In addition to $(Process) , there is also a $(Cluster) variable that you can use in your submit files. It works just like $(Process) , except it is replaced with the cluster number of the entire submission. Because the cluster number is the same for all jobs within a single submission, it does not separate files by job within a submission. But when used with $(Process) , it can be used to separate files by run. For example, consider this output statement: output = my-output-file-$(Cluster)-$(Process).out For one particular run, it might result in output filenames like my-output-file-2444-0.out , myoutput-file-2444-1.out , myoutput-file-2444-2.out , etc. 
However, the next run would have different filenames, replacing 2444 with the new Cluster number of that run.","title":"Using $(Cluster) to Separate Files Across Runs"},{"location":"materials/htcondor/part2-ex2-queue-n/#using-process-and-cluster-in-other-statements","text":"The $(Cluster) and $(Process) variables can be used in any submit file statement, although they are useful in some kinds of submit file statements and not really for others. For example, consider using $(Cluster) or $(Process) in each of the below: log transfer_input_files transfer_output_files arguments Unfortunately, HTCondor does not easily let you perform math on the $(Process) number when using it. So, for example, if you use $(Process) as a numeric argument to a command, it will always result in jobs getting the arguments 0, 1, 2, and so on. If you have control over your program and the way in which it uses command-line arguments, then you are fine. Otherwise, you might need a solution like those in the next exercises.","title":"Using $(Process) and $(Cluster) in Other Statements"},{"location":"materials/htcondor/part2-ex2-queue-n/#optional-defining-jobbatchname-for-tracking","text":"It is possible to define arbitrary attributes in your submit file, and that one purpose of such attributes is to track or report on different jobs separately. In this optional exercise, you will see how this technique can be used. Once again, we will use sleep jobs, so that your jobs remain in the queue long enough to experiment on. Create a submit file that runs sleep 120 . Instead of a single queue statement, write this: jobbatchname = 1 queue 5 Submit the submit file to HTCondor. Now, quickly edit the submit file to instead say: jobbatchname = 2 Submit the file again. Check on the submissions using a normal condor_q and condor_q -nobatch . Of course, your special attribute does not appear in the condor_q -nobatch output, but it is present in the condor_q output and in each job\u2019s ClassAd. You can see the effect of the attribute by limiting your condor_q output to one type of job or another. First, run this command: username@ap1 $ condor_q -constraint 'JobBatchName == \"1\"' Do you get the output that you expected? Using the example command above, how would you list your other five jobs? (There will be more on how to use HTCondor constraints in later exercises.)","title":"(Optional) Defining JobBatchName for Tracking"},{"location":"materials/htcondor/part2-ex3-queue-from/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } HTC Exercise 2.3: Submit with \u201cqueue from\u201d \u00b6 Exercise Goals \u00b6 In this exercise and the next one, you will explore more ways to use a single submit file to submit many jobs . The goal of this exercise is to submit many jobs from a single submit file by using the queue ... from syntax to read variable values from a file. Background \u00b6 In all cases of submitting many jobs from a single submit file, the key questions are: What makes each job unique? In other words, there is one job per _____? So, how should you tell HTCondor to distinguish each job? For queue *N* , jobs are distinguished simply by the built-in \"process\" variable. But with the remaining queue forms, you help HTCondor distinguish jobs by other, more meaningful custom variables. Counting Words in Files \u00b6 Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. 
As mentioned in the lecture, HTCondor provides many ways to submit jobs for this task. You could create a separate submit file for each book, and submit all of the files manually, but you'd have a lot of file lines to modify each time (in particular, all five of the last lines before queue below): executable = freq.py request_memory = 1GB request_disk = 20MB should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = AAiW.txt arguments = AAiW.txt output = AAiW.out error = AAiW.err log = AAiW.log queue This would be overly verbose and tedious. Let's do better. Queue Jobs From a List of Values \u00b6 Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the most common N words of a document. However, we want to experiment with different values of N . For this analysis, we will have a new version of the word-frequency counting script. First, we need a new version of the word counting program so that it accepts an extra number as a command line argument and outputs only that many of the most common words. Here is the new code (it's still not important that you understand this code): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 3 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA NUM_WORDS' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] num_words = int ( sys . argv [ 2 ]) words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words [ - num_words :]: print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename: Save the script as wordcount-top-n.py . Download and unpack some books from Project Gutenberg: user@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/books.zip user@ap1 $ unzip books.zip Create a new submit file (or base it off a previous one!) named wordcount-top.sub , including memory and disk requests of 20 MB. All of the jobs will use the same executable and log statements. Update other statements to work with two variables, book and n : output = $(book)_top_$(n).out error = $(book)_top_$(n).err transfer_input_files = $(book) arguments = \"$(book) $(n)\" queue book, n from books_n.txt Note especially the changes to the queue statement; it now tells HTCondor to read a separate text file of pairs of values, which will be assigned to book and n respectively. Create the separate text file of job variable values and save it as books_n.txt : AAiW.txt, 10 AAiW.txt, 25 AAiW.txt, 50 PandP.txt, 10 PandP.txt, 25 PandP.txt, 50 TAoSH.txt, 10 TAoSH.txt, 25 TAoSH.txt, 50 Note that we used 3 different values for n for each book. Submit the file Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created? Extra Challenge 1 \u00b6 You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., AAiW.txt ), the output filenames contain with multiple extensions (e.g., AAiW.txt.err ). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. 
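The challenge below asks you to clean this up. One possible approach (a sketch, not the only solution): put just the base names (e.g., AAiW) in your variable file and add the .txt extension back in the submit file wherever the real filename is needed, for example
transfer_input_files = $(book).txt
arguments = \"$(book).txt $(n)\"
output = $(book)_top_$(n).out
error = $(book)_top_$(n).err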
Change your submit file and variable file for this exercise so that the output filenames do not include the .txt extension.","title":"2.3 - Use queue from with custom variables"},{"location":"materials/htcondor/part2-ex3-queue-from/#htc-exercise-23-submit-with-queue-from","text":"","title":"HTC Exercise 2.3: Submit with \u201cqueue from\u201d"},{"location":"materials/htcondor/part2-ex3-queue-from/#exercise-goals","text":"In this exercise and the next one, you will explore more ways to use a single submit file to submit many jobs . The goal of this exercise is to submit many jobs from a single submit file by using the queue ... from syntax to read variable values from a file.","title":"Exercise Goals"},{"location":"materials/htcondor/part2-ex3-queue-from/#background","text":"In all cases of submitting many jobs from a single submit file, the key questions are: What makes each job unique? In other words, there is one job per _____? So, how should you tell HTCondor to distinguish each job? For queue *N* , jobs are distinguished simply by the built-in \"process\" variable. But with the remaining queue forms, you help HTCondor distinguish jobs by other, more meaningful custom variables.","title":"Background"},{"location":"materials/htcondor/part2-ex3-queue-from/#counting-words-in-files","text":"Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. As mentioned in the lecture, HTCondor provides many ways to submit jobs for this task. You could create a separate submit file for each book, and submit all of the files manually, but you'd have a lot of file lines to modify each time (in particular, all five of the last lines before queue below): executable = freq.py request_memory = 1GB request_disk = 20MB should_transfer_files = YES when_to_transfer_output = ON_EXIT transfer_input_files = AAiW.txt arguments = AAiW.txt output = AAiW.out error = AAiW.err log = AAiW.log queue This would be overly verbose and tedious. Let's do better.","title":"Counting Words in Files"},{"location":"materials/htcondor/part2-ex3-queue-from/#queue-jobs-from-a-list-of-values","text":"Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the most common N words of a document. However, we want to experiment with different values of N . For this analysis, we will have a new version of the word-frequency counting script. First, we need a new version of the word counting program so that it accepts an extra number as a command line argument and outputs only that many of the most common words. Here is the new code (it's still not important that you understand this code): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 3 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA NUM_WORDS' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] num_words = int ( sys . argv [ 2 ]) words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words [ - num_words :]: print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename: Save the script as wordcount-top-n.py . 
Download and unpack some books from Project Gutenberg: user@ap1 $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool20/books.zip user@ap1 $ unzip books.zip Create a new submit file (or base it off a previous one!) named wordcount-top.sub , including memory and disk requests of 20 MB. All of the jobs will use the same executable and log statements. Update other statements to work with two variables, book and n : output = $(book)_top_$(n).out error = $(book)_top_$(n).err transfer_input_files = $(book) arguments = \"$(book) $(n)\" queue book, n from books_n.txt Note especially the changes to the queue statement; it now tells HTCondor to read a separate text file of pairs of values, which will be assigned to book and n respectively. Create the separate text file of job variable values and save it as books_n.txt : AAiW.txt, 10 AAiW.txt, 25 AAiW.txt, 50 PandP.txt, 10 PandP.txt, 25 PandP.txt, 50 TAoSH.txt, 10 TAoSH.txt, 25 TAoSH.txt, 50 Note that we used 3 different values for n for each book. Submit the file Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created?","title":"Queue Jobs From a List of Values"},{"location":"materials/htcondor/part2-ex3-queue-from/#extra-challenge-1","text":"You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., AAiW.txt ), the output filenames contain with multiple extensions (e.g., AAiW.txt.err ). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. Change your submit file and variable file for this exercise so that the output filenames do not include the .txt extension.","title":"Extra Challenge 1"},{"location":"materials/htcondor/part2-ex4-queue-matching/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus HTC Exercise 2.4: Submit With \u201cqueue matching\u201d \u00b6 Exercise Goal \u00b6 The goal of this exercise is to submit many jobs from a single submit file by using the queue ... matching syntax to submit jobs with variable values derived from files in the current directory which match a specified pattern. Counting Words in Files \u00b6 Returning to our book word-counting example, let's pretend that instead of three books, we have an entire library. While we could list all of the text files in a books.txt file and use queue book from books.txt , it could be a tedious process, especially for tens of thousands of files. Luckily HTCondor provides a mechanism for submitting jobs based on pattern-matched files. Queue Jobs By Matching Filenames \u00b6 This is an example of a common scenario: We want to run one job per file, where the filenames match a certain consistent pattern. The queue ... matching statement is made for this scenario. Let\u2019s see this in action. First, here is a new version of the script (note, we removed the 'top n words' restriction): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . 
itemgetter ( 1 )) for word in sorted_words : print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To use the script: Create and save this script as wordcount.py . Verify the script by running it on one book manually. Create a new submit file to submit one job (pick a book file and model your submit file off of the one above) Modify the following submit file statements to work for all books: transfer_input_files = $(book) arguments = $(book) output = $(book).out error = $(book).err queue book matching *.txt Note As always, the order of statements in a submit file does not matter, except that the queue statement should be last. Also note that any submit file variable name (here, book , but true for process and all others) may be used in any mixture of upper- and lowercase letters. Submit the jobs. HTCondor uses the queue ... matching statement to look for files in the submit directory that match the given pattern, then queues one job per match. For each job, the given variable (e.g., book here) is assigned the name of the matching file, so that it can be used in output , error , and other statements. The result is the same as if we had written out a much longer submit file: ... transfer_input_files = AAiW.txt arguments = \"AAiW.txt\" output = AAiW.txt.out error = AAiW.txt.err queue transfer_input_files = PandP.txt arguments = \"PandP.txt\" output = PandP.txt.out error = PandP.txt.err queue transfer_input_files = TAoSH.txt arguments = \"TAoSH.txt\" output = TAoSH.txt.out error = TAoSH.txt.err queue ... How many jobs were created? Is this what you expected? If you ran this in the same directory as Exercise 2.3, you may have noticed that a job was submitted for the books_n.txt file that holds the variable values in the queue from statement. Beware the dangers of matching more files than intended! One solution may be to put all of the books into an books directory and queue matching books/*.txt . Can you think of other solutions? If you have time, try one! Extra Challenge 1 \u00b6 In the example above, you used a single log file for all three jobs. HTCondor handles this situation with no problem; each job writes its events into the log file without getting in the way of other events and other jobs. But as you may have seen, it may be difficult for a person to understand the events for any particular job in the combined log file. Create a new submit file that works just like the one above, except that each job writes its own log file. Extra Challenge 2 \u00b6 Between this exercise and the previous one, you have explored two of the three primary queue statements. How would you use the queue in ... list statement to accomplish the same thing(s) as one or both of the exercises?","title":"Bonus Exercise 2.4 - Use queue matching with a custom variable"},{"location":"materials/htcondor/part2-ex4-queue-matching/#bonus-htc-exercise-24-submit-with-queue-matching","text":"","title":"Bonus HTC Exercise 2.4: Submit With \u201cqueue matching\u201d"},{"location":"materials/htcondor/part2-ex4-queue-matching/#exercise-goal","text":"The goal of this exercise is to submit many jobs from a single submit file by using the queue ... matching syntax to submit jobs with variable values derived from files in the current directory which match a specified pattern.","title":"Exercise Goal"},{"location":"materials/htcondor/part2-ex4-queue-matching/#counting-words-in-files","text":"Returning to our book word-counting example, let's pretend that instead of three books, we have an entire library. 
While we could list all of the text files in a books.txt file and use queue book from books.txt , it could be a tedious process, especially for tens of thousands of files. Luckily HTCondor provides a mechanism for submitting jobs based on pattern-matched files.","title":"Counting Words in Files"},{"location":"materials/htcondor/part2-ex4-queue-matching/#queue-jobs-by-matching-filenames","text":"This is an example of a common scenario: We want to run one job per file, where the filenames match a certain consistent pattern. The queue ... matching statement is made for this scenario. Let\u2019s see this in action. First, here is a new version of the script (note, we removed the 'top n words' restriction): #!/usr/bin/env python3 import os import sys import operator if len ( sys . argv ) != 2 : print ( f 'Usage: { os . path . basename ( sys . argv [ 0 ]) } DATA' ) sys . exit ( 1 ) input_filename = sys . argv [ 1 ] words = {} with open ( input_filename , 'r' ) as my_file : for line in my_file : line_words = line . split () for word in line_words : if word in words : words [ word ] += 1 else : words [ word ] = 1 sorted_words = sorted ( words . items (), key = operator . itemgetter ( 1 )) for word in sorted_words : print ( f ' { word [ 0 ] } { word [ 1 ] : 8d } ' ) To use the script: Create and save this script as wordcount.py . Verify the script by running it on one book manually. Create a new submit file to submit one job (pick a book file and model your submit file off of the one above) Modify the following submit file statements to work for all books: transfer_input_files = $(book) arguments = $(book) output = $(book).out error = $(book).err queue book matching *.txt Note As always, the order of statements in a submit file does not matter, except that the queue statement should be last. Also note that any submit file variable name (here, book , but true for process and all others) may be used in any mixture of upper- and lowercase letters. Submit the jobs. HTCondor uses the queue ... matching statement to look for files in the submit directory that match the given pattern, then queues one job per match. For each job, the given variable (e.g., book here) is assigned the name of the matching file, so that it can be used in output , error , and other statements. The result is the same as if we had written out a much longer submit file: ... transfer_input_files = AAiW.txt arguments = \"AAiW.txt\" output = AAiW.txt.out error = AAiW.txt.err queue transfer_input_files = PandP.txt arguments = \"PandP.txt\" output = PandP.txt.out error = PandP.txt.err queue transfer_input_files = TAoSH.txt arguments = \"TAoSH.txt\" output = TAoSH.txt.out error = TAoSH.txt.err queue ... How many jobs were created? Is this what you expected? If you ran this in the same directory as Exercise 2.3, you may have noticed that a job was submitted for the books_n.txt file that holds the variable values in the queue from statement. Beware the dangers of matching more files than intended! One solution may be to put all of the books into an books directory and queue matching books/*.txt . Can you think of other solutions? If you have time, try one!","title":"Queue Jobs By Matching Filenames"},{"location":"materials/htcondor/part2-ex4-queue-matching/#extra-challenge-1","text":"In the example above, you used a single log file for all three jobs. HTCondor handles this situation with no problem; each job writes its events into the log file without getting in the way of other events and other jobs. 
But as you may have seen, it may be difficult for a person to understand the events for any particular job in the combined log file. Create a new submit file that works just like the one above, except that each job writes its own log file.","title":"Extra Challenge 1"},{"location":"materials/htcondor/part2-ex4-queue-matching/#extra-challenge-2","text":"Between this exercise and the previous one, you have explored two of the three primary queue statements. How would you use the queue in ... list statement to accomplish the same thing(s) as one or both of the exercises?","title":"Extra Challenge 2"},{"location":"materials/osg/part1-ex1-login-scp/","text":"OSG Exercise 1.1: Log In to the OSPool Access Point \u00b6 The main goal of this exercise is to log in to an Open Science Pool Access Point so that you can start submitting jobs into the OSPool. But before doing that, you will first prepare a file on Monday\u2018s Access Point to copy to the OSPool Access Point. Then you will learn how to efficiently copy files between the Access Points. If you have trouble getting ssh access to the OSPool Access Point, ask the instructors right away! Gaining access is critical for all remaining exercises. Part 1: On the PATh Access Point \u00b6 The first few sections below are to be completed on ap1.facility.path-cc.io , the PATh Access Point. This is still the same Access Point you have been using since yesterday. Preparing files for transfer \u00b6 When transferring files between computers, it\u2019s best to limit the number of files as well as their size. Smaller files transfer more quickly and, if your network connection fails, restarting the transfer is less painful than it would be if you were transferring large files. Archiving tools (WinZip, 7zip, Archive Utility, etc.) can compress the size of your files and place them into a single, smaller archive file. The Unix tar command is a one-stop shop for creating, extracting, and viewing the contents of tar archives (called tarballs ). Its usage is as follows: To create a tarball named containing , use the following command: $ tar -czvf Where should end in .tar.gz and can be a list of any number of files and/or folders, separated by spaces. To extract the files from a tarball into the current directory: $ tar -xzvf To list the files within a tarball: $ tar -tzvf Comparing compressed sizes \u00b6 You can adjust the level of compression of tar by prepending your command with GZIP=-- , where can be either fast for the least compression, or best for the most compression (the default compression is between best and fast ). While still logged in to ap1.facility.path-cc.io : Create and change into a new folder for this exercise, for example osg-ex11 Use wget to download the following files from our web server: Text file: http://proxy.chtc.wisc.edu/SQUID/osgschool21/random_text Archive: http://proxy.chtc.wisc.edu/SQUID/osgschool21/pdbaa.tar.gz Image: http://proxy.chtc.wisc.edu/SQUID/osgschool21/obligatory_cat.jpg Use tar on each file and use ls -l to compare the sizes of the original file and the compressed version. Which files were compressed the least? Why? Part 2: On the Open Science Pool Access Point \u00b6 For many of the remaining exercises, you will be using an OSPool Access Point, ap40.uw.osg-htc.org , which submits jobs into the OSPool. To log in to the OSPool Access Point, use the same username (and SSH key, if you did that) as on ap1 . If you have any issues logging in to ap40.uw.osg-htc.org , please ask for help right away! 
So please ssh in to the server and take a look around: Log in using ssh USERNAME@ap40.uw.osg-htc.org (substitute your own username) Try some Linux and HTCondor commands; for example: Linux commands: hostname , pwd , ls , and so on What is the operating system? uname and (in this case) cat /etc/redhat-release HTCondor commands: condor_version , condor_q , condor_status -total Transferring files \u00b6 In the next exercise, you will submit the same kind of job as in the previous exercise. Wouldn\u2019t it be nice to copy the files instead of starting from scratch? And in general, being able to copy files between servers is helpful, so let\u2019s explore a way to do that. Using secure copy \u00b6 Secure copy ( scp ) is a command based on SSH that lets you securely copy files between two different servers. It takes similar arguments to the Unix cp command but also takes additional information about servers. Its general form is like this: scp ... [username@]: may be omitted if you want to copy your sources to your remote home directory and [username@] may be omitted if your usernames are the same across both servers. For example, if you are logged in to ap40.uw.osg-htc.org and wanted to copy the file foo from your current directory to your home directory on ap1.facility.path-cc.io , and if your usernames are the same on both servers, the command would look like this: $ scp foo ap1.facility.path-cc.io: Additionally, you could pull files from ap1.facility.path-cc.io to ap40.uw.osg-htc.org . The following command copies bar from your home directory on ap1.facility.path-cc.io to your current directory on ap40.uw.osg-htc.org ; and in this case, the username for ap1 is specified: $ scp USERNAME@ap1.facility.path-cc.io:bar . Also, you can copy folders between servers using the -r option. If you kept all your files from the HTCondor exercise 1.3 in a folder named htc-1.3 on ap1.facility.path-cc.io , you could use the following command to copy them to your home directory on ap40.uw.osg-htc.org : $ scp -r USERNAME@ap1.facility.path-cc.io:htc-1.3 . Secure copy to your laptop \u00b6 During your research, you may need to transfer output files from your submit server to inspect them on your personal computer, which can also be done with scp ! To use scp on your laptop, follow the instructions relevant to your computer\u2018s operating system: Mac and Linux users \u00b6 scp should be included by default and available via the terminal on both Mac and Linux operating systems. Windows users \u00b6 WinSCP is an scp client for Windows operating systems. Install WinSCP from https://winscp.net/eng/index.php Next exercise \u00b6 Once completed, move onto the next exercise: Running jobs in the OSG","title":"1.1 - Log in to the OSPool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#osg-exercise-11-log-in-to-the-ospool-access-point","text":"The main goal of this exercise is to log in to an Open Science Pool Access Point so that you can start submitting jobs into the OSPool. But before doing that, you will first prepare a file on Monday\u2018s Access Point to copy to the OSPool Access Point. Then you will learn how to efficiently copy files between the Access Points. If you have trouble getting ssh access to the OSPool Access Point, ask the instructors right away! 
Gaining access is critical for all remaining exercises.","title":"OSG Exercise 1.1: Log In to the OSPool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#part-1-on-the-path-access-point","text":"The first few sections below are to be completed on ap1.facility.path-cc.io , the PATh Access Point. This is still the same Access Point you have been using since yesterday.","title":"Part 1: On the PATh Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#preparing-files-for-transfer","text":"When transferring files between computers, it\u2019s best to limit the number of files as well as their size. Smaller files transfer more quickly and, if your network connection fails, restarting the transfer is less painful than it would be if you were transferring large files. Archiving tools (WinZip, 7zip, Archive Utility, etc.) can compress the size of your files and place them into a single, smaller archive file. The Unix tar command is a one-stop shop for creating, extracting, and viewing the contents of tar archives (called tarballs ). Its usage is as follows: To create a tarball named containing , use the following command: $ tar -czvf Where should end in .tar.gz and can be a list of any number of files and/or folders, separated by spaces. To extract the files from a tarball into the current directory: $ tar -xzvf To list the files within a tarball: $ tar -tzvf ","title":"Preparing files for transfer"},{"location":"materials/osg/part1-ex1-login-scp/#comparing-compressed-sizes","text":"You can adjust the level of compression of tar by prepending your command with GZIP=-- , where can be either fast for the least compression, or best for the most compression (the default compression is between best and fast ). While still logged in to ap1.facility.path-cc.io : Create and change into a new folder for this exercise, for example osg-ex11 Use wget to download the following files from our web server: Text file: http://proxy.chtc.wisc.edu/SQUID/osgschool21/random_text Archive: http://proxy.chtc.wisc.edu/SQUID/osgschool21/pdbaa.tar.gz Image: http://proxy.chtc.wisc.edu/SQUID/osgschool21/obligatory_cat.jpg Use tar on each file and use ls -l to compare the sizes of the original file and the compressed version. Which files were compressed the least? Why?","title":"Comparing compressed sizes"},{"location":"materials/osg/part1-ex1-login-scp/#part-2-on-the-open-science-pool-access-point","text":"For many of the remaining exercises, you will be using an OSPool Access Point, ap40.uw.osg-htc.org , which submits jobs into the OSPool. To log in to the OSPool Access Point, use the same username (and SSH key, if you did that) as on ap1 . If you have any issues logging in to ap40.uw.osg-htc.org , please ask for help right away! So please ssh in to the server and take a look around: Log in using ssh USERNAME@ap40.uw.osg-htc.org (substitute your own username) Try some Linux and HTCondor commands; for example: Linux commands: hostname , pwd , ls , and so on What is the operating system? uname and (in this case) cat /etc/redhat-release HTCondor commands: condor_version , condor_q , condor_status -total","title":"Part 2: On the Open Science Pool Access Point"},{"location":"materials/osg/part1-ex1-login-scp/#transferring-files","text":"In the next exercise, you will submit the same kind of job as in the previous exercise. Wouldn\u2019t it be nice to copy the files instead of starting from scratch? 
And in general, being able to copy files between servers is helpful, so let\u2019s explore a way to do that.","title":"Transferring files"},{"location":"materials/osg/part1-ex1-login-scp/#using-secure-copy","text":"Secure copy ( scp ) is a command based on SSH that lets you securely copy files between two different servers. It takes similar arguments to the Unix cp command but also takes additional information about servers. Its general form is like this: scp ... [username@]: may be omitted if you want to copy your sources to your remote home directory and [username@] may be omitted if your usernames are the same across both servers. For example, if you are logged in to ap40.uw.osg-htc.org and wanted to copy the file foo from your current directory to your home directory on ap1.facility.path-cc.io , and if your usernames are the same on both servers, the command would look like this: $ scp foo ap1.facility.path-cc.io: Additionally, you could pull files from ap1.facility.path-cc.io to ap40.uw.osg-htc.org . The following command copies bar from your home directory on ap1.facility.path-cc.io to your current directory on ap40.uw.osg-htc.org ; and in this case, the username for ap1 is specified: $ scp USERNAME@ap1.facility.path-cc.io:bar . Also, you can copy folders between servers using the -r option. If you kept all your files from the HTCondor exercise 1.3 in a folder named htc-1.3 on ap1.facility.path-cc.io , you could use the following command to copy them to your home directory on ap40.uw.osg-htc.org : $ scp -r USERNAME@ap1.facility.path-cc.io:htc-1.3 .","title":"Using secure copy"},{"location":"materials/osg/part1-ex1-login-scp/#secure-copy-to-your-laptop","text":"During your research, you may need to transfer output files from your submit server to inspect them on your personal computer, which can also be done with scp ! To use scp on your laptop, follow the instructions relevant to your computer\u2018s operating system:","title":"Secure copy to your laptop"},{"location":"materials/osg/part1-ex1-login-scp/#mac-and-linux-users","text":"scp should be included by default and available via the terminal on both Mac and Linux operating systems.","title":"Mac and Linux users"},{"location":"materials/osg/part1-ex1-login-scp/#windows-users","text":"WinSCP is an scp client for Windows operating systems. Install WinSCP from https://winscp.net/eng/index.php","title":"Windows users"},{"location":"materials/osg/part1-ex1-login-scp/#next-exercise","text":"Once completed, move onto the next exercise: Running jobs in the OSG","title":"Next exercise"},{"location":"materials/osg/part1-ex2-submit-osg/","text":"OSG Exercise 1.2: Running Jobs in OSPool \u00b6 The goal of this exercise is to map the physical locations of some Execution Points in the OSPool. We will provide the executable and associated data, so your job will be to write a submit file that queues multiple jobs. Once complete, you will manually collate the results. Where in the world are my jobs? \u00b6 To find the physical location of the computers your jobs our running on, you will use a method called geolocation . Geolocation uses a registry to match a computer\u2019s network address to an approximate latitude and longitude. Geolocating several Execution Points \u00b6 Now, let\u2019s try to remember some basic HTCondor ideas from the HTC exercises: Log in to ap40.uw.osg-htc.org if you have not yet. 
Create and change into a new folder for this exercise, for example osg-ex12 Download the geolocation code: $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool21/location-wrapper.sh \\ http://proxy.chtc.wisc.edu/SQUID/osgschool21/wn-geoip.tar.gz You will be using location-wrapper.sh as your executable and wn-geoip.tar.gz as an input file. Create a submit file that queues fifty jobs that run location-wrapper.sh , transfers wn-geoip.tar.gz as an input file, and uses the $(Process) macro to write different output and error files. Also, add the following requirement to the submit file (it\u2019s not important to know what it does): requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) Try to do this step without looking at materials from the earlier exercises. But if you are stuck, see HTC Exercise 2.2 . Submit your jobs and wait for the results Collating your results \u00b6 Now that you have your results, it\u2019s time to summarize them. Rather than inspecting each output file individually, you can use the cat command to print the results from all of your output files at once. If all of your output files have the format location-#.out (e.g., location-10.out ), your command will look something like this: $ cat location-*.out The * is a wildcard so the above cat command runs on all files that start with location- and end in .out . Additionally, you can use cat in combination with the sort and uniq commands using \"pipes\" ( | ) to print only the unique results: $ cat location-*.out | sort | uniq Mapping your results \u00b6 To visualize the locations of the Execution Points that your jobs ran on, you will be using http://www.mapcustomizer.com/ . Copy and paste the collated results into the text box that pops up when clicking on the 'Bulk Entry' button on the right-hand side. Where did your jobs run? Next exercise \u00b6 Once completed, move onto the next exercise: Hardware Differences in the OSG Extra Challenge: Cleaning up your submit directory \u00b6 If you run ls in the directory from which you submitted your job, you may see that you now have thousands of files! Proper data management starts to become a requirement as you start to develop true HTC workflows; it may be helpful to separate your submit files, code, and input data from your output data. Try editing your submit file so that all your output and error files are saved to separate directories within your submit directory. Tip Experiment with fewer job submissions until you\u2019re confident you have it right, then go back to submitting 500 jobs. Remember: Test small and scale up! Submit your file and track the status of your jobs. Did your jobs complete successfully with output and error files saved in separate directories? If not, can you find any useful information in the job logs or hold messages? If you get stuck, review the slides from Tuesday .","title":"1.2 - Running jobs in the OSPool"},{"location":"materials/osg/part1-ex2-submit-osg/#osg-exercise-12-running-jobs-in-ospool","text":"The goal of this exercise is to map the physical locations of some Execution Points in the OSPool. We will provide the executable and associated data, so your job will be to write a submit file that queues multiple jobs. 
Once complete, you will manually collate the results.","title":"OSG Exercise 1.2: Running Jobs in OSPool"},{"location":"materials/osg/part1-ex2-submit-osg/#where-in-the-world-are-my-jobs","text":"To find the physical location of the computers your jobs our running on, you will use a method called geolocation . Geolocation uses a registry to match a computer\u2019s network address to an approximate latitude and longitude.","title":"Where in the world are my jobs?"},{"location":"materials/osg/part1-ex2-submit-osg/#geolocating-several-execution-points","text":"Now, let\u2019s try to remember some basic HTCondor ideas from the HTC exercises: Log in to ap40.uw.osg-htc.org if you have not yet. Create and change into a new folder for this exercise, for example osg-ex12 Download the geolocation code: $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool21/location-wrapper.sh \\ http://proxy.chtc.wisc.edu/SQUID/osgschool21/wn-geoip.tar.gz You will be using location-wrapper.sh as your executable and wn-geoip.tar.gz as an input file. Create a submit file that queues fifty jobs that run location-wrapper.sh , transfers wn-geoip.tar.gz as an input file, and uses the $(Process) macro to write different output and error files. Also, add the following requirement to the submit file (it\u2019s not important to know what it does): requirements = (HAS_CVMFS_oasis_opensciencegrid_org == TRUE) && (IsOsgVoContainer =!= True) Try to do this step without looking at materials from the earlier exercises. But if you are stuck, see HTC Exercise 2.2 . Submit your jobs and wait for the results","title":"Geolocating several Execution Points"},{"location":"materials/osg/part1-ex2-submit-osg/#collating-your-results","text":"Now that you have your results, it\u2019s time to summarize them. Rather than inspecting each output file individually, you can use the cat command to print the results from all of your output files at once. If all of your output files have the format location-#.out (e.g., location-10.out ), your command will look something like this: $ cat location-*.out The * is a wildcard so the above cat command runs on all files that start with location- and end in .out . Additionally, you can use cat in combination with the sort and uniq commands using \"pipes\" ( | ) to print only the unique results: $ cat location-*.out | sort | uniq","title":"Collating your results"},{"location":"materials/osg/part1-ex2-submit-osg/#mapping-your-results","text":"To visualize the locations of the Execution Points that your jobs ran on, you will be using http://www.mapcustomizer.com/ . Copy and paste the collated results into the text box that pops up when clicking on the 'Bulk Entry' button on the right-hand side. Where did your jobs run?","title":"Mapping your results"},{"location":"materials/osg/part1-ex2-submit-osg/#next-exercise","text":"Once completed, move onto the next exercise: Hardware Differences in the OSG","title":"Next exercise"},{"location":"materials/osg/part1-ex2-submit-osg/#extra-challenge-cleaning-up-your-submit-directory","text":"If you run ls in the directory from which you submitted your job, you may see that you now have thousands of files! Proper data management starts to become a requirement as you start to develop true HTC workflows; it may be helpful to separate your submit files, code, and input data from your output data. Try editing your submit file so that all your output and error files are saved to separate directories within your submit directory. 
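One minimal sketch of that change (assuming you first create the subdirectories, for example with mkdir output error, and assuming the location-* filename pattern used above):
output = output/location-$(Process).out
error = error/location-$(Process).err
HTCondor resolves these paths relative to the submit directory, so each job's standard output and error land in the matching subdirectory.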
Tip Experiment with fewer job submissions until you\u2019re confident you have it right, then go back to submitting 500 jobs. Remember: Test small and scale up! Submit your file and track the status of your jobs. Did your jobs complete successfully with output and error files saved in separate directories? If not, can you find any useful information in the job logs or hold messages? If you get stuck, review the slides from Tuesday .","title":"Extra Challenge: Cleaning up your submit directory"},{"location":"materials/osg/part1-ex3-hardware-diffs/","text":"OSG Exercise 1.3: Hardware Differences Between PATh and OSG \u00b6 The goal of this exercise is to compare hardware differences between the Monday cluster (the PATh Facility) and the Open Science Pool. Specifically, we will look at how easy it is to get access to resources in terms of the amount of memory that is requested. This will not be a very careful study, but should give you some idea of one way in which the pools are different. In the first two parts of the exercise, you will submit batches of jobs that differ only in how much memory each one requests. This is called this a parameter sweep , in that we are testing many possible values of a parameter. We will request memory from 8\u201364 GB, doubling the memory each time. One set of jobs will be submitted to the PATh Facility, and the other, identical set of jobs will be submitted to the OSPool. You will check the queue periodically to see how many jobs have completed and how many are still waiting to run. Checking PATh memory availability \u00b6 In this first part, you will create the submit file that will be used for both the PATh and OSPool jobs, then submit the PATh set. Yet another queue syntax \u00b6 Earlier, you learned about the queue statement and some of the different ways it can be invoked to submit multiple jobs. Similar to the queue from statement to submit jobs based on lines from a specific file, you can use queue in to submit jobs based on a list that is written directly in your submit file: queue <# of jobs> in ( ... ) For example, to submit 6 total jobs that sleep for 5 , 5 , 10 , 10 , 15 , and 15 seconds, you could write the following submit file: executable = /bin/sleep request_cpus = 1 request_memory = 1MB request_disk = 1MB queue 2 arguments in ( 5 10 15 ) Try submitting this yourself and verify that all six jobs are in the queue, using the condor_q -nobatch command. Create the submit file \u00b6 To create our parameter sweep, we will create a new submit file with the queue\u2026in syntax and change the value of our parameter ( request_memory ) for each batch of jobs. Log in or switch back to ap1.facility.path-cc.io (yes, back to PATh!) Create and change into a new subdirectory called osg-ex13 Create a submit file named sleep.sub that executes the command /bin/sleep 300 . Note If you do not remember all of the submit statements to write this file, or just to go faster, find a similar submit file from a previous exercise. Copy the file and rename it here, and make sure the argument to sleep is 300 . Use the queue\u2026in syntax to submit 10 jobs each for the following memory requests: 8, 16, 32, and 64 GB. There will be 40 jobs total: 10 jobs requesting 8 GB, 10 requesting 16 GB, etc. Submit your jobs Monitoring the local jobs \u00b6 Every few minutes, run condor_q and see how your sleep jobs are doing. 
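Before going further, here is one possible shape for the sleep.sub described above; treat it as a sketch rather than the official solution, and note that the log and output file names are made-up examples.

```
# sleep.sub -- a possible sketch of the memory parameter sweep (file names are assumptions)
executable = /bin/sleep
arguments  = 300

request_cpus = 1
request_disk = 1MB

log    = sleep.log
output = sleep-$(Cluster)-$(Process).out
error  = sleep-$(Cluster)-$(Process).err

# 10 jobs per memory request (8, 16, 32, and 64 GB) = 40 jobs total
queue 10 request_memory in ( 8GB 16GB 32GB 64GB )
```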
To display the number of jobs remaining for each request_memory parameter specified, run the following command: $ condor_q -af RequestMemory | sort -n | uniq -c The numbers in the left column are the number of jobs left of that type and the number on the right is the amount of memory you requested, in MB. Consider making a little table like the one below to track progress. Memory Remaining #1 Remaining #2 Remaining #3 8 GB 10 6 16 GB 10 7 32 GB 10 8 64 GB 10 9 In the meantime, between checking on your local jobs, start the next section \u2013 but take a break every few minutes to switch back to ap1 and record progress on your PATh jobs. Checking OSPool memory availability \u00b6 Now you will do essentially the same thing on the OSPool. Log in or switch to ap40.uw.osg-htc.org Copy the osg-ex13 directory from the section above from ap1.facility.path-cc.io to ap40.uw.osg-htc.org If you get stuck during the copying process, refer to OSG exercise 1.1 . Submit the jobs to the OSPool Monitoring the remote jobs \u00b6 As you did in the first part, use condor_q to track how your sleep jobs are doing. It is fine to move on to the next exercise, but keep tracking the status of both sets of these jobs. After you are done with the next exercise , come back to this exercise and analyze the results. Analyzing the results \u00b6 Have all of your jobs from this exercise completed on both PATh and the OSPool? How many jobs have completed thus far on PATh? How many have completed thus far on the OSPool? Due to the dynamic nature of the OSPool, the demand for higher memory jobs there may have resulted in a temporary increase in high-memory slots there. That being said, high-memory are a high-demand, low-availability resource in the OSPool so your 64 GB jobs may have taken longer to run or complete. On the other hand, PATh has a fair number of 64 GB (and greater) slots so all your jobs have a high chance of running.","title":"1.3 - Hardware differences between PATh and OSG"},{"location":"materials/osg/part1-ex3-hardware-diffs/#osg-exercise-13-hardware-differences-between-path-and-osg","text":"The goal of this exercise is to compare hardware differences between the Monday cluster (the PATh Facility) and the Open Science Pool. Specifically, we will look at how easy it is to get access to resources in terms of the amount of memory that is requested. This will not be a very careful study, but should give you some idea of one way in which the pools are different. In the first two parts of the exercise, you will submit batches of jobs that differ only in how much memory each one requests. This is called this a parameter sweep , in that we are testing many possible values of a parameter. We will request memory from 8\u201364 GB, doubling the memory each time. One set of jobs will be submitted to the PATh Facility, and the other, identical set of jobs will be submitted to the OSPool. 
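If you would rather not re-run that condor_q command by hand, an entirely optional helper loop like the sketch below (not part of the exercise; the file name and two-minute interval are arbitrary choices) can record timestamped snapshots for the tracking table while you work on the next section.

```bash
# Optional helper sketch: append a timestamped count of remaining jobs per
# memory request until the queue is empty.
# Note: this counts all of your queued jobs, not just this exercise's.
while [ -n "$(condor_q -af RequestMemory)" ]; do
    date                                           >> memory-progress.txt
    condor_q -af RequestMemory | sort -n | uniq -c >> memory-progress.txt
    sleep 120
done
```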
You will check the queue periodically to see how many jobs have completed and how many are still waiting to run.","title":"OSG Exercise 1.3: Hardware Differences Between PATh and OSG"},{"location":"materials/osg/part1-ex3-hardware-diffs/#checking-path-memory-availability","text":"In this first part, you will create the submit file that will be used for both the PATh and OSPool jobs, then submit the PATh set.","title":"Checking PATh memory availability"},{"location":"materials/osg/part1-ex3-hardware-diffs/#yet-another-queue-syntax","text":"Earlier, you learned about the queue statement and some of the different ways it can be invoked to submit multiple jobs. Similar to the queue from statement to submit jobs based on lines from a specific file, you can use queue in to submit jobs based on a list that is written directly in your submit file: queue <# of jobs> in ( ... ) For example, to submit 6 total jobs that sleep for 5 , 5 , 10 , 10 , 15 , and 15 seconds, you could write the following submit file: executable = /bin/sleep request_cpus = 1 request_memory = 1MB request_disk = 1MB queue 2 arguments in ( 5 10 15 ) Try submitting this yourself and verify that all six jobs are in the queue, using the condor_q -nobatch command.","title":"Yet another queue syntax"},{"location":"materials/osg/part1-ex3-hardware-diffs/#create-the-submit-file","text":"To create our parameter sweep, we will create a new submit file with the queue\u2026in syntax and change the value of our parameter ( request_memory ) for each batch of jobs. Log in or switch back to ap1.facility.path-cc.io (yes, back to PATh!) Create and change into a new subdirectory called osg-ex13 Create a submit file named sleep.sub that executes the command /bin/sleep 300 . Note If you do not remember all of the submit statements to write this file, or just to go faster, find a similar submit file from a previous exercise. Copy the file and rename it here, and make sure the argument to sleep is 300 . Use the queue\u2026in syntax to submit 10 jobs each for the following memory requests: 8, 16, 32, and 64 GB. There will be 40 jobs total: 10 jobs requesting 8 GB, 10 requesting 16 GB, etc. Submit your jobs","title":"Create the submit file"},{"location":"materials/osg/part1-ex3-hardware-diffs/#monitoring-the-local-jobs","text":"Every few minutes, run condor_q and see how your sleep jobs are doing. To display the number of jobs remaining for each request_memory parameter specified, run the following command: $ condor_q -af RequestMemory | sort -n | uniq -c The numbers in the left column are the number of jobs left of that type and the number on the right is the amount of memory you requested, in MB. Consider making a little table like the one below to track progress. Memory Remaining #1 Remaining #2 Remaining #3 8 GB 10 6 16 GB 10 7 32 GB 10 8 64 GB 10 9 In the meantime, between checking on your local jobs, start the next section \u2013 but take a break every few minutes to switch back to ap1 and record progress on your PATh jobs.","title":"Monitoring the local jobs"},{"location":"materials/osg/part1-ex3-hardware-diffs/#checking-ospool-memory-availability","text":"Now you will do essentially the same thing on the OSPool. Log in or switch to ap40.uw.osg-htc.org Copy the osg-ex13 directory from the section above from ap1.facility.path-cc.io to ap40.uw.osg-htc.org If you get stuck during the copying process, refer to OSG exercise 1.1 . 
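One way to do the copy, shown here only as a sketch, is to run scp from ap40; USERNAME is a placeholder for your PATh username, and this assumes osg-ex13 sits in your home directory on ap1.

```
# Run on ap40.uw.osg-htc.org; USERNAME is a placeholder for your ap1 username.
$ scp -r USERNAME@ap1.facility.path-cc.io:osg-ex13 ./
```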
Submit the jobs to the OSPool","title":"Checking OSPool memory availability"},{"location":"materials/osg/part1-ex3-hardware-diffs/#monitoring-the-remote-jobs","text":"As you did in the first part, use condor_q to track how your sleep jobs are doing. It is fine to move on to the next exercise, but keep tracking the status of both sets of these jobs. After you are done with the next exercise , come back to this exercise and analyze the results.","title":"Monitoring the remote jobs"},{"location":"materials/osg/part1-ex3-hardware-diffs/#analyzing-the-results","text":"Have all of your jobs from this exercise completed on both PATh and the OSPool? How many jobs have completed thus far on PATh? How many have completed thus far on the OSPool? Due to the dynamic nature of the OSPool, the demand for higher memory jobs there may have resulted in a temporary increase in high-memory slots there. That being said, high-memory slots are a high-demand, low-availability resource in the OSPool so your 64 GB jobs may have taken longer to run or complete. On the other hand, PATh has a fair number of 64 GB (and greater) slots so all your jobs have a high chance of running.","title":"Analyzing the results"},{"location":"materials/osg/part1-ex4-software-diffs/","text":"OSG Exercise 1.4: Software Differences in OSPool \u00b6 The goal of this exercise is to see some differences in the availability of software in the OSPool. At your local cluster, you may be used to having certain versions of software. But in the OSPool, it is likely that the software you need will not be available at all. Comparing operating systems \u00b6 To really see differences between Execution Points in the PATh Facility versus the OSPool, you will want to compare the \u201cmachine\u201d ClassAds between the two pools. Rather than inspecting the very long ClassAd for each Execution Point, you will look at a specific attribute called OpSysAndVer , which tells us the operating system version of the Execution Point. An easy way to show this attribute for all Execution Points is by using condor_status in conjunction with the -autoformat (or -af , for short) option. The -autoformat option is like the -format option you learned about earlier, and outputs the attributes you choose for each slot; but as you may have guessed, it does some automatic formatting for you. So, let\u2019s examine the operating system and (major) version of slots on the PATh Facility and the OSPool. Log in or switch to ap1.facility.path-cc.io and run the following command: $ condor_status -autoformat OpSysAndVer Log in or switch to ap40.uw.osg-htc.org (parallel windows are handy!) 
This is a step in the right direction to knowing what software is available in general. But what you really want to know is whether your specific software tool (and version) is available. Software probe code \u00b6 The following shell script probes for software and returns the version if it is installed: #!/bin/sh get_version (){ program = $1 $program --version > /dev/null 2 > & 1 double_dash_rc = $? $program -version > /dev/null 2 > & 1 single_dash_rc = $? which $program > /dev/null 2 > & 1 which_rc = $? if [ $double_dash_rc -eq 0 ] ; then $program --version 2 > & 1 elif [ $single_dash_rc -eq 0 ] ; then $program -version 2 > & 1 elif [ $which_rc -eq 0 ] ; then echo \" $program installed but could not find version information\" else echo \" $program not installed\" fi } get_version 'R' get_version 'cmake' get_version 'python' get_version 'python3' If there's a specific command line program that your research requires, feel free to add it to the script! For example, if you wanted to test for the existence and version of nslookup , you would add the following to the end of the script: get_version 'nslookup' Probing several servers \u00b6 For this part of the exercise, try creating a submit file without referring to previous exercises! Log in or switch to ap40.uw.osg-htc.org Create and change into a new folder for this exercise, e.g. osg-ex14 Save the above script as a file named sw_probe.sh Make sure the script can be run: chmod a+x sw_probe.sh Try running the script in place to make sure it works: ./sw_probe.sh Create a submit file that runs sw_probe.sh 100 times and uses macros to write different output , error , and log files Submit your job and wait for the results Will you be able to do your research on the OSG with what's available? Do not worry if it does not seem like you can: Later today, you will learn how to make your jobs portable enough so that they can run anywhere!","title":"1.4 - Software differences in OSPool"},{"location":"materials/osg/part1-ex4-software-diffs/#osg-exercise-14-software-differences-in-ospool","text":"The goal of this exercise is to see some differences in the availability of software in the OSPool. At your local cluster, you may be used to having certain versions of software. But in the OSPool, it is likely that the software you need will not be available at all.","title":"OSG Exercise 1.4: Software Differences in OSPool"},{"location":"materials/osg/part1-ex4-software-diffs/#comparing-operating-systems","text":"To really see differences between Execution Points in the PATh Facility versus the OSPool, you will want to compare the \u201cmachine\u201d ClassAds between the two pools. Rather than inspecting the very long ClassAd for each Execution Point, you will look at a specific attribute called OpSysAndVer , which tells us the operating system version of the Execution Point. An easy way to show this attribute for all Execution Points is by using condor_status in conjunction with the -autoformat (or -af , for short) option. The -autoformat option is like the -format option you learned about earlier, and outputs the attributes you choose for each slot; but as you may have guessed, it does some automatic formatting for you. So, let\u2019s examine the operating system and (major) version of slots on the PATh Facility and the OSPool. Log in or switch to ap1.facility.path-cc.io and run the following command: $ condor_status -autoformat OpSysAndVer Log in or switch to ap40.uw.osg-htc.org (parallel windows are handy!) 
and run the same command You will see many values for the operating system and major version. Some are abbreviated \u2014 for example, RedHat stands for \u201cRed Hat Enterprise Linux\u201d and SL stands for \u201cScientific Linux\u201d (a Red Hat variant). The only problem is that with hundreds or thousands of slots, it's difficult to get a feel for the composition of each pool from this output. You can use the sort and uniq commands, in sequence, on the condor_status output to get counts of each unique operating system and version string. Your command line should look something like this: $ condor_status -autoformat OpSysAndVer | sort | uniq -c How would you describe the difference between the PATh Facility and OSPool?","title":"Comparing operating systems"},{"location":"materials/osg/part1-ex4-software-diffs/#submitting-probe-jobs","text":"Now you have some idea of the diversity of operating systems on the OSPool. This is a step in the right direction to knowing what software is available in general. But what you really want to know is whether your specific software tool (and version) is available.","title":"Submitting probe jobs"},{"location":"materials/osg/part1-ex4-software-diffs/#software-probe-code","text":"The following shell script probes for software and returns the version if it is installed: #!/bin/sh get_version (){ program = $1 $program --version > /dev/null 2 > & 1 double_dash_rc = $? $program -version > /dev/null 2 > & 1 single_dash_rc = $? which $program > /dev/null 2 > & 1 which_rc = $? if [ $double_dash_rc -eq 0 ] ; then $program --version 2 > & 1 elif [ $single_dash_rc -eq 0 ] ; then $program -version 2 > & 1 elif [ $which_rc -eq 0 ] ; then echo \" $program installed but could not find version information\" else echo \" $program not installed\" fi } get_version 'R' get_version 'cmake' get_version 'python' get_version 'python3' If there's a specific command line program that your research requires, feel free to add it to the script! For example, if you wanted to test for the existence and version of nslookup , you would add the following to the end of the script: get_version 'nslookup'","title":"Software probe code"},{"location":"materials/osg/part1-ex4-software-diffs/#probing-several-servers","text":"For this part of the exercise, try creating a submit file without referring to previous exercises! Log in or switch to ap40.uw.osg-htc.org Create and change into a new folder for this exercise, e.g. osg-ex14 Save the above script as a file named sw_probe.sh Make sure the script can be run: chmod a+x sw_probe.sh Try running the script in place to make sure it works: ./sw_probe.sh Create a submit file that runs sw_probe.sh 100 times and uses macros to write different output , error , and log files Submit your job and wait for the results Will you be able to do your research on the OSG with what's available? Do not worry if it does not seem like you can: Later today, you will learn how to make your jobs portable enough so that they can run anywhere!","title":"Probing several servers"},{"location":"materials/scaling/part1-ex1-organization/","text":"Organizing HTC Workloads \u00b6 Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. This exercise is similar to HTCondor exercise 2.4, in that it is about counting word frequencies in multiple files. But the focus here is on organizing the files more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future. 
Log into an OSPool Access Point \u00b6 Make sure you are logged into ap40.uw.osg-htc.org . Get Files \u00b6 To get the files for this exercise: Type wget https://github.com/osg-htc/school-2024/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz to download the tarball. As you learned earlier, expand this tarball file; it will create a organizing-files directory. Change to that directory, or create a separate one for this exercise and copy the files in. Our Workload \u00b6 We can analyze one book by running the wordcount.py script, with the name of the book we want to analyze: $ ./wordcount.py Alice_in_Wonderland.txt Try running the command to see what the output is for the script. Once you have done that delete the output file created ( rm counts.Alice_in_Wonderland.txt ). We want to run this script on all the books we have copies of. What is the input set for this HTC workload? What is the output set? Make an Organization Plan \u00b6 Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point? There will also be system and HTCondor files produced when we submit a job \u2014 how would you organize the log, standard output, and standard error files? Try making those changes before moving on to the next section of the tutorial. Organize Files \u00b6 There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files. Set up this structure on the command line by running: $ mkdir input $ mv *.txt input/ $ mkdir output View the current directory and its subdirectories by using the ls command with the recursive ( -R ) flag: $ ls -R README.md books.submit input output wordcount.py ./input: Alice_in_Wonderland.txt Huckleberry_Finn.txt Dracula.txt Pride_and_Prejudice.txt ./output: Next, create directories for the HTCondor log, standard output, and standard output files (in one directory): $ mkdir logs $ mkdir errout Submit One Job \u00b6 Now we want to submit a test job that uses this organizing scheme, using just one item in our input set \u2014 in this example, we will use the Alice_in_Wonderland.txt file from our input directory. Fill in the incomplete lines of the submit file, as shown below: executable = wordcount.py arguments = Alice_in_Wonderland.txt transfer_input_files = input/Alice_in_Wonderland.txt transfer_output_files = counts.Alice_in_Wonderland.txt transfer_output_remaps = \"counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt\" To tell HTCondor the location of the input file, we need to include the input directory. Also, this submit file uses the transfer_output_remaps feature that you learned about; it will move the output file to the output directory by renaming or remapping it. Next, edit the submit file lines that tell the log, output, and error files where to go: output = logs/job.$(ClusterID).$(ProcID).out error = errout/job.$(ClusterID).$(ProcID).err log = errout/job.$(ClusterID).$(ProcID).log Submit your job and monitor its progress. Submit Multiple Jobs \u00b6 Now, you are ready to submit the whole workload. Create a file with the list of input files (the input set); here, this is the list of the book files to analyze. 
Do this by using the shell ls command and redirecting its output to a file: $ ls input > booklist.txt $ cat booklist.txt Modify the submit file to reference the file of inputs and replace the fixed value ( Alice_in_Wonderland.txt ) with a variable ( $(book) ): executable = wordcount.py arguments = $(book) transfer_input_files = input/$(book) transfer_output_files = counts.$(book) transfer_output_remaps = \"counts.$(book)=output/counts.$(book)\" queue book from booklist.txt Submit the jobs When complete, look at the complete set of input and (now) output files to see how they are organized.","title":"1.1 - Organizing HTC workloads"},{"location":"materials/scaling/part1-ex1-organization/#organizing-htc-workloads","text":"Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. This exercise is similar to HTCondor exercise 2.4, in that it is about counting word frequencies in multiple files. But the focus here is on organizing the files more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future.","title":"Organizing HTC Workloads"},{"location":"materials/scaling/part1-ex1-organization/#log-into-an-ospool-access-point","text":"Make sure you are logged into ap40.uw.osg-htc.org .","title":"Log into an OSPool Access Point"},{"location":"materials/scaling/part1-ex1-organization/#get-files","text":"To get the files for this exercise: Type wget https://github.com/osg-htc/school-2024/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz to download the tarball. As you learned earlier, expand this tarball file; it will create a organizing-files directory. Change to that directory, or create a separate one for this exercise and copy the files in.","title":"Get Files"},{"location":"materials/scaling/part1-ex1-organization/#our-workload","text":"We can analyze one book by running the wordcount.py script, with the name of the book we want to analyze: $ ./wordcount.py Alice_in_Wonderland.txt Try running the command to see what the output is for the script. Once you have done that delete the output file created ( rm counts.Alice_in_Wonderland.txt ). We want to run this script on all the books we have copies of. What is the input set for this HTC workload? What is the output set?","title":"Our Workload"},{"location":"materials/scaling/part1-ex1-organization/#make-an-organization-plan","text":"Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point? There will also be system and HTCondor files produced when we submit a job \u2014 how would you organize the log, standard output, and standard error files? Try making those changes before moving on to the next section of the tutorial.","title":"Make an Organization Plan"},{"location":"materials/scaling/part1-ex1-organization/#organize-files","text":"There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files. 
Set up this structure on the command line by running: $ mkdir input $ mv *.txt input/ $ mkdir output View the current directory and its subdirectories by using the ls command with the recursive ( -R ) flag: $ ls -R README.md books.submit input output wordcount.py ./input: Alice_in_Wonderland.txt Huckleberry_Finn.txt Dracula.txt Pride_and_Prejudice.txt ./output: Next, create directories for the HTCondor log, standard output, and standard output files (in one directory): $ mkdir logs $ mkdir errout","title":"Organize Files"},{"location":"materials/scaling/part1-ex1-organization/#submit-one-job","text":"Now we want to submit a test job that uses this organizing scheme, using just one item in our input set \u2014 in this example, we will use the Alice_in_Wonderland.txt file from our input directory. Fill in the incomplete lines of the submit file, as shown below: executable = wordcount.py arguments = Alice_in_Wonderland.txt transfer_input_files = input/Alice_in_Wonderland.txt transfer_output_files = counts.Alice_in_Wonderland.txt transfer_output_remaps = \"counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt\" To tell HTCondor the location of the input file, we need to include the input directory. Also, this submit file uses the transfer_output_remaps feature that you learned about; it will move the output file to the output directory by renaming or remapping it. Next, edit the submit file lines that tell the log, output, and error files where to go: output = logs/job.$(ClusterID).$(ProcID).out error = errout/job.$(ClusterID).$(ProcID).err log = errout/job.$(ClusterID).$(ProcID).log Submit your job and monitor its progress.","title":"Submit One Job"},{"location":"materials/scaling/part1-ex1-organization/#submit-multiple-jobs","text":"Now, you are ready to submit the whole workload. Create a file with the list of input files (the input set); here, this is the list of the book files to analyze. Do this by using the shell ls command and redirecting its output to a file: $ ls input > booklist.txt $ cat booklist.txt Modify the submit file to reference the file of inputs and replace the fixed value ( Alice_in_Wonderland.txt ) with a variable ( $(book) ): executable = wordcount.py arguments = $(book) transfer_input_files = input/$(book) transfer_output_files = counts.$(book) transfer_output_remaps = \"counts.$(book)=output/counts.$(book)\" queue book from booklist.txt Submit the jobs When complete, look at the complete set of input and (now) output files to see how they are organized.","title":"Submit Multiple Jobs"},{"location":"materials/scaling/part1-ex2-job-attributes/","text":"Exercise 1.2: Investigating Job Attributes \u00b6 The objective of this exercise is to your awareness of job \"class ad attributes\", especially ones that may help you look for issues with your jobs in the OSPool. Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. OSPool jobs contain extra attributes that are specific to that pool. Thus, an OSPool job class ad may have well over 150 attributes. Some OSPool job attributes are especially helpful when you are scaling up jobs and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention. Preparing exercise files \u00b6 Because this exercise focuses on OSPool job attributes, please use your OSPool account on ap40.uw.osg-htc.org . 
Create a shell script for testing called simple.sh : #!/bin/bash SLEEPTIME=$1 hostname pwd whoami for i in {1..5} do echo \"performing iteration $i\" sleep $SLEEPTIME done Create an HTCondor submit file that queues three jobs: universe = vanilla log = logs/$(Cluster)_$(Process).log error = logs/$(Cluster)_$(Process).err output = $(Cluster)_$(Process).out executable = simple.sh should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1GB # set arguments, queue a normal job arguments = 600 queue 1 # queue a job that will go on hold transfer_input_files = test.txt queue 1 # queue a job that will never start request_memory = 40TB queue 1 Exploring OSPool job class ad attributes \u00b6 For this exercise, you will submit the three jobs defined in the submit file above, then examine their job class ad attributes. Here are some attributes that may be interesting: CpusProvisioned is the number of CPUs given to your job for the current or most recent run ResidentSetSize_RAW is the maximum amount of memory that HTCondor has noticed your job using (in KB) DiskUsage_RAW is the maximum amount of disk that HTCondor has noticed your job using (in MB) NumJobStarts is the number of times HTCondor has started your job; 1 is typical for a running job, and higher counts may indicate issues running the job LastRemoteHost identifies the name for the slot where your job is running or most recently ran MachineAttrGLIDEIN_ResourceName*N* is a set of numbered attributes that identify the most recent sites where your job ran; N is 0 for the most recent (or current) run, 1 for the previous run, and so on up to 9 ExitCode exists only if your job exited (completed) at least once; a value of 0 typically means success HoldReasonCode exists only if your job went on hold; if so, it is a number corresponding to the main hold reason (see here for details) NumHoldsByReason is a list of all of the main reasons your job has gone on hold so far with counts of each hold type Let\u2019s explore these attributes on real jobs. Submit the jobs (above) and note the cluster ID When one job from the cluster is running, view all of its job class ad attributes: $ condor_q -l where is your job's ID, and -l stands for -long This command lists all of the job\u2019s class ad attributes. Details of some of the attributes are in the HTcondor Manual . Others are defined (and not well documented) only for the OSPool. Can you find any of the attributes listed above? Next, use condor_q -af to examine one attribute at a time for several jobs: $ condor_q -af NumJobStarts where is the HTCondor cluster ID noted above, and -af stands for -autoformat . What does the output tell you? Finally, display several attributes at once for the jobs: $ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode Why do some values appear as undefined ?","title":"1.2 - Investigating Job Attributes"},{"location":"materials/scaling/part1-ex2-job-attributes/#exercise-12-investigating-job-attributes","text":"The objective of this exercise is to your awareness of job \"class ad attributes\", especially ones that may help you look for issues with your jobs in the OSPool. Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. OSPool jobs contain extra attributes that are specific to that pool. Thus, an OSPool job class ad may have well over 150 attributes. 
Some OSPool job attributes are especially helpful when you are scaling up jobs and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention.","title":"Exercise 1.2: Investigating Job Attributes"},{"location":"materials/scaling/part1-ex2-job-attributes/#preparing-exercise-files","text":"Because this exercise focuses on OSPool job attributes, please use your OSPool account on ap40.uw.osg-htc.org . Create a shell script for testing called simple.sh : #!/bin/bash SLEEPTIME=$1 hostname pwd whoami for i in {1..5} do echo \"performing iteration $i\" sleep $SLEEPTIME done Create an HTCondor submit file that queues three jobs: universe = vanilla log = logs/$(Cluster)_$(Process).log error = logs/$(Cluster)_$(Process).err output = $(Cluster)_$(Process).out executable = simple.sh should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1GB # set arguments, queue a normal job arguments = 600 queue 1 # queue a job that will go on hold transfer_input_files = test.txt queue 1 # queue a job that will never start request_memory = 40TB queue 1","title":"Preparing exercise files"},{"location":"materials/scaling/part1-ex2-job-attributes/#exploring-ospool-job-class-ad-attributes","text":"For this exercise, you will submit the three jobs defined in the submit file above, then examine their job class ad attributes. Here are some attributes that may be interesting: CpusProvisioned is the number of CPUs given to your job for the current or most recent run ResidentSetSize_RAW is the maximum amount of memory that HTCondor has noticed your job using (in KB) DiskUsage_RAW is the maximum amount of disk that HTCondor has noticed your job using (in MB) NumJobStarts is the number of times HTCondor has started your job; 1 is typical for a running job, and higher counts may indicate issues running the job LastRemoteHost identifies the name for the slot where your job is running or most recently ran MachineAttrGLIDEIN_ResourceName*N* is a set of numbered attributes that identify the most recent sites where your job ran; N is 0 for the most recent (or current) run, 1 for the previous run, and so on up to 9 ExitCode exists only if your job exited (completed) at least once; a value of 0 typically means success HoldReasonCode exists only if your job went on hold; if so, it is a number corresponding to the main hold reason (see here for details) NumHoldsByReason is a list of all of the main reasons your job has gone on hold so far with counts of each hold type Let\u2019s explore these attributes on real jobs. Submit the jobs (above) and note the cluster ID When one job from the cluster is running, view all of its job class ad attributes: $ condor_q -l where is your job's ID, and -l stands for -long This command lists all of the job\u2019s class ad attributes. Details of some of the attributes are in the HTcondor Manual . Others are defined (and not well documented) only for the OSPool. Can you find any of the attributes listed above? Next, use condor_q -af to examine one attribute at a time for several jobs: $ condor_q -af NumJobStarts where is the HTCondor cluster ID noted above, and -af stands for -autoformat . What does the output tell you? 
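As an optional aside (not part of the original exercise), the same sort and uniq trick used elsewhere in these materials can summarize a single attribute across many jobs; adding the cluster ID noted above limits the output to this exercise's jobs.

```
$ condor_q -af NumJobStarts | sort -n | uniq -c
```

The left column is the number of jobs and the right column is the attribute value.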
Finally, display several attributes at once for the jobs: $ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode Why do some values appear as undefined ?","title":"Exploring OSPool job class ad attributes"},{"location":"materials/scaling/part1-ex3-log-files/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Getting Job Information from Log Files \u00b6 HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful to you . Here, we have a few examples of how to use some powerful Unix commands ( grep , sort , uniq ) to pull information out of these job log files. It is now time for you to try these on your own jobs! Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) in to a new directory for this exercise. Use these log files in place of my-job.log in the examples below. The grep command displays lines from a file matching a given pattern, where the pattern is the first argument provided to grep . For example grep 'alice' address_book.txt would print out all lines containing the characters alice in the file named address_book.txt . While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise. Job terminations \u00b6 Lines for job termination events in the job log always start with 005 and contain the timestamp of when the job(s) ended. Use the following grep command to get a list of when jobs ended in your log files: $ grep '^005' my-job.log Optional challenge : What is the importance of ^ in the pattern ( ^005 ) provided above? Recall that executables typically exit with code 0 when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word termination . Use grep to get a list of jobs' exit codes: $ grep termination my-job.log By \"piping\" the output of the previous command through the sort and then uniq commands, we can get a count of each exit code: $ grep termination my-job.log | sort | uniq -c Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code 0 , while two exited 1 : [username@ap40]$ grep '^005' my-job.log 005 (236881.000.000) 2022-07-27 15:07:38 Job terminated. 005 (236883.000.000) 2022-07-27 15:07:42 Job terminated. 005 (236882.000.000) 2022-07-27 15:08:01 Job terminated. 005 (236880.000.000) 2022-07-27 15:08:07 Job terminated. 005 (236891.000.000) 2022-07-27 15:13:31 Job terminated. 005 (236893.000.000) 2022-07-27 15:13:32 Job terminated. 005 (236892.000.000) 2022-07-27 15:13:58 Job terminated. 005 (236890.000.000) 2022-07-27 15:13:59 Job terminated. 
[username@ap40]$ grep 'termination' my-job.log (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 0) [username@ap40]$ grep 'termination' my-job.log | sort | uniq -c 6 (1) Normal termination (return value 0) 2 (1) Normal termination (return value 1) Job resource usage \u00b6 Jobs' resource usages (and requests and allocations) are logged in the following format: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 10382 1048576 1468671 Memory (MB) : 692 1024 1024 Run the following grep command to pull out the memory information from your job logs: $ grep 'Memory (MB) *:' my-job.log Look back at the format in the example above. Columns after the : will first show memory usage, then memory requested, and then the memory allocated to your job. Similarly, use the following command to get the disk information from your job logs: $ grep 'Disk (KB) *:' my-job.log Here's some example output from running the memory grep command on the same eight-job log file: [username@ap40]$ grep 'Memory (MB) *:' my-job.log Memory (MB) : 692 1024 1024 Memory (MB) : 714 1024 1024 Memory (MB) : 703 1024 1024 Memory (MB) : 699 1024 1024 Memory (MB) : 705 1024 1024 Memory (MB) : 704 1024 1024 Memory (MB) : 711 1024 1024 Memory (MB) : 697 1024 1024 In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory. Other job information \u00b6 See if you can come up with grep commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison: [username@ap40]$ grep '' my-job.log 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job [username@ap40]$ grep '' my-job.log 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using grep to pull out specific lines and using sort and uniq to reduce the output can help you make sense of the information contained in the logs.","title":"1.3 - Getting Job Information from Log Files"},{"location":"materials/scaling/part1-ex3-log-files/#getting-job-information-from-log-files","text":"HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful to you . Here, we have a few examples of how to use some powerful Unix commands ( grep , sort , uniq ) to pull information out of these job log files. 
It is now time for you to try these on your own jobs! Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) in to a new directory for this exercise. Use these log files in place of my-job.log in the examples below. The grep command displays lines from a file matching a given pattern, where the pattern is the first argument provided to grep . For example grep 'alice' address_book.txt would print out all lines containing the characters alice in the file named address_book.txt . While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise.","title":"Getting Job Information from Log Files"},{"location":"materials/scaling/part1-ex3-log-files/#job-terminations","text":"Lines for job termination events in the job log always start with 005 and contain the timestamp of when the job(s) ended. Use the following grep command to get a list of when jobs ended in your log files: $ grep '^005' my-job.log Optional challenge : What is the importance of ^ in the pattern ( ^005 ) provided above? Recall that executables typically exit with code 0 when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word termination . Use grep to get a list of jobs' exit codes: $ grep termination my-job.log By \"piping\" the output of the previous command through the sort and then uniq commands, we can get a count of each exit code: $ grep termination my-job.log | sort | uniq -c Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code 0 , while two exited 1 : [username@ap40]$ grep '^005' my-job.log 005 (236881.000.000) 2022-07-27 15:07:38 Job terminated. 005 (236883.000.000) 2022-07-27 15:07:42 Job terminated. 005 (236882.000.000) 2022-07-27 15:08:01 Job terminated. 005 (236880.000.000) 2022-07-27 15:08:07 Job terminated. 005 (236891.000.000) 2022-07-27 15:13:31 Job terminated. 005 (236893.000.000) 2022-07-27 15:13:32 Job terminated. 005 (236892.000.000) 2022-07-27 15:13:58 Job terminated. 005 (236890.000.000) 2022-07-27 15:13:59 Job terminated. [username@ap40]$ grep 'termination' my-job.log (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 1) (1) Normal termination (return value 0) (1) Normal termination (return value 0) (1) Normal termination (return value 0) [username@ap40]$ grep 'termination' my-job.log | sort | uniq -c 6 (1) Normal termination (return value 0) 2 (1) Normal termination (return value 1)","title":"Job terminations"},{"location":"materials/scaling/part1-ex3-log-files/#job-resource-usage","text":"Jobs' resource usages (and requests and allocations) are logged in the following format: Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 10382 1048576 1468671 Memory (MB) : 692 1024 1024 Run the following grep command to pull out the memory information from your job logs: $ grep 'Memory (MB) *:' my-job.log Look back at the format in the example above. Columns after the : will first show memory usage, then memory requested, and then the memory allocated to your job. 
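If you want a single number instead of one line per job, one possible follow-up (an illustrative awk sketch, not something the exercise asks for) is to pull out the usage column and keep the largest value:

```
# Usage is the 4th whitespace-separated field on the "Memory (MB) :" lines
$ grep 'Memory (MB) *:' my-job.log | awk '{print $4}' | sort -n | tail -n 1
```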
Similarly, use the following command to get the disk information from your job logs: $ grep 'Disk (KB) *:' my-job.log Here's some example output from running the memory grep command on the same eight-job log file: [username@ap40]$ grep 'Memory (MB) *:' my-job.log Memory (MB) : 692 1024 1024 Memory (MB) : 714 1024 1024 Memory (MB) : 703 1024 1024 Memory (MB) : 699 1024 1024 Memory (MB) : 705 1024 1024 Memory (MB) : 704 1024 1024 Memory (MB) : 711 1024 1024 Memory (MB) : 697 1024 1024 In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory.","title":"Job resource usage"},{"location":"materials/scaling/part1-ex3-log-files/#other-job-information","text":"See if you can come up with grep commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison: [username@ap40]$ grep '' my-job.log 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760393 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job 760397 - Total Bytes Sent By Job 760395 - Total Bytes Sent By Job [username@ap40]$ grep '' my-job.log 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job 19240 - Total Bytes Received By Job Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using grep to pull out specific lines and using sort and uniq to reduce the output can help you make sense of the information contained in the logs.","title":"Other job information"},{"location":"materials/software/part1-ex1-run-apptainer/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.1: Run and Explore Containers \u00b6 Objective : Run a container interactively Why learn this? : Being able to run a container directly allows you to confirm what is installed and whether any additional scripts or code will work in the context of the container. Setup \u00b6 Make sure you are logged into ap40.uw.osg-htc.org . For this exercise we will be using Apptainer containers maintained by OSG staff or existing containers on Docker Hub. We will set two environment variables that will help lighten the load on the Access Point as we work with containers: $ mkdir ~/apptainer_cache $ export APPTAINER_CACHEDIR = $HOME /apptainer_cache $ export TMPDIR = $HOME /apptainer_cache Exploring Apptainer Containers \u00b6 First, let's try to run a container from the OSG-Supported List . Find the full path for the ubuntu 22.04 container image. To run it, use this command: $ apptainer shell /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 It may take a few minutes to start - don't worry if this happens. Once the container starts, the prompt will change to either Singularity> or Apptainer> . Run ls and pwd . Where are you? Do you see your files? 
The apptainer shell command will automatically connect your home directory to the running container so you can use your files. How do we know we're in a different Linux environment? Try printing out the Linux version, or checking the version of common tools like gcc or Python: $ cat /etc/os-release $ gcc --version $ python3 --version Exit out of the container by typing exit . Type the same commands back on the normal Access Point. Should they give the same results as when typed in the container, or different? $ cat /etc/os-release $ gcc --version $ python3 --version Exploring Docker Containers \u00b6 The process for interactively running a Docker container will be very similar to an apptainer container. The main difference is a docker:// prefix before the container's identifying name. We are going to be using a Python image from Docker Hub . Click on the \"Tags\" tab to see all the different versions of this container that exists. Let's use version 3.10 . To run it interactively, use this command: $ apptainer shell docker://python:3.10 Once the container starts and the prompt changes, try running similar commands as above. What version of Linux is used in this container? Does the version of Python match what you expect, based on the name of the container? Once done, type exit to leave the container.","title":"1.1 - Run and Explore Apptainer Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#software-exercise-11-run-and-explore-containers","text":"Objective : Run a container interactively Why learn this? : Being able to run a container directly allows you to confirm what is installed and whether any additional scripts or code will work in the context of the container.","title":"Software Exercise 1.1: Run and Explore Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#setup","text":"Make sure you are logged into ap40.uw.osg-htc.org . For this exercise we will be using Apptainer containers maintained by OSG staff or existing containers on Docker Hub. We will set two environment variables that will help lighten the load on the Access Point as we work with containers: $ mkdir ~/apptainer_cache $ export APPTAINER_CACHEDIR = $HOME /apptainer_cache $ export TMPDIR = $HOME /apptainer_cache","title":"Setup"},{"location":"materials/software/part1-ex1-run-apptainer/#exploring-apptainer-containers","text":"First, let's try to run a container from the OSG-Supported List . Find the full path for the ubuntu 22.04 container image. To run it, use this command: $ apptainer shell /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 It may take a few minutes to start - don't worry if this happens. Once the container starts, the prompt will change to either Singularity> or Apptainer> . Run ls and pwd . Where are you? Do you see your files? The apptainer shell command will automatically connect your home directory to the running container so you can use your files. How do we know we're in a different Linux environment? Try printing out the Linux version, or checking the version of common tools like gcc or Python: $ cat /etc/os-release $ gcc --version $ python3 --version Exit out of the container by typing exit . Type the same commands back on the normal Access Point. Should they give the same results as when typed in the container, or different? 
$ cat /etc/os-release $ gcc --version $ python3 --version","title":"Exploring Apptainer Containers"},{"location":"materials/software/part1-ex1-run-apptainer/#exploring-docker-containers","text":"The process for interactively running a Docker container will be very similar to an apptainer container. The main difference is a docker:// prefix before the container's identifying name. We are going to be using a Python image from Docker Hub . Click on the \"Tags\" tab to see all the different versions of this container that exists. Let's use version 3.10 . To run it interactively, use this command: $ apptainer shell docker://python:3.10 Once the container starts and the prompt changes, try running similar commands as above. What version of Linux is used in this container? Does the version of Python match what you expect, based on the name of the container? Once done, type exit to leave the container.","title":"Exploring Docker Containers"},{"location":"materials/software/part1-ex2-apptainer-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.2: Use Apptainer Containers in OSPool Jobs \u00b6 Objective : Submit a job that uses an existing apptainer container; compare default job environment with a specific container job environment. Why learn this? : By comparing a non-container and container job, you'll better understand what a container can do on the OSPool. This may also be how you end up submitting your jobs if you can find an existing apptainer container with your software. Default Environment \u00b6 First, let's run a job without a container to see what the typical job environment is. Create a bash script with the following lines: #!/bin/bash hostname cat /etc/os-release gcc --version python3 --version This will print out the version of Linux on the computer, the version of gcc , a common software compiler, and the version of Python 3. Make the script executable: $ chmod +x script.sh Run the script on the Access Point. $ ./script.sh What results did you get? Copy a submit file from a previous OSPool job and edit it so that the script you just wrote is the executable. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc or Python? Container Environment \u00b6 Now, let's try running that same script inside a container. For this job, we will use the OSG-provided Ubuntu \"Focal\" image, as we did in the previous exercise. The container_image submit file option will tell HTCondor to use this container for the job: universe = container container_image = /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 If the submit file you copied has something like requirements = (OSGVO_OS_STRING == \"RHEL 9\") , remove that. When you use containers, you should not specify an OS in the requirements as that will unnecessarily limit the number of resources you can run on. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc ? or Python? Experimenting With Other Containers \u00b6 Look at the list of OSG-Supported containers: OSG Supported Containers Try submitting a job that uses one of these containers. 
Change the executable script to explore different aspects of that container.","title":"1.2 - Use Apptainer Containers in OSPool Jobs"},{"location":"materials/software/part1-ex2-apptainer-jobs/#software-exercise-12-use-apptainer-containers-in-ospool-jobs","text":"Objective : Submit a job that uses an existing apptainer container; compare default job environment with a specific container job environment. Why learn this? : By comparing a non-container and container job, you'll better understand what a container can do on the OSPool. This may also be how you end up submitting your jobs if you can find an existing apptainer container with your software.","title":"Software Exercise 1.2: Use Apptainer Containers in OSPool Jobs"},{"location":"materials/software/part1-ex2-apptainer-jobs/#default-environment","text":"First, let's run a job without a container to see what the typical job environment is. Create a bash script with the following lines: #!/bin/bash hostname cat /etc/os-release gcc --version python3 --version This will print out the version of Linux on the computer, the version of gcc , a common software compiler, and the version of Python 3. Make the script executable: $ chmod +x script.sh Run the script on the Access Point. $ ./script.sh What results did you get? Copy a submit file from a previous OSPool job and edit it so that the script you just wrote is the executable. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc or Python?","title":"Default Environment"},{"location":"materials/software/part1-ex2-apptainer-jobs/#container-environment","text":"Now, let's try running that same script inside a container. For this job, we will use the OSG-provided Ubuntu \"Focal\" image, as we did in the previous exercise. The container_image submit file option will tell HTCondor to use this container for the job: universe = container container_image = /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04 If the submit file you copied has something like requirements = (OSGVO_OS_STRING == \"RHEL 9\") , remove that. When you use containers, you should not specify an OS in the requirements as that will unnecessarily limit the number of resources you can run on. Submit the job and read the standard output file when it completes. What version of Linux was used for the job? What is the version of gcc ? or Python?","title":"Container Environment"},{"location":"materials/software/part1-ex2-apptainer-jobs/#experimenting-with-other-containers","text":"Look at the list of OSG-Supported containers: OSG Supported Containers Try submitting a job that uses one of these containers. Change the executable script to explore different aspects of that container.","title":"Experimenting With Other Containers"},{"location":"materials/software/part1-ex3-docker-jobs/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 1.3: Use Docker Containers in OSPool Jobs \u00b6 Objective : Create a local copy of a Docker container, use it to submit a job. Why learn this? : Same as the previous exercise; this may also be how you end up submitting your jobs if you can find an existing Docker container with your software. Create Local Copy of Docker Container \u00b6 While it is technically possible to use a Docker container directly in a job, there are some good reasons for converting it to a local Apptainer container first. 
We'll do this with the same python:3.10 Docker container we used in the first exercise . To convert the Docker container to a local Apptainer container, run: $ apptainer build local-py310.sif docker://python:3.10 The first argument after build is the name of the new Apptainer container file, the second argument is what we're building from (in this case, Docker). Submit File and Executable \u00b6 Make a copy of your submit file from the previous container exercise or build from an existing submit file. Add the following lines to the submit file or modify existing lines to match the lines below: universe = container container_image = local-py310.sif Use the same executable as the previous exercise . Once these steps are done, submit the job. You might get a warning about using OSDF for container transfers - ignore this warning for now. Finding Docker Containers \u00b6 There are a lot of Docker containers on Docker Hub, but they are not all created equal. Anyone can create an account on Docker Hub and share container images there, so it\u2019s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure: The container image is updated regularly. The container image is associated with a well established company, community, or other group that is well-known. There is a Dockerfile or other listing of what has been installed to the container image. The container image page has documentation on how to use the container image. [^1] Given these indicators: Can you find a container on Docker Hub that would be useful for running Jupyter notebooks that use tensorflow? Does your chosen image meet at least 2 of the criteria above? [^1]: This list and previous text taken from Introduction to Docker","title":"1.3 - Use Docker Containers in OSPool Jobs"},{"location":"materials/software/part1-ex3-docker-jobs/#software-exercise-13-use-docker-containers-in-ospool-jobs","text":"Objective : Create a local copy of a Docker container, use it to submit a job. Why learn this? : Same as the previous exercise; this may also be how you end up submitting your jobs if you can find an existing Docker container with your software.","title":"Software Exercise 1.3: Use Docker Containers in OSPool Jobs"},{"location":"materials/software/part1-ex3-docker-jobs/#create-local-copy-of-docker-container","text":"While it is technically possible to use a Docker container directly in a job, there are some good reasons for converting it to a local Apptainer container first. We'll do this with the same python:3.10 Docker container we used in the first exercise . To convert the Docker container to a local Apptainer container, run: $ apptainer build local-py310.sif docker://python:3.10 The first argument after build is the name of the new Apptainer container file, the second argument is what we're building from (in this case, Docker).","title":"Create Local Copy of Docker Container"},{"location":"materials/software/part1-ex3-docker-jobs/#submit-file-and-executable","text":"Make a copy of your submit file from the previous container exercise or build from an existing submit file. Add the following lines to the submit file or modify existing lines to match the lines below: universe = container container_image = local-py310.sif Use the same executable as the previous exercise . Once these steps are done, submit the job. 
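For reference, the pieces above assembled into one complete submit file might look roughly like the following sketch. The log/output/error file names and the resource requests are illustrative assumptions, not values required by the exercise; script.sh is the executable written in the previous exercise.

# sketch of a complete submit file for the local Apptainer image job
universe = container
container_image = local-py310.sif

executable = script.sh

log = py310.log
output = py310.out
error = py310.err

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue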
You might get a warning about using OSDF for container transfers - ignore this warning for now.","title":"Submit File and Executable"},{"location":"materials/software/part1-ex3-docker-jobs/#finding-docker-containers","text":"There are a lot of Docker containers on Docker Hub, but they are not all created equal. Anyone can create an account on Docker Hub and share container images there, so it\u2019s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure: The container image is updated regularly. The container image is associated with a well established company, community, or other group that is well-known. There is a Dockerfile or other listing of what has been installed to the container image. The container image page has documentation on how to use the container image. [^1] Given these indicators: Can you find a container on Docker Hub that would be useful for running Jupyter notebooks that use tensorflow? Does your chosen image meet at least 2 of the criteria above? [^1]: This list and previous text taken from Introduction to Docker","title":"Finding Docker Containers"},{"location":"materials/software/part1-ex4-apptainer-build/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.4: Build, Test, and Deploy an Apptainer Container \u00b6 Objective : to practice building and using a custom apptainer container Why learn this? : You may need to go through this process if you want to use a container for your jobs and can't find one that has what you need. Motivating Script \u00b6 Create a script called hello-cow.py : #!/usr/bin/env python3 import cowsay cowsay.cow('Hello OSG User School') Give it executable permissions: $ chmod +x hello-cow.py Try running the script: $ ./hello-cow.py It will likely fail, because the cowsay library isn't installed. This is a scenario where we will want to build our own container that includes a base Python installation and the cowsay Python library. Preparing a Definition File \u00b6 We can describe our desired Apptainer image in a special format called a definition file . This has special keywords that will direct Apptainer when it builds the container image. Create a file called py-cowsay.def with these contents: Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 %post apt-get update -y apt-get install -y \\ python3-pip \\ python3-numpy python3 -m pip install cowsay Note that we are starting with the same ubuntu base we used in previous exercises. The %post statement includes our installation commands, including updating the pip and numpy packages, and then using pip to install cowsay . To learn more about definition files, see Exercise 3.1 Build the Container \u00b6 Once the definition file is complete, we can build the container. Run the following command to build the container: $ apptainer build py-cowsay.sif py-cowsay.def As with the Docker image in the previous exercise , the first argument is the name to give to the newly create image file and the second argument is how to build the container image - in this case, the definition file. Testing the Image Locally \u00b6 Do you remember how to interactively test an image? Look back at Exercise 1.1 and guess what command would allow us to test our new container. 
Try running: $ apptainer shell py-cowsay.sif Then try running the hello-cow.py script: apptainer> ./hello-cow.py If it produces an output, our container works! We can now exit (by typing exit ) and submit a job. Submit a Job \u00b6 Make a copy of a submit file from a previous exercise in this section. Can you guess what options need to be used or modified? Make sure you have the following (in addition to log , error , output and CPU and memory requests): universe = container container_image = py-cowsay.sif executable = hello-cow.py Submit the job and verify the output when it completes. ______________________ | Hello OSG User School! | ====================== \\ \\ ^__^ (oo)\\_______ (__)\\ )\\/\\ ||----w | || ||","title":"1.4 - Build, Test, and Deploy an Apptainer Container"},{"location":"materials/software/part1-ex4-apptainer-build/#software-exercise-14-build-test-and-deploy-an-apptainer-container","text":"Objective : to practice building and using a custom apptainer container Why learn this? : You may need to go through this process if you want to use a container for your jobs and can't find one that has what you need.","title":"Software Exercise 1.4: Build, Test, and Deploy an Apptainer Container"},{"location":"materials/software/part1-ex4-apptainer-build/#motivating-script","text":"Create a script called hello-cow.py : #!/usr/bin/env python3 import cowsay cowsay.cow('Hello OSG User School') Give it executable permissions: $ chmod +x hello-cow.py Try running the script: $ ./hello-cow.py It will likely fail, because the cowsay library isn't installed. This is a scenario where we will want to build our own container that includes a base Python installation and the cowsay Python library.","title":"Motivating Script"},{"location":"materials/software/part1-ex4-apptainer-build/#preparing-a-definition-file","text":"We can describe our desired Apptainer image in a special format called a definition file . This has special keywords that will direct Apptainer when it builds the container image. Create a file called py-cowsay.def with these contents: Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 %post apt-get update -y apt-get install -y \\ python3-pip \\ python3-numpy python3 -m pip install cowsay Note that we are starting with the same ubuntu base we used in previous exercises. The %post statement includes our installation commands, including updating the pip and numpy packages, and then using pip to install cowsay . To learn more about definition files, see Exercise 3.1","title":"Preparing a Definition File"},{"location":"materials/software/part1-ex4-apptainer-build/#build-the-container","text":"Once the definition file is complete, we can build the container. Run the following command to build the container: $ apptainer build py-cowsay.sif py-cowsay.def As with the Docker image in the previous exercise , the first argument is the name to give to the newly create image file and the second argument is how to build the container image - in this case, the definition file.","title":"Build the Container"},{"location":"materials/software/part1-ex4-apptainer-build/#testing-the-image-locally","text":"Do you remember how to interactively test an image? Look back at Exercise 1.1 and guess what command would allow us to test our new container. Try running: $ apptainer shell py-cowsay.sif Then try running the hello-cow.py script: apptainer> ./hello-cow.py If it produces an output, our container works! 
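If you would rather do a one-off check than open an interactive shell, apptainer exec runs a single command inside the container and then returns; a quick sketch using the image and script built above:

$ apptainer exec py-cowsay.sif ./hello-cow.py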
We can now exit (by typing exit ) and submit a job.","title":"Testing the Image Locally"},{"location":"materials/software/part1-ex4-apptainer-build/#submit-a-job","text":"Make a copy of a submit file from a previous exercise in this section. Can you guess what options need to be used or modified? Make sure you have the following (in addition to log , error , output and CPU and memory requests): universe = container container_image = py-cowsay.sif executable = hello-cow.py Submit the job and verify the output when it completes. ______________________ | Hello OSG User School! | ====================== \\ \\ ^__^ (oo)\\_______ (__)\\ )\\/\\ ||----w | || ||","title":"Submit a Job"},{"location":"materials/software/part1-ex5-pick-an-option/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 1.5 - Choose Software Options \u00b6 Objective : Decide how you want to make your software portable Why learn this? : This is the next step to getting your own research jobs running on the OSPool! Know Your Software \u00b6 Pick at least one software you want to use on the OSPool as a test subject. Then: Find the download and/or installation page and read through the instructions and options there. Is the software available as a binary download, or will you need to run some kind of command to install it or compile it from source? If there are multiple download/installation options, which is which? What pre-requisites does this software need to be installed? Example 1: an R package will require a base R installation Example 2: some codes require that a library called the \"Gnu Scientific Library (GSL) be already installed on your computer\" Choose a Strategy \u00b6 Are there any existing containers that contain this software already? Explore OSG-Supported Containers Explore DockerHub , for example: miniconda rocker jupyter nvidia (and many more!) If yes, try using this container first, as shown in Exercise 1.2 and Exercise 1.3 Is there a simple download or easy compilation process? If so, can you download the software and use it via a wrapper script? See the exercises from Part 4 ( Download Software Files , Use a Wrapper Script , Wrapper Script Arguments ). To learn more about using this approach for specific softwares, see the examples in Part 5 . Are you using conda? See the specific example in Exercise 5.3 If neither of the above options works (which may be true for more software!), you may want to build your own container. If you want to just use this container on the OSPool, build an Apptainer container as described in Exercise 1.4 and with more information in Exercise 3.1 If you want to use the container on your own computer or share with others who would use it on a laptop or desktop, look at the Docker container example in Exercise 3.2 . Don't do ALL of the software exercises in parts 3 - 5! Instead, choose the section(s) that makes sense based on how you want to manage your software. Talk to the School instructors to help make this decision if you are unsure. Create an Executable \u00b6 Regardless of which approach you use, check out the Build an HTC-Friendly Executable exercise for some tips on how to make your script more robust and easy to use with multiple jobs.","title":"1.5 - Choose Software Options"},{"location":"materials/software/part1-ex5-pick-an-option/#software-exercise-15-choose-software-options","text":"Objective : Decide how you want to make your software portable Why learn this? 
: This is the next step to getting your own research jobs running on the OSPool!","title":"Software Exercise 1.5 - Choose Software Options"},{"location":"materials/software/part1-ex5-pick-an-option/#know-your-software","text":"Pick at least one software you want to use on the OSPool as a test subject. Then: Find the download and/or installation page and read through the instructions and options there. Is the software available as a binary download, or will you need to run some kind of command to install it or compile it from source? If there are multiple download/installation options, which is which? What pre-requisites does this software need to be installed? Example 1: an R package will require a base R installation Example 2: some codes require that a library called the \"Gnu Scientific Library (GSL) be already installed on your computer\"","title":"Know Your Software"},{"location":"materials/software/part1-ex5-pick-an-option/#choose-a-strategy","text":"Are there any existing containers that contain this software already? Explore OSG-Supported Containers Explore DockerHub , for example: miniconda rocker jupyter nvidia (and many more!) If yes, try using this container first, as shown in Exercise 1.2 and Exercise 1.3 Is there a simple download or easy compilation process? If so, can you download the software and use it via a wrapper script? See the exercises from Part 4 ( Download Software Files , Use a Wrapper Script , Wrapper Script Arguments ). To learn more about using this approach for specific softwares, see the examples in Part 5 . Are you using conda? See the specific example in Exercise 5.3 If neither of the above options works (which may be true for more software!), you may want to build your own container. If you want to just use this container on the OSPool, build an Apptainer container as described in Exercise 1.4 and with more information in Exercise 3.1 If you want to use the container on your own computer or share with others who would use it on a laptop or desktop, look at the Docker container example in Exercise 3.2 . Don't do ALL of the software exercises in parts 3 - 5! Instead, choose the section(s) that makes sense based on how you want to manage your software. Talk to the School instructors to help make this decision if you are unsure.","title":"Choose a Strategy"},{"location":"materials/software/part1-ex5-pick-an-option/#create-an-executable","text":"Regardless of which approach you use, check out the Build an HTC-Friendly Executable exercise for some tips on how to make your script more robust and easy to use with multiple jobs.","title":"Create an Executable"},{"location":"materials/software/part2-ex1-build-executable/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 2.1 Build an HTC-Friendly Executable \u00b6 Objective : Modify an existing script to include arguments and headers. Why learn this? : A little bit of preparation can make it easier to reuse the same script over and over to run many jobs. Setup \u00b6 Download and unzip a set of Protein Data Bank (PDB) files: $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/alkanes.tar.gz $ tar -xzf alkanes.tar.gz For these exercises, we are going to run a command that counts the number of atoms in the PDB file . Run it now as an example: $ grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb Add a Header \u00b6 To create a basic script, you can put the command above into a file called get_atoms.sh . 
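If you prefer to create that file directly from the command line, one possible way is a heredoc (the file name matches the exercise; the contents are just the command from the Setup step):

$ cat > get_atoms.sh <<'EOF'
grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb
EOF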
To make it clear what language we expect to use to run the script, we will add the following header on the first line: `#!/bin/bash #!/bin/bash grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb The \"header\" of #!/bin/bash will tell the computer that this is a bash shell script and can be run in the same way that you would run individual commands on the command line. We use /bin/bash instead of just bash because that is the full path to the bash software file. Other languages We can use the same principle for any scripting language. For example, the header for a Python script could be either #!/usr/bin/python3 or #!/usr/bin/env python3 . Similar logic works for perl, R, julia and other scripting languages. Can you now run the script? $ ./get_atoms.sh This gives \"permission denied.\" Let's add executable permissions to the script and try again: $ chmod +x get_atoms.sh $ ./get_atoms.sh Incorporate Arguments \u00b6 Can you imagine trying to run this script on all of our pdb files? It would be tedious to edit it for each one, even for only six inputs. Instead, we should add arguments to the script to make it easy to reuse the script. Any information in a script or executable that is going to change or vary across jobs or analyses should likely be turned into an argument that is specified on the command line. In our example above, which pieces of the script are likely to change or vary? The name of the input file ( cubane.pdb ) and output file ( atoms_cubane.pdb ) should be turned into arguments. Can you envision what our script should look like if we ran it with input arguments? Let's say we want to be able to run the following command: $ ./get_atoms.sh cubane.pdb atoms_cubane.pdb In order to get arguments from the command line into the script, you have to use special variables in the script. In bash, these are $1 (for the first argument), $2 (for the second argument) and so on. Try to figure out where these should go in our get_atoms.sh script. Other Languages Each language is going to have its own syntax for reading command line arguments into the script. In Python, sys.argv is a basic method, and more advanced libraries like argparse can be used. In R, the commandArgs() function can do this. Google \"command line arguments in ______\" to find the right syntax for your language of choice! A first pass at adding arguments might look like this: #!/bin/bash grep ATOM $1 | wc -l > $2 Try running it as described above. Does it work? While we now have arguments, we have lost some of the readability of our script. The numbers $1 and $2 are not very meaningful in themselves! Let's rewrite the script to assign the arguments to meaningful variable names: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=$2 grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} Why curly brackets? You'll notice above that we started using curly brackets around our variables. While you technically don't need them ( $PDB_INPUT would also be fine), using them makes the name of the variable (compared to other text) completely clear. This is especially useful when combining variables with underscores. There is one final place where we could optimize this script. If we want our output files to always have the same naming convention, based on the input file name, then we shouldn't have a separate argument for that -- it's asking for typos. Instead, we should use variables inside the script to construct the output file name, based on the input file. 
That will look like this: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=atoms_${PDB_INPUT} grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} You may want to construct other variables, like paths and filenames in this way. But it depends on how you want to use the script! If we want the flexibility of specifying a custom output file name, then we should undo this last change so it can be treated as a separate argument. Your Work \u00b6 Are you using a scripting language where you could add a header to your main script? If so, what should it be? What items in your main code or commands are changing? Do you need to add arguments to your code?","title":"2.1 - Build an HTC-Friendly Executable"},{"location":"materials/software/part2-ex1-build-executable/#software-exercise-21-build-an-htc-friendly-executable","text":"Objective : Modify an existing script to include arguments and headers. Why learn this? : A little bit of preparation can make it easier to reuse the same script over and over to run many jobs.","title":"Software Exercise 2.1 Build an HTC-Friendly Executable"},{"location":"materials/software/part2-ex1-build-executable/#setup","text":"Download and unzip a set of Protein Data Bank (PDB) files: $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/alkanes.tar.gz $ tar -xzf alkanes.tar.gz For these exercises, we are going to run a command that counts the number of atoms in the PDB file . Run it now as an example: $ grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb","title":"Setup"},{"location":"materials/software/part2-ex1-build-executable/#add-a-header","text":"To create a basic script, you can put the command above into a file called get_atoms.sh . To make it clear what language we expect to use to run the script, we will add the following header on the first line: `#!/bin/bash #!/bin/bash grep ATOM cubane.pdb | wc -l > atoms_cubane.pdb The \"header\" of #!/bin/bash will tell the computer that this is a bash shell script and can be run in the same way that you would run individual commands on the command line. We use /bin/bash instead of just bash because that is the full path to the bash software file. Other languages We can use the same principle for any scripting language. For example, the header for a Python script could be either #!/usr/bin/python3 or #!/usr/bin/env python3 . Similar logic works for perl, R, julia and other scripting languages. Can you now run the script? $ ./get_atoms.sh This gives \"permission denied.\" Let's add executable permissions to the script and try again: $ chmod +x get_atoms.sh $ ./get_atoms.sh","title":"Add a Header"},{"location":"materials/software/part2-ex1-build-executable/#incorporate-arguments","text":"Can you imagine trying to run this script on all of our pdb files? It would be tedious to edit it for each one, even for only six inputs. Instead, we should add arguments to the script to make it easy to reuse the script. Any information in a script or executable that is going to change or vary across jobs or analyses should likely be turned into an argument that is specified on the command line. In our example above, which pieces of the script are likely to change or vary? The name of the input file ( cubane.pdb ) and output file ( atoms_cubane.pdb ) should be turned into arguments. Can you envision what our script should look like if we ran it with input arguments? 
Let's say we want to be able to run the following command: $ ./get_atoms.sh cubane.pdb atoms_cubane.pdb In order to get arguments from the command line into the script, you have to use special variables in the script. In bash, these are $1 (for the first argument), $2 (for the second argument) and so on. Try to figure out where these should go in our get_atoms.sh script. Other Languages Each language is going to have its own syntax for reading command line arguments into the script. In Python, sys.argv is a basic method, and more advanced libraries like argparse can be used. In R, the commandArgs() function can do this. Google \"command line arguments in ______\" to find the right syntax for your language of choice! A first pass at adding arguments might look like this: #!/bin/bash grep ATOM $1 | wc -l > $2 Try running it as described above. Does it work? While we now have arguments, we have lost some of the readability of our script. The numbers $1 and $2 are not very meaningful in themselves! Let's rewrite the script to assign the arguments to meaningful variable names: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=$2 grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} Why curly brackets? You'll notice above that we started using curly brackets around our variables. While you technically don't need them ( $PDB_INPUT would also be fine), using them makes the name of the variable (compared to other text) completely clear. This is especially useful when combining variables with underscores. There is one final place where we could optimize this script. If we want our output files to always have the same naming convention, based on the input file name, then we shouldn't have a separate argument for that -- it's asking for typos. Instead, we should use variables inside the script to construct the output file name, based on the input file. That will look like this: #!/bin/bash PDB_INPUT=$1 PDB_ATOM_OUTPUT=atoms_${PDB_INPUT} grep ATOM ${PDB_INPUT} | wc -l > ${PDB_ATOM_OUTPUT} You may want to construct other variables, like paths and filenames in this way. But it depends on how you want to use the script! If we want the flexibility of specifying a custom output file name, then we should undo this last change so it can be treated as a separate argument.","title":"Incorporate Arguments"},{"location":"materials/software/part2-ex1-build-executable/#your-work","text":"Are you using a scripting language where you could add a header to your main script? If so, what should it be? What items in your main code or commands are changing? Do you need to add arguments to your code?","title":"Your Work"},{"location":"materials/software/part3-ex1-apptainer-recipes/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 3.1: Create an Apptainer Definition File \u00b6 Objective : Describe each major section of an Apptainer Definition file. Why learn this? : When building your own containers, it is helpful to understand the basic options and syntax of the \"build\" or definition file. Section Bootstrap/From %files %post %env Where to start \u00b6 Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 A custom container always is always built on an existing container. It is common to use a container on Docker Hub, or in this case, hub.opensciencegrid.org. These lines tell Apptainer to pull the pre-existing image from the hub, and to use it as the base for the container that will be built using this definition file. 
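One way to confirm what a candidate base already provides is to open it interactively, as in Exercise 1.1, before writing the definition file; for example, using the Ubuntu image from the earlier exercises:

$ apptainer shell /cvmfs/singularity.opensciencegrid.org/htc/ubuntu:22.04
Apptainer> python3 --version
Apptainer> which R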
When choosing a base container, try to find one that has most of what you need - for example, if you want to install R packages, try to find a container that already has R installed. Files needed for building or running \u00b6 %files source_code.tar.gz /opt install.R If you need specific files for the installation (like source code) or for the job to execute (like small data files or scripts), they can be copied into the container under the %files section. The first item on a line is what to copy (from your computer) and the optional second item is where it should be copied in the container. Normally the files being copied are in your local working directory where you run the build command. Commands to install \u00b6 %post apt-get update -y apt-get install -y \\ build-essential \\ cmake \\ g++ \\ r-base-dev install2.r tidyverse This is where most of the installation happens. You can use any shell command here that will work in the base container to install software. These commands might include: - Linux installation tools like apt or yum - Scripting specific installers like pip , conda or install.packages() - Shell commands like tar , configure , make Different distributions of Linux often have distinct sets of tools for installing software. The installers for various common Linux distributions are listed below: Ubuntu: apt or apt-get Debian: deb CentOS: yum A web search for \u201cinstall X on Y Linux\u201d is usually a good start for common software installation tasks. [^1] When installing to a custom location, do not install to a home directory. This is likely to get overwritten when the container is run. Instead, /opt is the best directory for custom installations. Environment \u00b6 %environment PATH=/opt/mycode/bin:$PATH JAVA_HOME=/opt/java-1.8 To set environment variables (especially useful for software in a custom location), use the %environment section of the definition file. [^1]: This text and previous list taken from Introduction to Docker","title":"3.1 - Create an Apptainer Definition Files"},{"location":"materials/software/part3-ex1-apptainer-recipes/#software-exercise-31-create-an-apptainer-definition-file","text":"Objective : Describe each major section of an Apptainer Definition file. Why learn this? : When building your own containers, it is helpful to understand the basic options and syntax of the \"build\" or definition file. Section Bootstrap/From %files %post %env","title":"Software Exercise 3.1: Create an Apptainer Definition File"},{"location":"materials/software/part3-ex1-apptainer-recipes/#where-to-start","text":"Bootstrap: docker From: hub.opensciencegrid.org/htc/ubuntu:22.04 A custom container always is always built on an existing container. It is common to use a container on Docker Hub, or in this case, hub.opensciencegrid.org. These lines tell Apptainer to pull the pre-existing image from the hub, and to use it as the base for the container that will be built using this definition file. When choosing a base container, try to find one that has most of what you need - for example, if you want to install R packages, try to find a container that already has R installed.","title":"Where to start"},{"location":"materials/software/part3-ex1-apptainer-recipes/#files-needed-for-building-or-running","text":"%files source_code.tar.gz /opt install.R If you need specific files for the installation (like source code) or for the job to execute (like small data files or scripts), they can be copied into the container under the %files section. 
The first item on a line is what to copy (from your computer) and the optional second item is where it should be copied in the container. Normally the files being copied are in your local working directory where you run the build command.","title":"Files needed for building or running"},{"location":"materials/software/part3-ex1-apptainer-recipes/#commands-to-install","text":"%post apt-get update -y apt-get install -y \\ build-essential \\ cmake \\ g++ \\ r-base-dev install2.r tidyverse This is where most of the installation happens. You can use any shell command here that will work in the base container to install software. These commands might include: - Linux installation tools like apt or yum - Scripting specific installers like pip , conda or install.packages() - Shell commands like tar , configure , make Different distributions of Linux often have distinct sets of tools for installing software. The installers for various common Linux distributions are listed below: Ubuntu: apt or apt-get Debian: deb CentOS: yum A web search for \u201cinstall X on Y Linux\u201d is usually a good start for common software installation tasks. [^1] When installing to a custom location, do not install to a home directory. This is likely to get overwritten when the container is run. Instead, /opt is the best directory for custom installations.","title":"Commands to install"},{"location":"materials/software/part3-ex1-apptainer-recipes/#environment","text":"%environment PATH=/opt/mycode/bin:$PATH JAVA_HOME=/opt/java-1.8 To set environment variables (especially useful for software in a custom location), use the %environment section of the definition file. [^1]: This text and previous list taken from Introduction to Docker","title":"Environment"},{"location":"materials/software/part3-ex2-docker-build/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 3.2: Build Your Own Docker Container (Optional) \u00b6 Objective : Build a custom Docker container with numpy and use it in a job Why learn this? : Docker containers can be run on both your laptop and OSPool. DockerHub also provides a convenient platform for sharing containers. If you want to use a custom container, run across platforms, and/or share a container amongst a group, building in Docker first is a good approach. Python Script \u00b6 For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. This exercise will walk you through the steps to build your own Docker container based on Python, with the numpy Python library added on. Getting Set Up \u00b6 Before building your own Docker container, you need to go through the following set up steps: Install Docker Dekstop on your computer. Docker Desktop page You may need to create a Docker Hub user name to download Docker Desktop; if not created at that step, create a user name for Docker Hub now. (Optional): Once Docker is up and running on your computer, you are welcome to take some time to explore the basics of downloading and running a container, as shown in the initial sections of this Docker lesson: Introduction to Docker However, this isn't strictly necessary for building your own container. 
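Before moving on, it is worth confirming that Docker is installed and the daemon is running; a minimal check from a terminal on your own computer might be:

$ docker --version
$ docker run hello-world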
Building a Container \u00b6 In order to make our container reproducible, we will be using Docker's capability to build a container image from a specification file. First, create an empty build directory on your computer , not the Access Points. In the build directory, create a file called Dockerfile (no file extension!) with the following contents: # Start with this image as a \"base\". # It's as if all the commands that created that image were inserted here. # Always use a specific tag like \"4.10.3\", never \"latest\"! # The version referenced by \"latest\" can change, so the build will be # more stable when building from a specific version tag. FROM continuumio/miniconda3:4.10.3 # Use RUN to execute commands inside the image as it is being built up. RUN conda install --yes numpy # RUN multiple commands together. # Try to always \"clean up\" after yourself to reduce the final size of your image. RUN apt-get update \\ && apt-get --yes install --no-install-recommends graphviz \\ && apt-get --yes clean \\ && rm -rf /var/lib/apt/lists/* This is our specification file and provides Docker with the information it needs to build our new container. There are other options besides FROM and RUN ; see the Docker documentation for more information. Note that our container is starting from an existing container continuumio/miniconda3:4.10.3 . This container is produced by the continuumio organization; the number 4.10.3 indicates the container version. When we create our new container, we will want to use a similar naming scheme of: USERNAME/CONTAINER:VERSIONTAG In what follows, you will want to replace USERNAME with your DockerHub user name. The CONTAINER name and VERSIONTAG are your choice; in what follows, we will use py3-numpy as the container name and 2024-08 as the version tag. To build and name the new container, open a command line window on your computer where you can run Docker commands. Use the cd command to change your working directory to the build directory with the Dockerfile inside. $ docker build -t USERNAME/py3-numpy:2024-08 . Note the . at the end of the command! This indicates that we're using the current directory as our build environment, including the Dockerfile inside. Upload Container and Submit Job \u00b6 Right now the container image only exists on your computer. To use it in CHTC or elsewhere, it needs to be added to a public registry like Docker Hub. To put your container image in Docker Hub, use the docker push command on the command line: $ docker push USERNAME/py3-numpy:2024-08 If the push doesn't work, you may need to run docker login first, enter your Docker Hub username and password and then try the push again. Once your container image is in DockerHub, you can use it in jobs as described in Exercise 1.3 . Thanks to Josh Karpel for providing the original sample Dockerfile !","title":"3.2 - Build Your Own Docker Container"},{"location":"materials/software/part3-ex2-docker-build/#software-exercise-32-build-your-own-docker-container-optional","text":"Objective : Build a custom Docker container with numpy and use it in a job Why learn this? : Docker containers can be run on both your laptop and OSPool. DockerHub also provides a convenient platform for sharing containers. 
If you want to use a custom container, run across platforms, and/or share a container amongst a group, building in Docker first is a good approach.","title":"Software Exercise 3.2: Build Your Own Docker Container (Optional)"},{"location":"materials/software/part3-ex2-docker-build/#python-script","text":"For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. This exercise will walk you through the steps to build your own Docker container based on Python, with the numpy Python library added on.","title":"Python Script"},{"location":"materials/software/part3-ex2-docker-build/#getting-set-up","text":"Before building your own Docker container, you need to go through the following set up steps: Install Docker Dekstop on your computer. Docker Desktop page You may need to create a Docker Hub user name to download Docker Desktop; if not created at that step, create a user name for Docker Hub now. (Optional): Once Docker is up and running on your computer, you are welcome to take some time to explore the basics of downloading and running a container, as shown in the initial sections of this Docker lesson: Introduction to Docker However, this isn't strictly necessary for building your own container.","title":"Getting Set Up"},{"location":"materials/software/part3-ex2-docker-build/#building-a-container","text":"In order to make our container reproducible, we will be using Docker's capability to build a container image from a specification file. First, create an empty build directory on your computer , not the Access Points. In the build directory, create a file called Dockerfile (no file extension!) with the following contents: # Start with this image as a \"base\". # It's as if all the commands that created that image were inserted here. # Always use a specific tag like \"4.10.3\", never \"latest\"! # The version referenced by \"latest\" can change, so the build will be # more stable when building from a specific version tag. FROM continuumio/miniconda3:4.10.3 # Use RUN to execute commands inside the image as it is being built up. RUN conda install --yes numpy # RUN multiple commands together. # Try to always \"clean up\" after yourself to reduce the final size of your image. RUN apt-get update \\ && apt-get --yes install --no-install-recommends graphviz \\ && apt-get --yes clean \\ && rm -rf /var/lib/apt/lists/* This is our specification file and provides Docker with the information it needs to build our new container. There are other options besides FROM and RUN ; see the Docker documentation for more information. Note that our container is starting from an existing container continuumio/miniconda3:4.10.3 . This container is produced by the continuumio organization; the number 4.10.3 indicates the container version. When we create our new container, we will want to use a similar naming scheme of: USERNAME/CONTAINER:VERSIONTAG In what follows, you will want to replace USERNAME with your DockerHub user name. The CONTAINER name and VERSIONTAG are your choice; in what follows, we will use py3-numpy as the container name and 2024-08 as the version tag. To build and name the new container, open a command line window on your computer where you can run Docker commands. Use the cd command to change your working directory to the build directory with the Dockerfile inside. $ docker build -t USERNAME/py3-numpy:2024-08 . Note the . 
at the end of the command! This indicates that we're using the current directory as our build environment, including the Dockerfile inside.","title":"Building a Container"},{"location":"materials/software/part3-ex2-docker-build/#upload-container-and-submit-job","text":"Right now the container image only exists on your computer. To use it in CHTC or elsewhere, it needs to be added to a public registry like Docker Hub. To put your container image in Docker Hub, use the docker push command on the command line: $ docker push USERNAME/py3-numpy:2024-08 If the push doesn't work, you may need to run docker login first, enter your Docker Hub username and password and then try the push again. Once your container image is in DockerHub, you can use it in jobs as described in Exercise 1.3 . Thanks to Josh Karpel for providing the original sample Dockerfile !","title":"Upload Container and Submit Job"},{"location":"materials/software/part4-ex1-download/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 4.1: Using a Pre-compiled Binary \u00b6 Objective : Identify software that can be downloaded; download it and use it to run a job. Why learn this? : Some software doesn't require much \"installation\" - you can just download it and run. Recognizing when this is possible can save you time. Our Software Example \u00b6 The software we will be using for this example is a common tool for aligning genome and protein sequences against a reference database, the BLAST program. Search the internet for the BLAST software. Searches might include \"blast executable or \"download blast software\". Hopefully these searches will lead you to a BLAST website page that looks like this: Click on the title that says \"Download BLAST\" and then look for the link that has the latest installation and source code . This will either open a page in a web browser that looks like this: Or you will be asked to open the link in your file browser (choose the Connect as Guest option): In either case, you should end up on a page with a list of each version of BLAST that is available for different operating systems. We could download the source and compile it ourselves, but instead, we're going to use one of the pre-built binaries. Before proceeding, look at the list of downloads and try to determine which one you want. Based on our operating system, we want to use the Linux binary, which is labelled with the x64-linux suffix. All the other links are either for source code or other operating systems. On the Access Point, create a directory for this exercise. Then download the appropriate tar.gz file and un-tar/decompress it it. If you want to do this all from the command line, the sequence will look like this (using wget as the download command.) user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/ncbi-blast-2.15.0+-x64-linux.tar.gz user@login $ tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz We're going to be using the blastx binary in our job. Where is it in the directory you just decompressed? Copy the Input Files \u00b6 To run BLAST, we need an input file and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. 
Download these files to your current directory: username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: username@login $ tar -xzf pdbaa.tar.gz Submitting the Job \u00b6 We now have our program (the pre-compiled blastx binary) and our input files, so all that remains is to create the submit file. The form of a typical blastx command looks something like this: blastx -db -query -out Copy a submit file from one of the Day 1 exercises or previous software exercises to use for this exercise. Think about which lines you will need to change or add to your submit file in order to submit the job successfully. In particular: What is the executable? How can you indicate the entire command line sequence above? Which files need to be transferred in addition to the executable? Does this job require a certain type of operating system? Do you have any idea how much memory or disk to request? Try to answer these questions and modify your submit file appropriately. Once you have done all you can, check your submit file against the lines below, which contain the exact components to run this particular job. The executable is blastx , which is located in the bin directory of our downloaded BLAST directory. We need to use the arguments line in the submit file to express the rest of the command. executable = ncbi-blast-2.15.0+/bin/blastx arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt The BLAST program requires our input file and database, so they must be transferred with transfer_input_files . transfer_input_files = pdbaa, mouse.fa Let's assume that we've run this program before, and we know that 1GB of disk and 1GB of memory will be MORE than enough (the 'log' file will tell us how accurate we are, after the job runs): request_memory = 1GB request_disk = 1GB Submit the blast job using condor_submit . Once the job starts, it should run in just a few minutes and produce a file called results.txt .","title":"4.1 - Download and Use Compiled Software"},{"location":"materials/software/part4-ex1-download/#software-exercise-41-using-a-pre-compiled-binary","text":"Objective : Identify software that can be downloaded; download it and use it to run a job. Why learn this? : Some software doesn't require much \"installation\" - you can just download it and run. Recognizing when this is possible can save you time.","title":"Software Exercise 4.1: Using a Pre-compiled Binary"},{"location":"materials/software/part4-ex1-download/#our-software-example","text":"The software we will be using for this example is a common tool for aligning genome and protein sequences against a reference database, the BLAST program. Search the internet for the BLAST software. Searches might include \"blast executable or \"download blast software\". Hopefully these searches will lead you to a BLAST website page that looks like this: Click on the title that says \"Download BLAST\" and then look for the link that has the latest installation and source code . This will either open a page in a web browser that looks like this: Or you will be asked to open the link in your file browser (choose the Connect as Guest option): In either case, you should end up on a page with a list of each version of BLAST that is available for different operating systems. We could download the source and compile it ourselves, but instead, we're going to use one of the pre-built binaries. 
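If you are not sure which pre-built binary matches the machine you are working on, you can check the operating system and CPU architecture first; on the Access Point, for example:

$ uname -sm        # expect something like: Linux x86_64
$ cat /etc/os-release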
Before proceeding, look at the list of downloads and try to determine which one you want. Based on our operating system, we want to use the Linux binary, which is labelled with the x64-linux suffix. All the other links are either for source code or other operating systems. On the Access Point, create a directory for this exercise. Then download the appropriate tar.gz file and un-tar/decompress it it. If you want to do this all from the command line, the sequence will look like this (using wget as the download command.) user@login $ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/ncbi-blast-2.15.0+-x64-linux.tar.gz user@login $ tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz We're going to be using the blastx binary in our job. Where is it in the directory you just decompressed?","title":"Our Software Example"},{"location":"materials/software/part4-ex1-download/#copy-the-input-files","text":"To run BLAST, we need an input file and reference database. For this example, we'll use the \"pdbaa\" database, which contains sequences for the protein structure from the Protein Data Bank. For our input file, we'll use an abbreviated fasta file with mouse genome information. Download these files to your current directory: username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz username@login $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa Untar the pdbaa database: username@login $ tar -xzf pdbaa.tar.gz","title":"Copy the Input Files"},{"location":"materials/software/part4-ex1-download/#submitting-the-job","text":"We now have our program (the pre-compiled blastx binary) and our input files, so all that remains is to create the submit file. The form of a typical blastx command looks something like this: blastx -db -query -out Copy a submit file from one of the Day 1 exercises or previous software exercises to use for this exercise. Think about which lines you will need to change or add to your submit file in order to submit the job successfully. In particular: What is the executable? How can you indicate the entire command line sequence above? Which files need to be transferred in addition to the executable? Does this job require a certain type of operating system? Do you have any idea how much memory or disk to request? Try to answer these questions and modify your submit file appropriately. Once you have done all you can, check your submit file against the lines below, which contain the exact components to run this particular job. The executable is blastx , which is located in the bin directory of our downloaded BLAST directory. We need to use the arguments line in the submit file to express the rest of the command. executable = ncbi-blast-2.15.0+/bin/blastx arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt The BLAST program requires our input file and database, so they must be transferred with transfer_input_files . transfer_input_files = pdbaa, mouse.fa Let's assume that we've run this program before, and we know that 1GB of disk and 1GB of memory will be MORE than enough (the 'log' file will tell us how accurate we are, after the job runs): request_memory = 1GB request_disk = 1GB Submit the blast job using condor_submit . 
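For reference, the submit file pieces above gathered into a single file might look roughly like this sketch; the log/output/error file names are illustrative assumptions:

executable = ncbi-blast-2.15.0+/bin/blastx
arguments = -db pdbaa/pdbaa -query mouse.fa -out results.txt

transfer_input_files = pdbaa, mouse.fa

log = blast.log
output = blast.out
error = blast.err

request_memory = 1GB
request_disk = 1GB

queue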
Once the job starts, it should run in just a few minutes and produce a file called results.txt .","title":"Submitting the Job"},{"location":"materials/software/part4-ex2-wrapper/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 4.2: Writing a Wrapper Script \u00b6 Objective : Run downloaded software files via an intermediate, \"wrapper\" script. Why learn this? : This change is a good test of your general HTCondor knowledge and how to translate between executable and submit file. Using wrapper scripts is also a common practice for managing what happens in a job. Background \u00b6 Wrapper scripts are a useful tool for running software that can't be compiled into one piece, needs to be installed with every job, or just for running extra steps. A wrapper script can either install the software from the source code, or use an already existing software (as in this exercise). Not only does this portability technique work with almost any kind of software that can be locally installed, it also allows for a great deal of control and flexibility for what happens within your job. Once you can write a script to handle your software (and often your data as well), you can submit a large variety of workflows to a distributed computing system like the Open Science Grid. For this exercise, we will write a wrapper script as an alternate way to run the same job as the previous exercise. Wrapper Script, part 1 \u00b6 Our wrapper script will be a bash script that runs several commands. In the same directory as the last exercise, make a file called run_blast.sh . The first line we'll place in the script is the basic command for running blast. Based on our previous submit file, what command needs to go into the script? Once you have an idea, check against the example below: #!/bin/bash ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt Submit File Changes \u00b6 We now need to make some changes to our submit file. Make a copy of your previous submit file and open it to edit. Since we are now using a wrapper script, that will be our job's executable. Replace the original blastx exeuctable with the name of our wrapper script and comment out the arguments line. executable = run_blast.sh #arguments = Note that since the blastx program is no longer listed as the executable, it will be need to be included in transfer_input_files . Instead of transferring just that program, we will transfer the original downloaded tar.gz file. To achieve efficiency, we'll also transfer the pdbaa database as the original tar.gz file instead of as the unzipped folder: transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual \"Usage\" values in the log. Before submitting, make sure to make the below additional changes to the wrapper script! Wrapper Script, part 2 \u00b6 Now that our database and BLAST software are being transferred to the job as tar.gz files, our script needs to accommodate. Opening your run_blast.sh script, add two commands at the start to un-tar the BLAST and pdbaa tar.gz files. See the previous exercise if you're not sure what these commands looks like. In order to distinguish this job from our previous job, change the output file name to something besides results.txt . 
The completed script run_blast.sh should look like this: #!/bin/bash tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: username@login $ chmod u+x run_blast.sh Your job is now ready to submit. Submit it using condor_submit and monitor using condor_q .","title":"4.2 - Use a Wrapper Script To Run Software"},{"location":"materials/software/part4-ex2-wrapper/#software-exercise-42-writing-a-wrapper-script","text":"Objective : Run downloaded software files via an intermediate, \"wrapper\" script. Why learn this? : This change is a good test of your general HTCondor knowledge and how to translate between executable and submit file. Using wrapper scripts is also a common practice for managing what happens in a job.","title":"Software Exercise 4.2: Writing a Wrapper Script"},{"location":"materials/software/part4-ex2-wrapper/#background","text":"Wrapper scripts are a useful tool for running software that can't be compiled into one piece, needs to be installed with every job, or just for running extra steps. A wrapper script can either install the software from the source code, or use an already existing software (as in this exercise). Not only does this portability technique work with almost any kind of software that can be locally installed, it also allows for a great deal of control and flexibility for what happens within your job. Once you can write a script to handle your software (and often your data as well), you can submit a large variety of workflows to a distributed computing system like the Open Science Grid. For this exercise, we will write a wrapper script as an alternate way to run the same job as the previous exercise.","title":"Background"},{"location":"materials/software/part4-ex2-wrapper/#wrapper-script-part-1","text":"Our wrapper script will be a bash script that runs several commands. In the same directory as the last exercise, make a file called run_blast.sh . The first line we'll place in the script is the basic command for running blast. Based on our previous submit file, what command needs to go into the script? Once you have an idea, check against the example below: #!/bin/bash ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt","title":"Wrapper Script, part 1"},{"location":"materials/software/part4-ex2-wrapper/#submit-file-changes","text":"We now need to make some changes to our submit file. Make a copy of your previous submit file and open it to edit. Since we are now using a wrapper script, that will be our job's executable. Replace the original blastx exeuctable with the name of our wrapper script and comment out the arguments line. executable = run_blast.sh #arguments = Note that since the blastx program is no longer listed as the executable, it will be need to be included in transfer_input_files . Instead of transferring just that program, we will transfer the original downloaded tar.gz file. To achieve efficiency, we'll also transfer the pdbaa database as the original tar.gz file instead of as the unzipped folder: transfer_input_files = pdbaa.tar.gz, mouse.fa, ncbi-blast-2.15.0+-x64-linux.tar.gz If you really want to be on top of things, look at the log file for the last exercise, and update your memory and disk requests to be just slightly above the actual \"Usage\" values in the log. 
Before submitting, make sure to make the below additional changes to the wrapper script!","title":"Submit File Changes"},{"location":"materials/software/part4-ex2-wrapper/#wrapper-script-part-2","text":"Now that our database and BLAST software are being transferred to the job as tar.gz files, our script needs to accommodate. Opening your run_blast.sh script, add two commands at the start to un-tar the BLAST and pdbaa tar.gz files. See the previous exercise if you're not sure what these commands looks like. In order to distinguish this job from our previous job, change the output file name to something besides results.txt . The completed script run_blast.sh should look like this: #!/bin/bash tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt While not strictly necessary, it's a good idea to enable executable permissions on the wrapper script, like so: username@login $ chmod u+x run_blast.sh Your job is now ready to submit. Submit it using condor_submit and monitor using condor_q .","title":"Wrapper Script, part 2"},{"location":"materials/software/part4-ex3-arguments/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 4.3: Passing Arguments Through the Wrapper Script \u00b6 Objective : Add arguments to a wrapper script to make it more flexible and modular Why learn this? : Using script arguments will allow you to use the same script for multiple jobs, by providing different inputs or parameters. These arguments are normally passed on the command line, but in our world of job submission, the arguments will be listed in the submit file, in the arguments line. Identifying Potential Arguments \u00b6 In the same directory as the last exercise, make sure you're in the directory with your BLAST job submission. What values might we want to input to the script via arguments? Hint: anything that we might want to change if we were to run the script many times. In this example, some values we might want to change are the name of the comparison database, the input file, and the output file. Modifying Files \u00b6 We are going to add three arguments to the wrapper script, controlling the database, input and output file. Make a copy of your last submit file and open it for editing. Add an arguments line, or uncomment the one that exists, and add the three input values mentioned above. The arguments line in your submit file should look like this: arguments = pdbaa mouse.fa results3.txt (We're using results3.txt ) to distinguish between the previous two runs.) For bash (the language of our current wrapper script), the variables $1 , $2 and $3 represent the first, second, and third arguments, respectively. Thus, in the main command of the script, replace the various names with these variables: ./ncbi-blast-2.15.0+/bin/blastx -db $1 / $1 -query $2 -out $3 If your wrapper script is in a different language, you should use that language's syntax for reading in variables from the command line. Once these changes are made, submit your jobs with condor_submit . Use condor_q -nobatch to see what the job command looks like to HTCondor. It is now easy to change the inputs for the job; we can write them into the arguments line of the submit file and they will be propagated to the command in the wrapper script. We can even turn the submit file arguments into their own variables when submitting multiple jobs at once. 
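To sketch that last idea (the second input file and both output file names here are hypothetical, just to show the pattern), the arguments line can reference submit-file variables that the queue statement fills in, one job per line:
arguments = pdbaa $(infile) $(outfile)
queue infile, outfile from (
    mouse.fa results_mouse.txt
    mouse2.fa results_mouse2.txt
)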
Readability with Variables \u00b6 One of the downsides of this approach, is that our command has become harder to read. The original script contains all the information at a glance: ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt But our new version is more cryptic -- what is $1 ?: ./ncbi-blast-2.15.0+/bin/blastx -db $1 -query $2 -out $3 One way to overcome this is to create our own variable names inside the wrapper script and assign the argument values to them. Here is an example for our BLAST script: #!/bin/bash DATABASE = $1 INFILE = $2 OUTFILE = $3 tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db $DATABASE / $DATABASE -query $INFILE -out $OUTFILE Here, we are assigning the input arguments ( $1 , $2 and $3 ) to new variable names, and then using those names ( $DATABASE , $INFILE , and $OUTFILE ) in the command, which is easier to read. Edit your script to match the above syntax. Submit your jobs with condor_submit . When the job finishes, look at the job's standard output file to see how the variables printed.","title":"4.3 - Using Arguments With Wrapper Scripts"},{"location":"materials/software/part4-ex3-arguments/#software-exercise-43-passing-arguments-through-the-wrapper-script","text":"Objective : Add arguments to a wrapper script to make it more flexible and modular Why learn this? : Using script arguments will allow you to use the same script for multiple jobs, by providing different inputs or parameters. These arguments are normally passed on the command line, but in our world of job submission, the arguments will be listed in the submit file, in the arguments line.","title":"Software Exercise 4.3: Passing Arguments Through the Wrapper Script"},{"location":"materials/software/part4-ex3-arguments/#identifying-potential-arguments","text":"In the same directory as the last exercise, make sure you're in the directory with your BLAST job submission. What values might we want to input to the script via arguments? Hint: anything that we might want to change if we were to run the script many times. In this example, some values we might want to change are the name of the comparison database, the input file, and the output file.","title":"Identifying Potential Arguments"},{"location":"materials/software/part4-ex3-arguments/#modifying-files","text":"We are going to add three arguments to the wrapper script, controlling the database, input and output file. Make a copy of your last submit file and open it for editing. Add an arguments line, or uncomment the one that exists, and add the three input values mentioned above. The arguments line in your submit file should look like this: arguments = pdbaa mouse.fa results3.txt (We're using results3.txt ) to distinguish between the previous two runs.) For bash (the language of our current wrapper script), the variables $1 , $2 and $3 represent the first, second, and third arguments, respectively. Thus, in the main command of the script, replace the various names with these variables: ./ncbi-blast-2.15.0+/bin/blastx -db $1 / $1 -query $2 -out $3 If your wrapper script is in a different language, you should use that language's syntax for reading in variables from the command line. Once these changes are made, submit your jobs with condor_submit . Use condor_q -nobatch to see what the job command looks like to HTCondor. 
It is now easy to change the inputs for the job; we can write them into the arguments line of the submit file and they will be propagated to the command in the wrapper script. We can even turn the submit file arguments into their own variables when submitting multiple jobs at once.","title":"Modifying Files"},{"location":"materials/software/part4-ex3-arguments/#readability-with-variables","text":"One of the downsides of this approach, is that our command has become harder to read. The original script contains all the information at a glance: ./ncbi-blast-2.15.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results2.txt But our new version is more cryptic -- what is $1 ?: ./ncbi-blast-2.15.0+/bin/blastx -db $1 -query $2 -out $3 One way to overcome this is to create our own variable names inside the wrapper script and assign the argument values to them. Here is an example for our BLAST script: #!/bin/bash DATABASE = $1 INFILE = $2 OUTFILE = $3 tar -xzf ncbi-blast-2.15.0+-x64-linux.tar.gz tar -xzf pdbaa.tar.gz ./ncbi-blast-2.15.0+/bin/blastx -db $DATABASE / $DATABASE -query $INFILE -out $OUTFILE Here, we are assigning the input arguments ( $1 , $2 and $3 ) to new variable names, and then using those names ( $DATABASE , $INFILE , and $OUTFILE ) in the command, which is easier to read. Edit your script to match the above syntax. Submit your jobs with condor_submit . When the job finishes, look at the job's standard output file to see how the variables printed.","title":"Readability with Variables"},{"location":"materials/software/part5-ex1-prepackaged/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.1: Pre-package a Research Code \u00b6 Objective : Install software (HMMER) to a folder and run it in a job using a wrapper script. Why learn this? : If not using a container, this is a template for how to create a portable software installation using your own files, especially if the software is not available already compiled for Linux. Our Software Example \u00b6 For this exercise, we will be using the bioinformatics package HMMER. HMMER is a good example of software that is not compiled to a single executable; it has multiple executables as well as a helper library. Create a directory for this exercise on the Access Point. Do an internet search to find the HMMER software downloads page and the installation instructions page. On the installation page, there are short instructions for how to install HMMER. There are two options shown for installation -- which should we use? For the purposes of this example, we are going to use the instructions under the \"Current version\" heading, with the \"Source\" link. Download the HMMER source using wget. Go back to the installation documentation page and look at the steps for compiling from source. This process should be similar to what was described in the lecture! Installation \u00b6 Normally, it is better to install software on a dedicated \"build\" server, but for this example, we are going to compile directly on the Access Point Before we follow the installation instructions, we should create a directory to hold our installation. You can create this in the current directory. username@host $ mkdir hmmer-build Now run the commands to unpack the source code: username@host $ tar -zxf hmmer-3.4.tar.gz username@host $ cd hmmer-3.4 Now we can follow the second set of installation instructions. 
For the prefix, we'll use the variable $PWD to capture the name of our current working directory and then a relative path to the hmmer-build directory we created in step 1: username@host $ ./configure --prefix = $PWD /../hmmer-build username@host $ make username@host $ make install Go back to the previous working directory : username@host $ cd .. and confirm that our installation procedure created bin , lib , and share directories in the hmmer-build folder: username@host $ ls hmmer-build bin share Now we want to package up our installation, so we can use it in other jobs. We can do this by compressing any necessary directories into a single gzipped tarball. username@host $ tar -czf hmmer-build.tar.gz hmmer-build Note that we now have two tarballs in our directory -- the source tarball ( hmmer.tar.gz ), which we will no longer need and our newly built installation ( hmmer-build.tar.gz ) which is what we will actually be using to run jobs. Wrapper Script \u00b6 Now that we've created our portable installation, we need to write a script that opens and uses the installation, similar to the process we used in a previous exercise . These steps should be performed back on the submit server ( ap1.facility.path-cc.io ). Create a script called run_hmmer.sh . The script will first need to untar our installation, so the script should start out like this: #!/bin/bash tar -xzf hmmer-build.tar.gz We're going to use the same $PWD trick from the installation in order to tell the computer how to find HMMER. We will do this by setting the PATH environment variable, to include the directory where HMMER is installed: export PATH = $PWD /hmmer-build/bin: $PATH Finally, the wrapper script needs to not only setup HMMER, but actually run the program. Add the following lines to your run_hmmer.sh wrapper script. hmmbuild globins4.hmm globins4.sto hmmsearch -o search-results.txt globins4.hmm globins45.fa Make sure the wrapper script has executable permissions: username@login $ chmod u+x run_hmmer.sh Run a HMMER job \u00b6 We're almost ready! We need two more pieces to run a HMMER job. We're going to use some of the tutorial files provided with the HMMER download to run the job. You already have these files back in the directory where you unpacked the source code: username@login $ ls hmmer-3.4/tutorial 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto If you don't see these files, you may want to redownload the hmmer.tar.gz file and untar it here. Our last step is to create a submit file for our HMMER job. Think about which lines this submit file will need. Make a copy of a previous submit file (you could use the blast submit file from a previous exercise as a base) and modify it as you think necessary. The two most important lines to modify for this job are listed below; check them against your own submit file: executable = run_hmmer.sh transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/ A wrapper script will always be a job's executable . When using a wrapper script, you must also always remember to transfer the software/source code using transfer_input_files . Note The / in the transfer_input_files line indicates that we are transferring the contents of that directory (which in this case, is what we want), rather than the directory itself. Submit the job with condor_submit . Once the job completes, it should produce a search-results.txt file. 
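Assembled from the snippets above, the finished run_hmmer.sh wrapper script would look something like this (the same lines as in the Wrapper Script section, just gathered in one place):
#!/bin/bash
tar -xzf hmmer-build.tar.gz
export PATH=$PWD/hmmer-build/bin:$PATH
hmmbuild globins4.hmm globins4.sto
hmmsearch -o search-results.txt globins4.hmm globins45.fa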
Note For a very similar compiling example, see this guide on how to compile samtools : Example Software Compilation","title":"5.1 - Compiling a Research Software"},{"location":"materials/software/part5-ex1-prepackaged/#software-exercise-51-pre-package-a-research-code","text":"Objective : Install software (HMMER) to a folder and run it in a job using a wrapper script. Why learn this? : If not using a container, this is a template for how to create a portable software installation using your own files, especially if the software is not available already compiled for Linux.","title":"Software Exercise 5.1: Pre-package a Research Code"},{"location":"materials/software/part5-ex1-prepackaged/#our-software-example","text":"For this exercise, we will be using the bioinformatics package HMMER. HMMER is a good example of software that is not compiled to a single executable; it has multiple executables as well as a helper library. Create a directory for this exercise on the Access Point. Do an internet search to find the HMMER software downloads page and the installation instructions page. On the installation page, there are short instructions for how to install HMMER. There are two options shown for installation -- which should we use? For the purposes of this example, we are going to use the instructions under the \"Current version\" heading, with the \"Source\" link. Download the HMMER source using wget. Go back to the installation documentation page and look at the steps for compiling from source. This process should be similar to what was described in the lecture!","title":"Our Software Example"},{"location":"materials/software/part5-ex1-prepackaged/#installation","text":"Normally, it is better to install software on a dedicated \"build\" server, but for this example, we are going to compile directly on the Access Point Before we follow the installation instructions, we should create a directory to hold our installation. You can create this in the current directory. username@host $ mkdir hmmer-build Now run the commands to unpack the source code: username@host $ tar -zxf hmmer-3.4.tar.gz username@host $ cd hmmer-3.4 Now we can follow the second set of installation instructions. For the prefix, we'll use the variable $PWD to capture the name of our current working directory and then a relative path to the hmmer-build directory we created in step 1: username@host $ ./configure --prefix = $PWD /../hmmer-build username@host $ make username@host $ make install Go back to the previous working directory : username@host $ cd .. and confirm that our installation procedure created bin , lib , and share directories in the hmmer-build folder: username@host $ ls hmmer-build bin share Now we want to package up our installation, so we can use it in other jobs. We can do this by compressing any necessary directories into a single gzipped tarball. username@host $ tar -czf hmmer-build.tar.gz hmmer-build Note that we now have two tarballs in our directory -- the source tarball ( hmmer.tar.gz ), which we will no longer need and our newly built installation ( hmmer-build.tar.gz ) which is what we will actually be using to run jobs.","title":"Installation"},{"location":"materials/software/part5-ex1-prepackaged/#wrapper-script","text":"Now that we've created our portable installation, we need to write a script that opens and uses the installation, similar to the process we used in a previous exercise . These steps should be performed back on the submit server ( ap1.facility.path-cc.io ). 
Create a script called run_hmmer.sh . The script will first need to untar our installation, so the script should start out like this: #!/bin/bash tar -xzf hmmer-build.tar.gz We're going to use the same $PWD trick from the installation in order to tell the computer how to find HMMER. We will do this by setting the PATH environment variable, to include the directory where HMMER is installed: export PATH = $PWD /hmmer-build/bin: $PATH Finally, the wrapper script needs to not only setup HMMER, but actually run the program. Add the following lines to your run_hmmer.sh wrapper script. hmmbuild globins4.hmm globins4.sto hmmsearch -o search-results.txt globins4.hmm globins45.fa Make sure the wrapper script has executable permissions: username@login $ chmod u+x run_hmmer.sh","title":"Wrapper Script"},{"location":"materials/software/part5-ex1-prepackaged/#run-a-hmmer-job","text":"We're almost ready! We need two more pieces to run a HMMER job. We're going to use some of the tutorial files provided with the HMMER download to run the job. You already have these files back in the directory where you unpacked the source code: username@login $ ls hmmer-3.4/tutorial 7LESS_DROME fn3.hmm globins45.fa globins4.sto MADE1.hmm Pkinase.hmm dna_target.fa fn3.sto globins4.hmm HBB_HUMAN MADE1.sto Pkinase.sto If you don't see these files, you may want to redownload the hmmer.tar.gz file and untar it here. Our last step is to create a submit file for our HMMER job. Think about which lines this submit file will need. Make a copy of a previous submit file (you could use the blast submit file from a previous exercise as a base) and modify it as you think necessary. The two most important lines to modify for this job are listed below; check them against your own submit file: executable = run_hmmer.sh transfer_input_files = hmmer-build.tar.gz, hmmer-3.4/tutorial/ A wrapper script will always be a job's executable . When using a wrapper script, you must also always remember to transfer the software/source code using transfer_input_files . Note The / in the transfer_input_files line indicates that we are transferring the contents of that directory (which in this case, is what we want), rather than the directory itself. Submit the job with condor_submit . Once the job completes, it should produce a search-results.txt file. Note For a very similar compiling example, see this guide on how to compile samtools : Example Software Compilation","title":"Run a HMMER job"},{"location":"materials/software/part5-ex2-python/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.2: Using Python, Pre-Built \u00b6 In this exercise, you will install Python, package your installation, and then use it to run jobs. It should take about 20 minutes. Background \u00b6 Objective : Install software (Python) to a folder and run it in a job using a wrapper script. Why learn this? : This is very similar to the previous exercise . Pre-Building \u00b6 The first step in our job process is building a Python installation that we can package up. Create a directory for this exercise on the Access Point and cd into it. Download the Python source code from https://www.python.org/ . username@login $ wget https://www.python.org/ftp/python/3.10.5/Python-3.10.5.tgz First, we have to determine how to install Python to a specific location in our working directory. 
Untar the Python source tarball ( tar -xzf Python-3.10.5.tgz ) and look at the README.rst file in the Python-3.10.5 directory ( cd Python-3.10.5 ). You'll want to look for the \"Build Instructions\" header. What will the main installation steps be? What command is required for the final installation? Once you've tried to answer these questions, move to the next step. There are some basic installation instructions near the top of the README . Based on that short introduction, we can see the main steps of installation will be: ./configure make make test sudo make install This three-stage process (configure, make, make install) is a common way to install many software packages. The default installation location for Python requires sudo (administrative privileges) to install. However, we'd like to install to a specific location in the working directory so that we can compress that installation directory into a tarball. You can often use an option called -prefix with the configure script to change the default installation directory. Let's see if the Python configure script has this option by using the \"help\" option (as suggested in the README.rst file): username@host $ ./configure --help Sure enough, there's a list of all the different options that can be passed to the configure script, which includes --prefix . (To see the --prefix option, you may need to scroll towards the top of the output.) Therefore, we can use the $PWD command in order to set the path correctly to a custom installation directory. Now let's actually install Python! From the original working directory , create a directory to hold the installation. username@host $ cd ../ username@host $ mkdir python310 Move into the Python-3.10.5 directory and run the installation commands. These may take a few minutes each. username@host $ cd Python-3.10.5 username@host $ ./configure --prefix = $PWD /../python310 username@host $ make username@host $ make install Note The installation instructions in the README.rst file have a make test step between the make and make install steps. As this step isn't strictly necessary (and takes a long time), it's been omitted above. If I move back to the main job working directory, and look in the python subdirectory, I should see a Python installation. username@host $ cd .. username@host $ ls python310/ bin include lib share I have successfully created a self-contained Python installation. Now it just needs to be tarred up! username@host $ tar -czf prebuilt_python.tar.gz python310/ We might want to know how we installed Python for later reference. Enter the following commands to save our history to a file: username@host $ history > python_install.txt Python Script \u00b6 Create a script with the following lines called fib.py . import sys import os if len ( sys . argv ) != 2 : print ( 'Usage: %s MAXIMUM' % ( os . path . basename ( sys . argv [ 0 ]))) sys . exit ( 1 ) maximum = int ( sys . argv [ 1 ]) n1 = n2 = 1 while n2 <= maximum : n1 , n2 = n2 , n1 + n2 print ( 'The greatest Fibonacci number up to %d is %d ' % ( maximum , n1 )) What command line arguments does this script take? Try running it on the submit server. Wrapper Script \u00b6 We now have our Python installation and our Python script - we just need to write a wrapper script to run them. What steps do you think the wrapper script needs to perform? Create a file called run_fib.sh and write them out in plain English before moving to the next step. 
Our script will need to untar our prebuilt_python.tar.gz file access the python command from our installation to run our fib.py script Try turning your plain English steps into commands that the computer can run. Your final run_fib.sh script should look something like this: #!/bin/bash tar -xzf prebuilt_python.tar.gz python310/bin/python3 fib.py 90 or #!/bin/bash tar -xzf prebuilt_python.tar.gz export PATH = $( pwd ) /python310/bin: $PATH python3 fib.py 90 Make sure your run_fib.sh script is executable. Submit File \u00b6 Make a copy of a previous submit file in your local directory (the submit file from the Use a Wrapper Script exercise might be a good candidate). What changes need to be made to run this Python job? Modify your submit file, then make sure you've included the key lines below: executable = run_fib.sh transfer_input_files = fib.py, prebuilt_python.tar.gz Submit the job using condor_submit . Check the .out file to see if the job completed.","title":"5.2 - Compiling Python and Running Jobs"},{"location":"materials/software/part5-ex2-python/#software-exercise-52-using-python-pre-built","text":"In this exercise, you will install Python, package your installation, and then use it to run jobs. It should take about 20 minutes.","title":"Software Exercise 5.2: Using Python, Pre-Built"},{"location":"materials/software/part5-ex2-python/#background","text":"Objective : Install software (Python) to a folder and run it in a job using a wrapper script. Why learn this? : This is very similar to the previous exercise .","title":"Background"},{"location":"materials/software/part5-ex2-python/#pre-building","text":"The first step in our job process is building a Python installation that we can package up. Create a directory for this exercise on the Access Point and cd into it. Download the Python source code from https://www.python.org/ . username@login $ wget https://www.python.org/ftp/python/3.10.5/Python-3.10.5.tgz First, we have to determine how to install Python to a specific location in our working directory. Untar the Python source tarball ( tar -xzf Python-3.10.5.tgz ) and look at the README.rst file in the Python-3.10.5 directory ( cd Python-3.10.5 ). You'll want to look for the \"Build Instructions\" header. What will the main installation steps be? What command is required for the final installation? Once you've tried to answer these questions, move to the next step. There are some basic installation instructions near the top of the README . Based on that short introduction, we can see the main steps of installation will be: ./configure make make test sudo make install This three-stage process (configure, make, make install) is a common way to install many software packages. The default installation location for Python requires sudo (administrative privileges) to install. However, we'd like to install to a specific location in the working directory so that we can compress that installation directory into a tarball. You can often use an option called -prefix with the configure script to change the default installation directory. Let's see if the Python configure script has this option by using the \"help\" option (as suggested in the README.rst file): username@host $ ./configure --help Sure enough, there's a list of all the different options that can be passed to the configure script, which includes --prefix . (To see the --prefix option, you may need to scroll towards the top of the output.) 
Therefore, we can use the $PWD command in order to set the path correctly to a custom installation directory. Now let's actually install Python! From the original working directory , create a directory to hold the installation. username@host $ cd ../ username@host $ mkdir python310 Move into the Python-3.10.5 directory and run the installation commands. These may take a few minutes each. username@host $ cd Python-3.10.5 username@host $ ./configure --prefix = $PWD /../python310 username@host $ make username@host $ make install Note The installation instructions in the README.rst file have a make test step between the make and make install steps. As this step isn't strictly necessary (and takes a long time), it's been omitted above. If I move back to the main job working directory, and look in the python subdirectory, I should see a Python installation. username@host $ cd .. username@host $ ls python310/ bin include lib share I have successfully created a self-contained Python installation. Now it just needs to be tarred up! username@host $ tar -czf prebuilt_python.tar.gz python310/ We might want to know how we installed Python for later reference. Enter the following commands to save our history to a file: username@host $ history > python_install.txt","title":"Pre-Building"},{"location":"materials/software/part5-ex2-python/#python-script","text":"Create a script with the following lines called fib.py . import sys import os if len ( sys . argv ) != 2 : print ( 'Usage: %s MAXIMUM' % ( os . path . basename ( sys . argv [ 0 ]))) sys . exit ( 1 ) maximum = int ( sys . argv [ 1 ]) n1 = n2 = 1 while n2 <= maximum : n1 , n2 = n2 , n1 + n2 print ( 'The greatest Fibonacci number up to %d is %d ' % ( maximum , n1 )) What command line arguments does this script take? Try running it on the submit server.","title":"Python Script"},{"location":"materials/software/part5-ex2-python/#wrapper-script","text":"We now have our Python installation and our Python script - we just need to write a wrapper script to run them. What steps do you think the wrapper script needs to perform? Create a file called run_fib.sh and write them out in plain English before moving to the next step. Our script will need to untar our prebuilt_python.tar.gz file access the python command from our installation to run our fib.py script Try turning your plain English steps into commands that the computer can run. Your final run_fib.sh script should look something like this: #!/bin/bash tar -xzf prebuilt_python.tar.gz python310/bin/python3 fib.py 90 or #!/bin/bash tar -xzf prebuilt_python.tar.gz export PATH = $( pwd ) /python310/bin: $PATH python3 fib.py 90 Make sure your run_fib.sh script is executable.","title":"Wrapper Script"},{"location":"materials/software/part5-ex2-python/#submit-file","text":"Make a copy of a previous submit file in your local directory (the submit file from the Use a Wrapper Script exercise might be a good candidate). What changes need to be made to run this Python job? Modify your submit file, then make sure you've included the key lines below: executable = run_fib.sh transfer_input_files = fib.py, prebuilt_python.tar.gz Submit the job using condor_submit . 
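If it helps to see everything in one place, a fully assembled submit file for this job might look roughly like the following (the log, output, and error names and the resource requests are our own illustrative assumptions):
executable = run_fib.sh
transfer_input_files = fib.py, prebuilt_python.tar.gz
log = fib.log
output = fib.out
error = fib.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue 1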
Check the .out file to see if the job completed.","title":"Submit File"},{"location":"materials/software/part5-ex3-conda/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: #008; } Software Exercise 5.3: Using Conda Environments \u00b6 Objective : Create a portable conda environment and use it in a job. Why learn this? : If you normally use conda to manage your Python environments, this method of software portability offers great similarity to your usual practices. Introduction \u00b6 Many Python users manage their Python installation and environments with either the Anaconda or miniconda distributions. These distribution tools are great for creating portable Python installations and can be used on HTC systems with some help from a tool called conda pack . Sample Script \u00b6 For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library. Create and Pack a Conda Environment \u00b6 (For a generic version of these instructions, see the CHTC User Guide ) Our first step is to create a miniconda installation on the submit server. You should be logged into whichever server you made the rand_array.py script on. Download the latest Linux miniconda installer user@login $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh Run the installer to install miniconda; you'll need to accept the license terms and you can use the default installation location: [user@login]$ sh Miniconda3-latest-Linux-x86_64.sh At the end, you can choose whether or not to \"initialize Miniconda3 by running conda init?\" The default is no; you would then run the eval command listed by the installer to \"activate\" Miniconda. If you choose \"no\" you'll want to save this command so that you can reactivate the Miniconda installation when needed in the future. Next we'll create our conda \"environment\" with numpy (we've called the environment \"py3-numpy\"): (base) [user@login]$ conda create -n py3-numpy (base) [user@login]$ conda activate py3-numpy (py3-numpy) [user@login]$ conda install -c conda-forge numpy Once everything is installed, deactivate the environment to go back to the Miniconda \"base\" environment. (py3-numpy) [user@login]$ conda deactivate We'll now install a tool that will pack up the just created conda environment so we can run it elsewhere. Make sure that your job's Miniconda environment is created, but deactivated, so that you're in the \"base\" Miniconda environment, then run: (base) [user@login]$ conda install -c conda-forge conda-pack Enter y when it asks you to install. Finally, we will run the conda pack command, which will automatically create a tar.gz file with our environment: (base) [user@login]$ conda pack -n py3-numpy Submit a Job \u00b6 The executable for this job will need to be a wrapper script. What steps do you think need to be included? Write down a rough draft, then compare with the following script. Create a wrapper script like the following: #!/bin/bash set -e export PATH mkdir py3-numpy tar -xzf py3-numpy.tar.gz -C py3-numpy . py3-numpy/bin/activate python3 rand_array.py What needs to be included in your submit file for the job to run successfully? Try yourself and then check the suggestions in the next point. 
In your submit file, make sure to have the following: Your executable should be the the bash script you created in the previous step. Remember to transfer your Python script and the environment tar.gz file via transfer_input_files . Submit the job and see what happens!","title":"5.3 - Using Conda Environments"},{"location":"materials/software/part5-ex3-conda/#software-exercise-53-using-conda-environments","text":"Objective : Create a portable conda environment and use it in a job. Why learn this? : If you normally use conda to manage your Python environments, this method of software portability offers great similarity to your usual practices.","title":"Software Exercise 5.3: Using Conda Environments"},{"location":"materials/software/part5-ex3-conda/#introduction","text":"Many Python users manage their Python installation and environments with either the Anaconda or miniconda distributions. These distribution tools are great for creating portable Python installations and can be used on HTC systems with some help from a tool called conda pack .","title":"Introduction"},{"location":"materials/software/part5-ex3-conda/#sample-script","text":"For this example, create a script called rand_array.py on the Access Point. import numpy as np #numpy array with random values a = np.random.rand(4,2,3) print(a) To run this script, we will need a copy of Python with the numpy library.","title":"Sample Script"},{"location":"materials/software/part5-ex3-conda/#create-and-pack-a-conda-environment","text":"(For a generic version of these instructions, see the CHTC User Guide ) Our first step is to create a miniconda installation on the submit server. You should be logged into whichever server you made the rand_array.py script on. Download the latest Linux miniconda installer user@login $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh Run the installer to install miniconda; you'll need to accept the license terms and you can use the default installation location: [user@login]$ sh Miniconda3-latest-Linux-x86_64.sh At the end, you can choose whether or not to \"initialize Miniconda3 by running conda init?\" The default is no; you would then run the eval command listed by the installer to \"activate\" Miniconda. If you choose \"no\" you'll want to save this command so that you can reactivate the Miniconda installation when needed in the future. Next we'll create our conda \"environment\" with numpy (we've called the environment \"py3-numpy\"): (base) [user@login]$ conda create -n py3-numpy (base) [user@login]$ conda activate py3-numpy (py3-numpy) [user@login]$ conda install -c conda-forge numpy Once everything is installed, deactivate the environment to go back to the Miniconda \"base\" environment. (py3-numpy) [user@login]$ conda deactivate We'll now install a tool that will pack up the just created conda environment so we can run it elsewhere. Make sure that your job's Miniconda environment is created, but deactivated, so that you're in the \"base\" Miniconda environment, then run: (base) [user@login]$ conda install -c conda-forge conda-pack Enter y when it asks you to install. Finally, we will run the conda pack command, which will automatically create a tar.gz file with our environment: (base) [user@login]$ conda pack -n py3-numpy","title":"Create and Pack a Conda Environment"},{"location":"materials/software/part5-ex3-conda/#submit-a-job","text":"The executable for this job will need to be a wrapper script. What steps do you think need to be included? 
Write down a rough draft, then compare with the following script. Create a wrapper script like the following: #!/bin/bash set -e export PATH mkdir py3-numpy tar -xzf py3-numpy.tar.gz -C py3-numpy . py3-numpy/bin/activate python3 rand_array.py What needs to be included in your submit file for the job to run successfully? Try yourself and then check the suggestions in the next point. In your submit file, make sure to have the following: Your executable should be the the bash script you created in the previous step. Remember to transfer your Python script and the environment tar.gz file via transfer_input_files . Submit the job and see what happens!","title":"Submit a Job"},{"location":"materials/software/part5-ex4-compiling/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Software Exercise 5.4: Compile Statically Linked Code \u00b6 Objective : Compile code using static linking, explain why this can be useful. Why learn this? : When code is compiled, it is usually linked to other pieces of code on the computer. This can cause it to not work when moved to other computers. Static linking means that all the needed references are included in the compiled code, meaning that it can run almost anywhere. Our Software Example \u00b6 For this compiling example, we will use a script written in C. C code depends on libraries and therefore will benefit from being statically linked. Our C code prints 7 rows of Pascal's triangle. Log into the Access Point. Create a directory for this exercise and cd into it. Copy and paste the following code into a file named pascal.c . #include \"stdio.h\" long factorial ( int ); int main () { int i , n , c ; n = 7 ; for ( i = 0 ; i < n ; i ++ ){ for ( c = 0 ; c <= ( n - i - 2 ); c ++ ) printf ( \" \" ); for ( c = 0 ; c <= i ; c ++ ) printf ( \"%ld \" , factorial ( i ) / ( factorial ( c ) * factorial ( i - c ))); printf ( \" \\n \" ); } return 0 ; } long factorial ( int n ) { int c ; long result = 1 ; for ( c = 1 ; c <= n ; c ++ ) result = result * c ; return result ; } Compiling \u00b6 In order to use this code in a job, we will first need to statically compile the code. Most linux servers (including our Access Point) have the gcc (GNU compiler collection) installed, so we already have a compiler on the Access Point. Furthermore, this is a simple piece of C code, so the compilation will not be computationally intensive. Thus, we should be able to compile directly on the Access Point Compile the code, using the command: username@login $ gcc -static pascal.c -o pascal Note that we have added the -static option to make sure that the compiled binary includes the necessary libraries. This will allow the code to run on any Linux machine, no matter where those libraries are located. Verify that the compiled binary was statically linked: username@login $ file pascal The Linux file command provides information about the type or kind of file that is given as an argument. In this case, you should get output like this: username@host $ file pascal pascal: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, not stripped The output clearly states that this executable (software) is statically linked. The same command run on a non-statically linked executable file would include the text dynamically linked (uses shared libs) instead. 
So with this simple verification step, which could even be run on files that you did not compile yourself, you have some further reassurance that it is safe to use on other Linux machines. (Bonus exercise: Try the file command on lots of other files) Submit the Job \u00b6 Now that our code is compiled, we can use it to submit a job. Think about what submit file lines we need to use to run this job: Are there input files? Are there command line arguments? Where is its output written? Based on what you thought about in 1., find a submit file from earlier that you can modify to run our compiled pascal code. Copy it to the directory with the pascal binary and make those changes. Submit the job using condor_submit . Once the job has run and left the queue, you should be able to see the results (seven rows of Pascal's triangle) in the .out file created by the job.","title":"5.4 - Compiling and Running a Simple Code"},{"location":"materials/software/part5-ex4-compiling/#software-exercise-54-compile-statically-linked-code","text":"Objective : Compile code using static linking, explain why this can be useful. Why learn this? : When code is compiled, it is usually linked to other pieces of code on the computer. This can cause it to not work when moved to other computers. Static linking means that all the needed references are included in the compiled code, meaning that it can run almost anywhere.","title":"Software Exercise 5.4: Compile Statically Linked Code"},{"location":"materials/software/part5-ex4-compiling/#our-software-example","text":"For this compiling example, we will use a script written in C. C code depends on libraries and therefore will benefit from being statically linked. Our C code prints 7 rows of Pascal's triangle. Log into the Access Point. Create a directory for this exercise and cd into it. Copy and paste the following code into a file named pascal.c . #include \"stdio.h\" long factorial ( int ); int main () { int i , n , c ; n = 7 ; for ( i = 0 ; i < n ; i ++ ){ for ( c = 0 ; c <= ( n - i - 2 ); c ++ ) printf ( \" \" ); for ( c = 0 ; c <= i ; c ++ ) printf ( \"%ld \" , factorial ( i ) / ( factorial ( c ) * factorial ( i - c ))); printf ( \" \\n \" ); } return 0 ; } long factorial ( int n ) { int c ; long result = 1 ; for ( c = 1 ; c <= n ; c ++ ) result = result * c ; return result ; }","title":"Our Software Example"},{"location":"materials/software/part5-ex4-compiling/#compiling","text":"In order to use this code in a job, we will first need to statically compile the code. Most linux servers (including our Access Point) have the gcc (GNU compiler collection) installed, so we already have a compiler on the Access Point. Furthermore, this is a simple piece of C code, so the compilation will not be computationally intensive. Thus, we should be able to compile directly on the Access Point Compile the code, using the command: username@login $ gcc -static pascal.c -o pascal Note that we have added the -static option to make sure that the compiled binary includes the necessary libraries. This will allow the code to run on any Linux machine, no matter where those libraries are located. Verify that the compiled binary was statically linked: username@login $ file pascal The Linux file command provides information about the type or kind of file that is given as an argument. 
In this case, you should get output like this: username@host $ file pascal pascal: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, not stripped The output clearly states that this executable (software) is statically linked. The same command run on a non-statically linked executable file would include the text dynamically linked (uses shared libs) instead. So with this simple verification step, which could even be run on files that you did not compile yourself, you have some further reassurance that it is safe to use on other Linux machines. (Bonus exercise: Try the file command on lots of other files)","title":"Compiling"},{"location":"materials/software/part5-ex4-compiling/#submit-the-job","text":"Now that our code is compiled, we can use it to submit a job. Think about what submit file lines we need to use to run this job: Are there input files? Are there command line arguments? Where is its output written? Based on what you thought about in 1., find a submit file from earlier that you can modify to run our compiled pascal code. Copy it to the directory with the pascal binary and make those changes. Submit the job using condor_submit . Once the job has run and left the queue, you should be able to see the results (seven rows of Pascal's triangle) in the .out file created by the job.","title":"Submit the Job"},{"location":"materials/special/part1-ex1-gpus/","text":"Exercise 1.1: GPUs \u00b6 Exploring Availability \u00b6 For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Let's first explore what GPUs are available in the OSPool. Remember that the pool is dynamic - resources are being added and removed all the time - but we can at least find out what the current set of GPUs looks like. Run: user@ap40 $ condor_status -const 'GPUs > 0' Once you have that list, pick one of the resources and look at the ClassAd using the -l flag. For example: user@ap40 $ condor_status -l [ MACHINE ] Using the -autoformat flag, explore the different attributes of the GPUs. Some interesting attributes might be GPUs_DeviceName , GPUs_Capability , GLIDEIN_Site and GLIDEIN_ResourceName . Compare the Mips number of a GPU slot with a regular slot. Does the Mips number indicate that GPUs can be much faster than CPUs? Why/why not? A sample GPU job \u00b6 Create a file named mytf.py and chmod it to be executable. The content is a sample TensorFlow script: #!/usr/bin/python3 # http://learningtensorflow.com/lesson10/ import sys import numpy as np import tensorflow as tf from datetime import datetime tf . debugging . set_log_device_placement ( True ) # Create some tensors a = tf . constant ([[ 1.0 , 2.0 , 3.0 ], [ 4.0 , 5.0 , 6.0 ]]) b = tf . constant ([[ 1.0 , 2.0 ], [ 3.0 , 4.0 ], [ 5.0 , 6.0 ]]) c = tf . matmul ( a , b ) print ( c ) Then, create a submit file to run the code on a GPU, using a TensorFlow container image. The new bits of the submit file are provided below, but you will have to fill in the rest from what you have learnt earlier in the User School. universe = container container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1 executable = mytf.py request_gpus = 1 Note that TensorFlow also requires the AVX2 CPU extensions. Remember that AVX2 is available in the x86_64-v3 and x86_64-v4 microarchitectures. 
Add a requirements line stating that Microarch has to be one of those two (the operator for or in ClassAd expressions is || ) Submit the job and watch the queue. Did the job start running as quickly as when we ran CPU jobs? Why/why not? Examine the out/err files. Does it indicate somewhere that the job was mapped to a GPU? (Hint: search for Created TensorFlow device ) Keep a copy of the out/err. Modify the submit file to not run on a GPU, and then try the job again. Did the job work? Does the err from the CPU job look anything like the GPU err?","title":"1.1 - GPUs"},{"location":"materials/special/part1-ex1-gpus/#exercise-11-gpus","text":"","title":"Exercise 1.1: GPUs"},{"location":"materials/special/part1-ex1-gpus/#exploring-availability","text":"For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in: $ ssh @ap40.uw.osg-htc.org Let's first explore what GPUs are available in the OSPool. Remember that the pool is dynamic - resources are being added and removed all the time - but we can at least find out what the current set of GPUs looks like. Run: user@ap40 $ condor_status -const 'GPUs > 0' Once you have that list, pick one of the resources and look at the ClassAd using the -l flag. For example: user@ap40 $ condor_status -l [ MACHINE ] Using the -autoformat flag, explore the different attributes of the GPUs. Some interesting attributes might be GPUs_DeviceName , GPUs_Capability , GLIDEIN_Site and GLIDEIN_ResourceName . Compare the Mips number of a GPU slot with a regular slot. Does the Mips number indicate that GPUs can be much faster than CPUs? Why/why not?","title":"Exploring Availability"},{"location":"materials/special/part1-ex1-gpus/#a-sample-gpu-job","text":"Create a file named mytf.py and chmod it to be executable. The content is a sample TensorFlow script: #!/usr/bin/python3 # http://learningtensorflow.com/lesson10/ import sys import numpy as np import tensorflow as tf from datetime import datetime tf . debugging . set_log_device_placement ( True ) # Create some tensors a = tf . constant ([[ 1.0 , 2.0 , 3.0 ], [ 4.0 , 5.0 , 6.0 ]]) b = tf . constant ([[ 1.0 , 2.0 ], [ 3.0 , 4.0 ], [ 5.0 , 6.0 ]]) c = tf . matmul ( a , b ) print ( c ) Then, create a submit file to run the code on a GPU, using a TensorFlow container image. The new bits of the submit file are provided below, but you will have to fill in the rest from what you have learnt earlier in the User School. universe = container container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1 executable = mytf.py request_gpus = 1 Note that TensorFlow also requires the AVX2 CPU extensions. Remember that AVX2 is available in the x86_64-v3 and x86_64-v4 microarchitectures. Add a requirements line stating that Microarch has to be one of those two (the operator for or in ClassAd expressions is || ) Submit the job and watch the queue. Did the job start running as quickly as when we ran CPU jobs? Why/why not? Examine the out/err files. Does it indicate somewhere that the job was mapped to a GPU? (Hint: search for Created TensorFlow device ) Keep a copy of the out/err. Modify the submit file to not run on a GPU, and then try the job again. Did the job work? 
Does the err from the CPU job look anything like the GPU err?","title":"A sample GPU job"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/","text":"OSG Exercise 2.1: Troubleshooting Jobs \u00b6 The goal of this exercise is to practice troubleshooting some common problems that you may encounter when submitting jobs using HTCondor. This exercise should work on either of the access points- OSPool or Path Facility Note: This exercise is a little harder than some others. To complete it, you will have to find and fix several issues. Be patient, keep trying, but if you really get stuck, you can ask for help or look at the very bottom of this page for a link to answers. But try not to look at the answers! Acquiring the Materials \u00b6 We have prepared some Python code, data, and submit files for this exercise: Log into an Access Point Download a tarball of the materials: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting.tar.gz Extract the tarball using the commands that you learned earlier Change into the newly extracted directory and explore its contents \u2014 resist the temptation to fix things right away! Solving a Project Euler Problem \u00b6 The contents of the tarball contain a series of submit files, Python scripts, and an input file that are designed to solve Project Euler problem 98 : By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter. Using p098_words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself). What is the largest square number formed by any member of such a pair? NOTE: All anagrams formed must be contained in the given text file. Unfortunately, there are many issues with the submit files that you will have to work through before you can obtain the solution to the problem! The code in the Python scripts themselves is, in theory, free of bugs. Finding anagrams \u00b6 The first step in our workflow takes an input file with a list of words ( p098_words.txt ) and extracts all of the anagrams using the find_anagrams.py script. Naturally, we want to run this as an HTCondor job, so: Submit the accompanying find-anagrams.sub file from the tarball. Resolve any issues that you encounter until the job returns pairs of anagrams as its output. Once you have satisfactory output, move onto the next section. Please be polite Access points are shared resources, so you should clean up after yourself. If you discover any jobs in the Hold state, and after you are done troubleshooting them, remove them with the following command: user@server $ condor_rm -const 'JobStatus =?= 5' Where replacing with... Will remove... Your username (e.g. blin ) All of your held jobs A cluster ID (e.g. 74078 ) All held jobs matching the given cluster ID A job ID (e.g. 97932.30 ) That specific held job Finding the largest square \u00b6 The next step in the workflow uses the max_square.py script to find the largest square number, if any, for a given anagram word pair. 
Let's submit jobs that run max_square.py for all of the anagram word pairs (i.e., one job per word pair), that you found in the previous section: Submit the accompanying squares.sub file from the tarball Resolve any issues that you encounter until you receive output for each job. Note that some jobs may have empty output since not all anagram word pairs are square anagram word pairs. Next, you can find the largest square among your output by directly using the command line. For example, if all of your job output has been placed in the squares directory and are named square-1.out , square-2.out , etc., then you could run the following command to find the largest square: user@server $ cat squares/square-*.out | sort -n | tail -n 1 You can check if you have the right answer with any of the OSG staff or by submitting the answer to Project Euler (requires an account). Answer Key \u00b6 There is also a working solution on our web server that can be retrieved with user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting-key.tar.gz It contains comments labeled SOLUTION that you can consult in case you get stuck. Like any answer key, it is mainly useful as a verification tool, so try to only use it as a last resort or for detailed explanations to improve your understanding.","title":"1.1 - Troubleshooting Jobs"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#osg-exercise-21-troubleshooting-jobs","text":"The goal of this exercise is to practice troubleshooting some common problems that you may encounter when submitting jobs using HTCondor. This exercise should work on either of the access points- OSPool or Path Facility Note: This exercise is a little harder than some others. To complete it, you will have to find and fix several issues. Be patient, keep trying, but if you really get stuck, you can ask for help or look at the very bottom of this page for a link to answers. But try not to look at the answers!","title":"OSG Exercise 2.1: Troubleshooting Jobs"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#acquiring-the-materials","text":"We have prepared some Python code, data, and submit files for this exercise: Log into an Access Point Download a tarball of the materials: user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting.tar.gz Extract the tarball using the commands that you learned earlier Change into the newly extracted directory and explore its contents \u2014 resist the temptation to fix things right away!","title":"Acquiring the Materials"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#solving-a-project-euler-problem","text":"The contents of the tarball contain a series of submit files, Python scripts, and an input file that are designed to solve Project Euler problem 98 : By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter. Using p098_words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself). What is the largest square number formed by any member of such a pair? 
NOTE: All anagrams formed must be contained in the given text file. Unfortunately, there are many issues with the submit files that you will have to work through before you can obtain the solution to the problem! The code in the Python scripts themselves is, in theory, free of bugs.","title":"Solving a Project Euler Problem"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#finding-anagrams","text":"The first step in our workflow takes an input file with a list of words ( p098_words.txt ) and extracts all of the anagrams using the find_anagrams.py script. Naturally, we want to run this as an HTCondor job, so: Submit the accompanying find-anagrams.sub file from the tarball. Resolve any issues that you encounter until the job returns pairs of anagrams as its output. Once you have satisfactory output, move onto the next section. Please be polite Access points are shared resources, so you should clean up after yourself. If you discover any jobs in the Hold state, and after you are done troubleshooting them, remove them with the following command: user@server $ condor_rm -const 'JobStatus =?= 5' Where replacing with... Will remove... Your username (e.g. blin ) All of your held jobs A cluster ID (e.g. 74078 ) All held jobs matching the given cluster ID A job ID (e.g. 97932.30 ) That specific held job","title":"Finding anagrams"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#finding-the-largest-square","text":"The next step in the workflow uses the max_square.py script to find the largest square number, if any, for a given anagram word pair. Let's submit jobs that run max_square.py for all of the anagram word pairs (i.e., one job per word pair), that you found in the previous section: Submit the accompanying squares.sub file from the tarball Resolve any issues that you encounter until you receive output for each job. Note that some jobs may have empty output since not all anagram word pairs are square anagram word pairs. Next, you can find the largest square among your output by directly using the command line. For example, if all of your job output has been placed in the squares directory and are named square-1.out , square-2.out , etc., then you could run the following command to find the largest square: user@server $ cat squares/square-*.out | sort -n | tail -n 1 You can check if you have the right answer with any of the OSG staff or by submitting the answer to Project Euler (requires an account).","title":"Finding the largest square"},{"location":"materials/troubleshooting/part1-ex1-troubleshooting/#answer-key","text":"There is also a working solution on our web server that can be retrieved with user@server $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/troubleshooting-key.tar.gz It contains comments labeled SOLUTION that you can consult in case you get stuck. Like any answer key, it is mainly useful as a verification tool, so try to only use it as a last resort or for detailed explanations to improve your understanding.","title":"Answer Key"},{"location":"materials/troubleshooting/part1-ex2-job-retry/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Exercise 1.2: Retries \u00b6 The goal of this exercise is to demonstrate running a job that intermittently fails and thus could benefit from having HTCondor automatically retry it. This first part of the exercise should take only a few minutes, and is designed to setup the next exercises. 
Bad Job \u00b6 Let\u2019s assume that a colleague has shared with you a program, and it fails once in a while. In the real world, we would probably just fix the program, but what if you cannot change the software? Unfortunately, this situation happens more often than we would like. Below is a Python script that fails once in a while. We will not fix it, but instead use it to simulate a program that can fail and that we cannot fix. #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate a runtime error if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. In a new directory for this exercise, save the script above as murphy.py . Write a submit file for the script; queue 20 instances of the job and be sure to ask for 20 MB of memory and disk. Submit the file, note the ClusterId, and wait for the jobs to finish. What output do you expect? What output did you get? If you are curious about the exit code from the job, it is saved in completed jobs in condor_history in the ExitCode attribute. The following command will show the ExitCode for a given cluster of jobs: user@server $ condor_history -af:h ProcId ExitCode (Be sure to replace with your actual cluster ID. The command may take a minute or so to complete.) How many of the jobs succeeded? How many failed? Retrying Failed Jobs \u00b6 Now let\u2019s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption. HTCondor has a feature named max_retries that allows it to retry any job with a non-zero exit code up to 5 times. Try implementing this feature, then resubmit the jobs. Did your change work? After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the NumJobStarts attribute of a completed job with the condor_history command, in the same way you looked at the ExitCode attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their ExitCode ; and what about the ExitCode from earlier execution attempts? A (Too) Long Running Job \u00b6 Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file: #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate an \"infinite\" loop if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output time . sleep ( 3600 ) sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. Save the script to a new file named murphy2.py . 
Copy your previous submit file to a new name and change the executable to murphy2.py . If you like, submit the new file \u2014 but after a while be sure to remove the whole cluster to clear out the \u201chung\u201d jobs. Now try to change the submit file to automatically remove any jobs that run for more than one minute. You can make this change with just a single line in your submit file periodic_remove = (JobStatus == 2) && ( (CurrentTime - EnteredCurrentStatus) > 60 ) Submit the new file. Do the long running jobs get removed? What does condor_history show for the cluster after all jobs are done? Which job status (i.e. idle, held, running) do you think JobStatus == 2 corresponds to? Bonus Exercise \u00b6 If you have time, edit your submit file so that instead of removing long running jobs, HTCondor will automatically put the long-running job on hold, and then automatically release it.","title":"1.2 - Job Retry"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#exercise-12-retries","text":"The goal of this exercise is to demonstrate running a job that intermittently fails and thus could benefit from having HTCondor automatically retry it. This first part of the exercise should take only a few minutes, and is designed to set up the next exercises.","title":"Exercise 1.2: Retries"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#bad-job","text":"Let\u2019s assume that a colleague has shared with you a program, and it fails once in a while. In the real world, we would probably just fix the program, but what if you cannot change the software? Unfortunately, this situation happens more often than we would like. Below is a Python script that fails once in a while. We will not fix it, but instead use it to simulate a program that can fail and that we cannot fix. #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate a runtime error if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. In a new directory for this exercise, save the script above as murphy.py . Write a submit file for the script; queue 20 instances of the job and be sure to ask for 20 MB of memory and disk. Submit the file, note the ClusterId, and wait for the jobs to finish. What output do you expect? What output did you get? If you are curious about the exit code from the job, it is saved in completed jobs in condor_history in the ExitCode attribute. The following command will show the ExitCode for a given cluster of jobs: user@server $ condor_history -af:h ProcId ExitCode (Be sure to replace with your actual cluster ID. The command may take a minute or so to complete.) How many of the jobs succeeded? How many failed?","title":"Bad Job"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#retrying-failed-jobs","text":"Now let\u2019s see if we can solve the problem of jobs that fail once in a while. In this particular case, if HTCondor runs a failed job again, it has a good chance of succeeding. Not all failing jobs are like this, but in this case it is a reasonable assumption. HTCondor has a feature named max_retries that allows it to retry any job with a non-zero exit code up to 5 times. Try implementing this feature, then resubmit the jobs. 
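A minimal sketch of what this can look like, added to the existing submit file before the queue statement (max_retries is a standard HTCondor submit command): max_retries = 5 With that single line, HTCondor treats a non-zero exit code as a failure and re-runs the job, up to 5 retries, before giving up. 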
Did your change work? After the jobs have finished, examine the log file(s) to see what happened in detail. Did any jobs need to be restarted? Another way to see how many restarts there were is to look at the NumJobStarts attribute of a completed job with the condor_history command, in the same way you looked at the ExitCode attribute earlier. Does the number of retries seem correct? For those jobs which did need to be retried, what is their ExitCode ; and what about the ExitCode from earlier execution attempts?","title":"Retrying Failed Jobs"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#a-too-long-running-job","text":"Sometimes, an ill-behaved job will get stuck in a loop and run forever, instead of exiting with a failure code, and it may just need to be re-run (or run on a different execute server) to complete without getting stuck. We can modify our Python program to simulate this kind of bad job with the following file: #!/usr/bin/env python3 # murphy.py simulates a real program with real problems import random import sys import time # For one out of every three attempts, simulate an \"infinite\" loop if random . randint ( 0 , 2 ) == 0 : # Intentionally don't print any output time . sleep ( 3600 ) sys . exit ( 15 ) else : time . sleep ( 3 ) print ( \"All work done correctly\" ) # By convention, zero exit code means success sys . exit ( 0 ) Let\u2019s see what happens when a program like this one is run in HTCondor. Save the script to a new file named murphy2.py . Copy your previous submit file to a new name and change the executable to murphy2.py . If you like, submit the new file \u2014 but after a while be sure to remove the whole cluster to clear out the \u201chung\u201d jobs. Now try to change the submit file to automatically remove any jobs that run for more than one minute. You can make this change with just a single line in your submit file periodic_remove = (JobStatus == 2) && ( (CurrentTime - EnteredCurrentStatus) > 60 ) Submit the new file. Do the long running jobs get removed? What does condor_history show for the cluster after all jobs are done? Which job status (i.e. idle, held, running) do you think JobStatus == 2 corresponds to?","title":"A (Too) Long Running Job"},{"location":"materials/troubleshooting/part1-ex2-job-retry/#bonus-exercise","text":"If you have time, edit your submit file so that instead of removing long running jobs, HTCondor will automatically put the long-running job on hold, and then automatically release it.","title":"Bonus Exercise"},{"location":"materials/workflows/part1-ex1-simple-dag/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Workflows Exercise 1.1: Coordinating a Set of Jobs With a Simple DAG \u00b6 The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job. What is DAGMan? \u00b6 In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after the sweep runs you need to collate the results. Assuming you want to sweep over five parameters, this might look like a small fan-out/fan-in DAG: one node to prepare the data, five sweep nodes that can run at the same time, and one final node to collate the results (a sketch of such a DAG file is shown below). DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found in the HTCondor manual . 
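As a concrete sketch of that five-parameter sweep (the node names and submit file names here are made up for illustration), the DAG file could look like: JOB prepare prepare.sub JOB sweep1 sweep1.sub JOB sweep2 sweep2.sub JOB sweep3 sweep3.sub JOB sweep4 sweep4.sub JOB sweep5 sweep5.sub JOB collate collate.sub PARENT prepare CHILD sweep1 sweep2 sweep3 sweep4 sweep5 PARENT sweep1 sweep2 sweep3 sweep4 sweep5 CHILD collate The PARENT ... CHILD lines encode the ordering; the five sweep nodes have no dependencies among themselves, so DAGMan is free to run them at the same time. 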
Submitting a Simple DAG \u00b6 For our job, we will return briefly to the sleep program, name it job.sub executable = /bin/sleep arguments = 4 log = simple.log output = simple.out error = simple.error request_memory = 1GB request_disk = 1GB request_cpus = 1 queue We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does. First we will create the most minimal DAG that can be created: a DAG with just one node. Put this into a file named simple.dag . JOB Simple job.sub In your first window, submit the DAG: username@ap40 $ condor_submit_dag simple.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : simple.dag.condor.sub Log of DAGMan debugging messages : simple.dag.dagman.out Log of Condor library output : simple.dag.lib.out Log of Condor library error messages : simple.dag.lib.err Log of the life of condor_dagman itself : simple.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 61. ----------------------------------------------------------------------- In the second window, check the queue (what you see may be slightly different): username@ap40 $ condor_q -nobatch -wide:80 -- Submitter: learn.chtc.wisc.edu : <128.104.100.55:9618?sock=28867_10e4_2> : learn.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 61.0 roy 6/21 22:51 0+00:03:47 R 0 0.3 condor_dagman 62.0 roy 6/21 22:51 0+00:00:03 R 0 0.7 simple 4 10 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended In the third window, watch what DAGMan does (what you see may be slightly different): username@ap40 $ tail -f --lines = 500 simple.dag.dagman.out 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP 08/02/24 15:44:57 ** /usr/bin/condor_dagman 08/02/24 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) 08/02/24 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT 08/02/24 15:44:57 ** $CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ 08/02/24 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ 08/02/24 15:44:57 ** PID = 2340103 08/02/24 15:44:57 ** Log last touched time unavailable (No such file or directory) 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS 08/02/24 15:44:57 DaemonCore: No command port requested. 
08/02/24 15:44:57 DAGMAN_USE_STRICT setting: 1 08/02/24 15:44:57 DAGMAN_VERBOSITY setting: 3 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 08/02/24 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 08/02/24 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False 08/02/24 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 08/02/24 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False 08/02/24 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 08/02/24 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 08/02/24 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 08/02/24 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True 08/02/24 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 08/02/24 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True 08/02/24 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False 08/02/24 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 08/02/24 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 08/02/24 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True 08/02/24 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False 08/02/24 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False 08/02/24 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit 08/02/24 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False 08/02/24 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True 08/02/24 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 08/02/24 15:44:57 DAGMAN_AUTO_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 08/02/24 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log 08/02/24 15:44:57 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True 08/02/24 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 08/02/24 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 08/02/24 15:44:57 ALL_DEBUG setting: 08/02/24 15:44:57 DAGMAN_DEBUG setting: 08/02/24 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False 08/02/24 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True 08/02/24 15:44:57 DAGMAN will adjust edges after parsing 08/02/24 15:44:57 argv[0] == \"condor_scheduniv_exec.271100.0\" 08/02/24 15:44:57 argv[1] == \"-Lockfile\" 08/02/24 15:44:57 argv[2] == \"simple.dag.lock\" 08/02/24 15:44:57 argv[3] == \"-AutoRescue\" 08/02/24 15:44:57 argv[4] == \"1\" 08/02/24 15:44:57 argv[5] == \"-DoRescueFrom\" 08/02/24 15:44:57 argv[6] == \"0\" 08/02/24 15:44:57 argv[7] == \"-Dag\" 08/02/24 15:44:57 argv[8] == \"simple.dag\" 08/02/24 15:44:57 argv[9] == \"-Suppress_notification\" 08/02/24 15:44:57 argv[10] == \"-CsdVersion\" 08/02/24 15:44:57 argv[11] == \"$CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $\" 08/02/24 15:44:57 argv[12] == \"-Dagman\" 08/02/24 15:44:57 argv[13] == \"/usr/bin/condor_dagman\" 08/02/24 15:44:57 Default node log file is: 08/02/24 15:44:57 DAG Lockfile will be written to simple.dag.lock 08/02/24 15:44:57 DAG Input file is simple.dag 08/02/24 15:44:57 Parsing 1 dagfiles 08/02/24 15:44:57 Parsing simple.dag ... 08/02/24 15:44:57 Adjusting edges 08/02/24 15:44:57 Dag contains 1 total jobs 08/02/24 15:44:57 Bootstrapping... 
08/02/24 15:44:57 Number of pre-completed nodes: 0 08/02/24 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log 08/02/24 15:44:57 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:44:57 Of 1 nodes total: 08/02/24 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:44:57 === === === === === === === === 08/02/24 15:44:57 0 0 0 0 1 0 0 0 08/02/24 15:44:57 0 job proc(s) currently held 08/02/24 15:44:57 Registering condor_event_timer... 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... Here's where the job is submitted 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... 08/02/24 15:44:58 Submitting node Simple from file job.sub using direct job submission 08/02/24 15:44:58 assigned HTCondor ID (271101.0.0) 08/02/24 15:44:58 Just submitted 1 job this cycle... Here's where DAGMan noticed that the job is running 08/02/24 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:14} 08/02/24 15:45:18 Number of idle job procs: 0 Here's where DAGMan noticed that the job finished. 08/02/24 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:19} 08/02/24 15:45:23 Number of idle job procs: 0 08/02/24 15:45:23 Node Simple job proc (271101.0.0) completed successfully. 08/02/24 15:45:23 Node Simple job completed 08/02/24 15:45:23 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:45:23 Of 1 nodes total: 08/02/24 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:45:23 === === === === === === === === 08/02/24 15:45:23 1 0 0 0 0 0 0 0 Here's where DAGMan noticed that all the work is done. 08/02/24 15:45:23 All jobs Completed! 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) 08/02/24 15:45:23 Note: 0 total job deferrals because of node category throttles 08/02/24 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER Now verify your results: username@ap40 $ cat simple.log 000 (271101.000.000) 2024-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> DAG Node: Simple ... 040 (271101.000.000) 2024-08-02 15:45:13 Started transferring input files Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> ... 040 (271101.000.000) 2024-08-02 15:45:13 Finished transferring input files ... 021 (271101.000.000) 2024-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 ... 
001 (271101.000.000) 2024-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu CondorScratchDir = \"/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113\" Cpus = 1 Disk = 2699079 GLIDEIN_ResourceName = \"PSU-LIGO\" Memory = 1024 ... 006 (271101.000.000) 2024-08-02 15:45:19 Image size of job updated: 2296464 47 - MemoryUsage of job (MB) 47684 - ResidentSetSize of job (KB) ... 040 (271101.000.000) 2024-08-02 15:45:19 Started transferring output files ... 040 (271101.000.000) 2024-08-02 15:45:19 Finished transferring output files ... 005 (271101.000.000) 2024-08-02 15:45:19 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 38416 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 38416 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 149 1048576 2699079 Memory (MB) : 47 1024 1024 Job terminated of its own accord at 2024-08-02T20:45:19Z with exit-code 0. ... Looking at DAGMan's various files, we see that DAGMan itself ran as a job (specifically, a \"scheduler\" universe job). username@ap40 $ ls simple.dag.* simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out username@ap40 $ cat simple.dag.condor.sub # Filename: simple.dag.condor.sub # Generated by condor_submit_dag simple.dag universe = scheduler executable = /usr/bin/condor_dagman getenv = CONDOR_CONFIG,_CONDOR_*,PATH,PYTHONPATH,PERL*,PEGASUS_*,TZ,HOME,USER,LANG,LC_ALL output = simple.dag.lib.out error = simple.dag.lib.err log = simple.dag.dagman.log remove_kill_sig = SIGUSR1 +OtherJobRemoveRequirements = \"DAGManJobId =?= $(cluster)\" # Note: default on_exit_remove expression: # ( ExitSignal = ? = 11 || ( ExitCode = ! = UNDEFINED && ExitCode > = 0 && ExitCode < = 2 )) # attempts to ensure that DAGMan is automatically # requeued by the schedd if it exits abnormally or # is killed ( e.g., during a reboot ) . on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False arguments = \"-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2024-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman\" environment = \"_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad\" queue If you want to clean up some of these files (you may not want to, at least not yet), run: username@ap40 $ rm simple.dag.* Challenge \u00b6 What is the scheduler universe? Why does DAGMan use it? Show hint HTCondor has several universes What would happen to your DAGMan workflow if the access point has to be rebooted? 
Jobs in the HTCondor queue are \"managed\" - they are always tracked, and restarted automatically if needed","title":"1.1 - A simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#workflows-exercise-11-coordinating-a-set-of-jobs-with-a-simple-dag","text":"The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job.","title":"Workflows Exercise 1.1: Coordinating a Set of Jobs With a Simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#what-is-dagman","text":"In short, DAGMan lets you submit complex sequences of jobs as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after the sweep runs you need to collate the results. Assuming you want to sweep over five parameters, this might look like a small fan-out/fan-in DAG: one node to prepare the data, five sweep nodes that can run at the same time, and one final node to collate the results. DAGMan has many abilities, such as throttling jobs, recovery from failures, and more. More information about DAGMan can be found in the HTCondor manual .","title":"What is DAGMan?"},{"location":"materials/workflows/part1-ex1-simple-dag/#submitting-a-simple-dag","text":"For our job, we will return briefly to the sleep program. Name the submit file job.sub : executable = /bin/sleep arguments = 4 log = simple.log output = simple.out error = simple.error request_memory = 1GB request_disk = 1GB request_cpus = 1 queue We are going to get a bit more sophisticated in submitting our jobs now. Let's have three windows open. In one window, you'll submit the job. In another you will watch the queue, and in the third you will watch what DAGMan does. First we will create the most minimal DAG that can be created: a DAG with just one node. Put this into a file named simple.dag . JOB Simple job.sub In your first window, submit the DAG: username@ap40 $ condor_submit_dag simple.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : simple.dag.condor.sub Log of DAGMan debugging messages : simple.dag.dagman.out Log of Condor library output : simple.dag.lib.out Log of Condor library error messages : simple.dag.lib.err Log of the life of condor_dagman itself : simple.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 61. 
----------------------------------------------------------------------- In the second window, check the queue (what you see may be slightly different): username@ap40 $ condor_q -nobatch -wide:80 -- Submitter: learn.chtc.wisc.edu : <128.104.100.55:9618?sock=28867_10e4_2> : learn.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 61.0 roy 6/21 22:51 0+00:03:47 R 0 0.3 condor_dagman 62.0 roy 6/21 22:51 0+00:00:03 R 0 0.7 simple 4 10 2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended In the third window, watch what DAGMan does (what you see may be slightly different): username@ap40 $ tail -f --lines = 500 simple.dag.dagman.out 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 ** condor_scheduniv_exec.271100.0 (CONDOR_DAGMAN) STARTING UP 08/02/24 15:44:57 ** /usr/bin/condor_dagman 08/02/24 15:44:57 ** SubsystemInfo: name=DAGMAN type=DAGMAN(9) class=CLIENT(2) 08/02/24 15:44:57 ** Configuration: subsystem:DAGMAN local: class:CLIENT 08/02/24 15:44:57 ** $CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $ 08/02/24 15:44:57 ** $CondorPlatform: x86_64_AlmaLinux8 $ 08/02/24 15:44:57 ** PID = 2340103 08/02/24 15:44:57 ** Log last touched time unavailable (No such file or directory) 08/02/24 15:44:57 ****************************************************** 08/02/24 15:44:57 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS 08/02/24 15:44:57 DaemonCore: No command port requested. 08/02/24 15:44:57 DAGMAN_USE_STRICT setting: 1 08/02/24 15:44:57 DAGMAN_VERBOSITY setting: 3 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880 08/02/24 15:44:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DELAY setting: 0 08/02/24 15:44:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6 08/02/24 15:44:57 DAGMAN_STARTUP_CYCLE_DETECT setting: False 08/02/24 15:44:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 100 08/02/24 15:44:57 DAGMAN_AGGRESSIVE_SUBMIT setting: False 08/02/24 15:44:57 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5 08/02/24 15:44:57 DAGMAN_QUEUE_UPDATE_INTERVAL setting: 300 08/02/24 15:44:57 DAGMAN_DEFAULT_PRIORITY setting: 0 08/02/24 15:44:57 DAGMAN_SUPPRESS_NOTIFICATION setting: True 08/02/24 15:44:57 allow_events (DAGMAN_ALLOW_EVENTS) setting: 114 08/02/24 15:44:57 DAGMAN_RETRY_SUBMIT_FIRST setting: True 08/02/24 15:44:57 DAGMAN_RETRY_NODE_FIRST setting: False 08/02/24 15:44:57 DAGMAN_MAX_JOBS_IDLE setting: 1000 08/02/24 15:44:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0 08/02/24 15:44:57 DAGMAN_MAX_PRE_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_POST_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MAX_HOLD_SCRIPTS setting: 20 08/02/24 15:44:57 DAGMAN_MUNGE_NODE_NAMES setting: True 08/02/24 15:44:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: False 08/02/24 15:44:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: False 08/02/24 15:44:57 DAGMAN_ALWAYS_RUN_POST setting: False 08/02/24 15:44:57 DAGMAN_CONDOR_SUBMIT_EXE setting: /usr/bin/condor_submit 08/02/24 15:44:57 DAGMAN_USE_DIRECT_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_APPEND_VARS setting: False 08/02/24 15:44:57 DAGMAN_ABORT_DUPLICATES setting: True 08/02/24 15:44:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True 08/02/24 15:44:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600 08/02/24 15:44:57 DAGMAN_AUTO_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_MAX_RESCUE_NUM setting: 100 08/02/24 15:44:57 DAGMAN_WRITE_PARTIAL_RESCUE setting: True 08/02/24 15:44:57 DAGMAN_DEFAULT_NODE_LOG setting: @(DAG_DIR)/@(DAG_FILE).nodes.log 08/02/24 15:44:57 
DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True 08/02/24 15:44:57 DAGMAN_MAX_JOB_HOLDS setting: 100 08/02/24 15:44:57 DAGMAN_HOLD_CLAIM_TIME setting: 20 08/02/24 15:44:57 ALL_DEBUG setting: 08/02/24 15:44:57 DAGMAN_DEBUG setting: 08/02/24 15:44:57 DAGMAN_SUPPRESS_JOB_LOGS setting: False 08/02/24 15:44:57 DAGMAN_REMOVE_NODE_JOBS setting: True 08/02/24 15:44:57 DAGMAN will adjust edges after parsing 08/02/24 15:44:57 argv[0] == \"condor_scheduniv_exec.271100.0\" 08/02/24 15:44:57 argv[1] == \"-Lockfile\" 08/02/24 15:44:57 argv[2] == \"simple.dag.lock\" 08/02/24 15:44:57 argv[3] == \"-AutoRescue\" 08/02/24 15:44:57 argv[4] == \"1\" 08/02/24 15:44:57 argv[5] == \"-DoRescueFrom\" 08/02/24 15:44:57 argv[6] == \"0\" 08/02/24 15:44:57 argv[7] == \"-Dag\" 08/02/24 15:44:57 argv[8] == \"simple.dag\" 08/02/24 15:44:57 argv[9] == \"-Suppress_notification\" 08/02/24 15:44:57 argv[10] == \"-CsdVersion\" 08/02/24 15:44:57 argv[11] == \"$CondorVersion: 10.7.0 2024-07-10 BuildID: 659788 PackageID: 10.7.0-0.659788 RC $\" 08/02/24 15:44:57 argv[12] == \"-Dagman\" 08/02/24 15:44:57 argv[13] == \"/usr/bin/condor_dagman\" 08/02/24 15:44:57 Default node log file is: 08/02/24 15:44:57 DAG Lockfile will be written to simple.dag.lock 08/02/24 15:44:57 DAG Input file is simple.dag 08/02/24 15:44:57 Parsing 1 dagfiles 08/02/24 15:44:57 Parsing simple.dag ... 08/02/24 15:44:57 Adjusting edges 08/02/24 15:44:57 Dag contains 1 total jobs 08/02/24 15:44:57 Bootstrapping... 08/02/24 15:44:57 Number of pre-completed nodes: 0 08/02/24 15:44:57 MultiLogFiles: truncating log file /home/mats.rynge/dagman-1/./simple.dag.nodes.log 08/02/24 15:44:57 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:44:57 Of 1 nodes total: 08/02/24 15:44:57 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:44:57 === === === === === === === === 08/02/24 15:44:57 0 0 0 0 1 0 0 0 08/02/24 15:44:57 0 job proc(s) currently held 08/02/24 15:44:57 Registering condor_event_timer... 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... Here's where the job is submitted 08/02/24 15:44:58 Submitting HTCondor Node Simple job(s)... 08/02/24 15:44:58 Submitting node Simple from file job.sub using direct job submission 08/02/24 15:44:58 assigned HTCondor ID (271101.0.0) 08/02/24 15:44:58 Just submitted 1 job this cycle... Here's where DAGMan noticed that the job is running 08/02/24 15:45:18 Event: ULOG_EXECUTE for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:14} 08/02/24 15:45:18 Number of idle job procs: 0 Here's where DAGMan noticed that the job finished. 08/02/24 15:45:23 Event: ULOG_JOB_TERMINATED for HTCondor Node Simple (271101.0.0) {08/02/24 15:45:19} 08/02/24 15:45:23 Number of idle job procs: 0 08/02/24 15:45:23 Node Simple job proc (271101.0.0) completed successfully. 08/02/24 15:45:23 Node Simple job completed 08/02/24 15:45:23 DAG status: 0 (DAG_STATUS_OK) 08/02/24 15:45:23 Of 1 nodes total: 08/02/24 15:45:23 Done Pre Queued Post Ready Un-Ready Failed Futile 08/02/24 15:45:23 === === === === === === === === 08/02/24 15:45:23 1 0 0 0 0 0 0 0 Here's where DAGMan noticed that all the work is done. 08/02/24 15:45:23 All jobs Completed! 
08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxJobs limit (0) 08/02/24 15:45:23 Note: 0 total job deferrals because of -MaxIdle limit (1000) 08/02/24 15:45:23 Note: 0 total job deferrals because of node category throttles 08/02/24 15:45:23 Note: 0 total PRE script deferrals because of -MaxPre limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total POST script deferrals because of -MaxPost limit (20) or DEFER 08/02/24 15:45:23 Note: 0 total HOLD script deferrals because of -MaxHold limit (20) or DEFER Now verify your results: username@ap40 $ cat simple.log 000 (271101.000.000) 2024-08-02 15:44:58 Job submitted from host: <128.105.68.92:9618?addrs=128.105.68.92-9618+[2607-f388-2200-100-eaeb-d3ff-fe40-111c]-9618&alias=ap40.uw.osg-htc.org&noUDP&sock=schedd_35391_dc5c> DAG Node: Simple ... 040 (271101.000.000) 2024-08-02 15:45:13 Started transferring input files Transferring to host: <10.136.81.233:37425?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector4#23067238%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b6]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1512850&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-37425&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> ... 040 (271101.000.000) 2024-08-02 15:45:13 Finished transferring input files ... 021 (271101.000.000) 2024-08-02 15:45:14 Warning from starter on slot1_4@glidein_2635188_104012775@comp-cc-0463.gwave.ics.psu.edu: PREPARE_JOB (prepare-hook) succeeded (reported status 000): Using default Singularity image /cvmfs/singularity.opensciencegrid.org/htc/rocky:8-cuda-11.0.3 ... 001 (271101.000.000) 2024-08-02 15:45:14 Job executing on host: <10.136.81.233:39645?CCBID=128.104.103.162:9618%3faddrs%3d128.104.103.162-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector10#1506459%20192.170.231.9:9618%3faddrs%3d192.170.231.9-9618+[fd85-ee78-d8a6-8607--1-73b4]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector10#1506644&PrivNet=comp-cc-0463.gwave.ics.psu.edu&addrs=10.136.81.233-39645&alias=comp-cc-0463.gwave.ics.psu.edu&noUDP> SlotName: slot1_4@comp-cc-0463.gwave.ics.psu.edu CondorScratchDir = \"/localscratch/condor/execute/dir_2635172/glide_uZ6qXM/execute/dir_3252113\" Cpus = 1 Disk = 2699079 GLIDEIN_ResourceName = \"PSU-LIGO\" Memory = 1024 ... 006 (271101.000.000) 2024-08-02 15:45:19 Image size of job updated: 2296464 47 - MemoryUsage of job (MB) 47684 - ResidentSetSize of job (KB) ... 040 (271101.000.000) 2024-08-02 15:45:19 Started transferring output files ... 040 (271101.000.000) 2024-08-02 15:45:19 Finished transferring output files ... 005 (271101.000.000) 2024-08-02 15:45:19 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 38416 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 38416 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 149 1048576 2699079 Memory (MB) : 47 1024 1024 Job terminated of its own accord at 2024-08-02T20:45:19Z with exit-code 0. ... Looking at DAGMan's various files, we see that DAGMan itself ran as a job (specifically, a \"scheduler\" universe job). 
username@ap40 $ ls simple.dag.* simple.dag.condor.sub simple.dag.dagman.log simple.dag.dagman.out simple.dag.lib.err simple.dag.lib.out username@ap40 $ cat simple.dag.condor.sub # Filename: simple.dag.condor.sub # Generated by condor_submit_dag simple.dag universe = scheduler executable = /usr/bin/condor_dagman getenv = CONDOR_CONFIG,_CONDOR_*,PATH,PYTHONPATH,PERL*,PEGASUS_*,TZ,HOME,USER,LANG,LC_ALL output = simple.dag.lib.out error = simple.dag.lib.err log = simple.dag.dagman.log remove_kill_sig = SIGUSR1 +OtherJobRemoveRequirements = \"DAGManJobId =?= $(cluster)\" # Note: default on_exit_remove expression: # ( ExitSignal = ? = 11 || ( ExitCode = ! = UNDEFINED && ExitCode > = 0 && ExitCode < = 2 )) # attempts to ensure that DAGMan is automatically # requeued by the schedd if it exits abnormally or # is killed ( e.g., during a reboot ) . on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False arguments = \"-p 0 -f -l . -Lockfile simple.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag simple.dag -Suppress_notification -CsdVersion $CondorVersion:' '10.7.0' '2024-07-10' 'BuildID:' '659788' 'PackageID:' '10.7.0-0.659788' 'RC' '$ -Dagman /usr/bin/condor_dagman\" environment = \"_CONDOR_DAGMAN_LOG=simple.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address _CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad\" queue If you want to clean up some of these files (you may not want to, at least not yet), run: username@ap40 $ rm simple.dag.*","title":"Submitting a Simple DAG"},{"location":"materials/workflows/part1-ex1-simple-dag/#challenge","text":"What is the scheduler universe? Why does DAGMan use it? Show hint HTCondor has several universes What would happen to your DAGMan workflow if the access point has to be rebooted? Jobs in the HTCondor queue are \"managed\" - they are always tracked, and restarted automatically if needed","title":"Challenge"},{"location":"materials/workflows/part1-ex2-mandelbrot/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Workflows Exercise 1.2: A Brief Detour Through the Mandelbrot Set \u00b6 Before we explore using DAGs to implement workflows, let\u2019s get a more interesting job. Let\u2019s make pretty pictures! We have a small program that draws pictures of the Mandelbrot set. You can read about the Mandelbrot set on Wikipedia , or you can simply appreciate the pretty pictures. It\u2019s a fractal. We have a simple program that can draw the Mandelbrot set. It's called goatbrot . Before beginning, ensure that you are connected to ap40.uw.osg-htc.org . Create a directory for this exercise and cd into it. Running goatbrot From the Command Line \u00b6 You can generate the Mandelbrot set as a quick test with two simple commands. Download the goatbrot executable: username@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/goatbrot username@ap40 $ chmod a+x goatbrot Generate a PPM image of the Mandelbrot set: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c 0 ,0 -w 3 -s 1000 ,1000 The goatbroat program takes several parameters. Let's break them down: -i 1000 The number of iterations. Bigger numbers generate more accurate images but are slower to run. -o tile_000000_000000.ppm The output file to generate. -c 0,0 The center point of the image. Here it is the point (0,0). -w 3 The width of the image. Here is 3. 
-s 1000,1000 The size of the final image. Here we generate a picture that is 1000 pixels wide and 1000 pixels tall. Convert the image to the JPEG format (using a built-in program called convert ): username@ap40 $ convert tile_000000_000000.ppm mandel.jpg Dividing the Work into Smaller Pieces \u00b6 The Mandelbrot set can take a while to create, particularly if you make the iterations large or the image size large. What if we broke the creation of the image into multiple invocations (an HTC approach!) then stitched them together? Once we do that, we can run each goatbrot in parallel in our cluster. Here's an example you can run by hand. Run goatbrot 4 times: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c -0.75,0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000001.ppm -c 0.75,0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000000.ppm -c -0.75,-0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000001.ppm -c 0.75,-0.75 -w 1.5 -s 500,500 Stitch the small images together into the complete image (in JPEG format): username@ap40 $ montage tile_000000_000000.ppm tile_000000_000001.ppm tile_000001_000000.ppm tile_000001_000001.ppm -mode Concatenate -tile 2x2 mandel.jpg This will produce the same image as above. We divided the image space into a 2\u00d72 grid and ran goatbrot on each section of the grid. The built-in montage program stitches the files together and writes out the final image in JPEG format. View the Image! \u00b6 Run the commands above so that you have the Mandelbrot image. When you create the image, you might wonder how you can view it. Use scp or sftp to copy the mandel.jpg back to your computer to view it.","title":"1.2 - A brief detour through the Mandelbrot set"},{"location":"materials/workflows/part1-ex2-mandelbrot/#workflows-exercise-12-a-brief-detour-through-the-mandelbrot-set","text":"Before we explore using DAGs to implement workflows, let\u2019s get a more interesting job. Let\u2019s make pretty pictures! We have a small program that draws pictures of the Mandelbrot set. You can read about the Mandelbrot set on Wikipedia , or you can simply appreciate the pretty pictures. It\u2019s a fractal. We have a simple program that can draw the Mandelbrot set. It's called goatbrot . Before beginning, ensure that you are connected to ap40.uw.osg-htc.org . Create a directory for this exercise and cd into it.","title":"Workflows Exercise 1.2: A Brief Detour Through the Mandelbrot Set"},{"location":"materials/workflows/part1-ex2-mandelbrot/#running-goatbrot-from-the-command-line","text":"You can generate the Mandelbrot set as a quick test with two simple commands. Download the goatbrot executable: username@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/goatbrot username@ap40 $ chmod a+x goatbrot Generate a PPM image of the Mandelbrot set: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c 0,0 -w 3 -s 1000,1000 The goatbrot program takes several parameters. Let's break them down: -i 1000 The number of iterations. Bigger numbers generate more accurate images but are slower to run. -o tile_000000_000000.ppm The output file to generate. -c 0,0 The center point of the image. Here it is the point (0,0). -w 3 The width of the image. Here it is 3. -s 1000,1000 The size of the final image. Here we generate a picture that is 1000 pixels wide and 1000 pixels tall. 
Convert the image to the JPEG format (using a built-in program called convert ): username@ap40 $ convert tile_000000_000000.ppm mandel.jpg","title":"Running goatbrot From the Command Line"},{"location":"materials/workflows/part1-ex2-mandelbrot/#dividing-the-work-into-smaller-pieces","text":"The Mandelbrot set can take a while to create, particularly if you make the iterations large or the image size large. What if we broke the creation of the image into multiple invocations (an HTC approach!) then stitched them together? Once we do that, we can run each goatbrot in parallel in our cluster. Here's an example you can run by hand. Run goatbrot 4 times: username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000000.ppm -c -0.75,0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000000_000001.ppm -c 0.75,0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000000.ppm -c -0.75,-0.75 -w 1.5 -s 500,500 username@ap40 $ ./goatbrot -i 1000 -o tile_000001_000001.ppm -c 0.75,-0.75 -w 1.5 -s 500,500 Stitch the small images together into the complete image (in JPEG format): username@ap40 $ montage tile_000000_000000.ppm tile_000000_000001.ppm tile_000001_000000.ppm tile_000001_000001.ppm -mode Concatenate -tile 2x2 mandel.jpg This will produce the same image as above. We divided the image space into a 2\u00d72 grid and ran goatbrot on each section of the grid. The built-in montage program stitches the files together and writes out the final image in JPEG format.","title":"Dividing the Work into Smaller Pieces"},{"location":"materials/workflows/part1-ex2-mandelbrot/#view-the-image","text":"Run the commands above so that you have the Mandelbrot image. When you create the image, you might wonder how you can view it. Use scp or sftp to copy the mandel.jpg back to your computer to view it.","title":"View the Image!"},{"location":"materials/workflows/part1-ex3-complex-dag/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Workflows Exercise 1.3: A More Complex DAG \u00b6 The objective of this exercise is to run a real set of jobs with DAGMan. Make Your Job Submission Files \u00b6 We'll run our goatbrot example. If you didn't read about it yet, please do so now . We are going to make a DAG with four simultaneous jobs ( goatbrot ) and one final node to stitch them together ( montage ). This means we have five jobs. We're going to run goatbrot with more iterations (100,000) so each job will take longer to run. You can create your five jobs. The goatbrot jobs are very similar to each other, but they have slightly different parameters and output files. 
goatbrot1.sub \u00b6 executable = goatbrot arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm log = goatbrot.log output = goatbrot.out.0.0 error = goatbrot.err.0.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot2.sub \u00b6 executable = goatbrot arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm log = goatbrot.log output = goatbrot.out.0.1 error = goatbrot.err.0.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot3.sub \u00b6 executable = goatbrot arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm log = goatbrot.log output = goatbrot.out.1.0 error = goatbrot.err.1.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue goatbrot4.sub \u00b6 executable = goatbrot arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm log = goatbrot.log output = goatbrot.out.1.1 error = goatbrot.err.1.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue montage.sub \u00b6 You should notice that the transfer_input_files statement refers to the files created by the other jobs. +SingularityImage = \"/cvmfs/singularity.opensciencegrid.org/htc/rocky:9\" executable = montage.sh arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandel-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Notice that the job specified by montage.sub uses a container image, as indicated by the +SingularityImage flag. This is because montage uses libraries that are not installed on the execution nodes. We use a container with montage installed and call it using the executable montage.sh ; thus we will need to create the file montage.sh . #!/bin/bash # Pass all arguments to montage montage \"$@\" Make your DAG \u00b6 In a file called goatbrot.dag , you have your DAG specification: JOB g1 goatbrot1.sub JOB g2 goatbrot2.sub JOB g3 goatbrot3.sub JOB g4 goatbrot4.sub JOB montage montage.sub PARENT g1 g2 g3 g4 CHILD montage Ask yourself: do you know how we ensure that all the goatbrot commands can run simultaneously and all of them will complete before we run the montage job? Running the DAG \u00b6 Submit your DAG: username@learn $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 71. ----------------------------------------------------------------------- Watch Your DAG \u00b6 Let\u2019s follow the progress of the whole DAG: Use the condor_watch_q command to keep an eye on the running jobs. See more information about this tool here . 
username@learn $ condor_watch_q If you're quick enough, you may have seen DAGMan running as the lone job, before it submitted additional job nodes: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 - 1 222059.0 [=============================================================================] Total: 1 jobs; 1 running Updated at 2024-07-28 13:52:57 DAGMan has submitted the goatbrot jobs, but they haven't started running yet BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 4 1 - 5 222059.0 ... 222063.0 [===============--------------------------------------------------------------] Total: 5 jobs; 4 idle, 1 running Updated at 2024-07-28 13:53:53 They're running BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 5 - 5 222059.0 ... 222063.0 [=============================================================================] Total: 5 jobs; 5 running Updated at 2024-07-28 13:54:33 They finished, but DAGMan hasn't noticed yet. It only checks periodically: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 4 - 5 222059.0 ... 222063.0 [##############################################################===============] Total: 5 jobs; 4 completed, 1 running Updated at 2024-07-28 13:55:13 Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. Examine your results. For some reason, goatbrot prints everything to stderr, not stdout. username@learn $ cat goatbrot.err.0.0 Complex image: Center: -0.75 + 0.75i Width: 1.5 Height: 1.5 Upper Left: -1.5 + 1.5i Lower Right: 0 + 0i Output image: Filename: tile_0_0.ppm Width, Height: 500, 500 Theme: beej Antialiased: no Mandelbrot: Max Iterations: 100000 Continuous: no Goatbrot: Multithreading: not supported in this build Completed: 100.0% Examine your log files ( goatbrot.log and montage.log ) and DAGMan output file ( goatbrot.dag.dagman.out ). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file? As you did earlier, transfer the resulting mandel-from-dag.jpg to your computer so that you can view the image. Does the image look correct? Clean up your results by removing all of the goatbrot.dag.* files if you like. Be careful to not delete the goatbrot.dag file. Bonus Challenge \u00b6 Re-run your DAG. When jobs are running, try condor_q -nobatch -dag . What does it do differently? Challenge, if you have time: Make a bigger DAG by making more tiles in the same area.","title":"1.3 - A more complex DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#workflows-exercise-13-a-more-complex-dag","text":"The objective of this exercise is to run a real set of jobs with DAGMan.","title":"Workflows Exercise 1.3: A More Complex DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#make-your-job-submission-files","text":"We'll run our goatbrot example. If you didn't read about it yet, please do so now . We are going to make a DAG with four simultaneous jobs ( goatbrot ) and one final node to stitch them together ( montage ). This means we have five jobs. We're going to run goatbrot with more iterations (100,000) so each job will take longer to run. You can create your five jobs. 
The goatbrot jobs are very similar to each other, but they have slightly different parameters and output files.","title":"Make Your Job Submission Files"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot1sub","text":"executable = goatbrot arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm log = goatbrot.log output = goatbrot.out.0.0 error = goatbrot.err.0.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot1.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot2sub","text":"executable = goatbrot arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm log = goatbrot.log output = goatbrot.out.0.1 error = goatbrot.err.0.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot2.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot3sub","text":"executable = goatbrot arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm log = goatbrot.log output = goatbrot.out.1.0 error = goatbrot.err.1.0 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot3.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#goatbrot4sub","text":"executable = goatbrot arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm log = goatbrot.log output = goatbrot.out.1.1 error = goatbrot.err.1.1 request_memory = 1GB request_disk = 1GB request_cpus = 1 queue","title":"goatbrot4.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#montagesub","text":"You should notice that the transfer_input_files statement refers to the files created by the other jobs. +SingularityImage = \"/cvmfs/singularity.opensciencegrid.org/htc/rocky:9\" executable = montage.sh arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandel-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Notice that the job specified by montage.sub uses a container image, as indicated by the +SingularityImage flag. This is because montage uses libraries that are not installed on the execution nodes. We use a container with montage installed and call it using the executable montage.sh ; thus we will need to create the file montage.sh . #!/bin/bash # Pass all arguments to montage montage \"$@\"","title":"montage.sub"},{"location":"materials/workflows/part1-ex3-complex-dag/#make-your-dag","text":"In a file called goatbrot.dag , you have your DAG specification: JOB g1 goatbrot1.sub JOB g2 goatbrot2.sub JOB g3 goatbrot3.sub JOB g4 goatbrot4.sub JOB montage montage.sub PARENT g1 g2 g3 g4 CHILD montage Ask yourself: do you know how we ensure that all the goatbrot commands can run simultaneously and all of them will complete before we run the montage job?","title":"Make your DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#running-the-dag","text":"Submit your DAG: username@learn $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 71. 
-----------------------------------------------------------------------","title":"Running the DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#watch-your-dag","text":"Let\u2019s follow the progress of the whole DAG: Use the condor_watch_q command to keep an eye on the running jobs. See more information about this tool here . username@learn $ condor_watch_q If you're quick enough, you may have seen DAGMan running as the lone job, before it submitted additional job nodes: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 - 1 222059.0 [=============================================================================] Total: 1 jobs; 1 running Updated at 2024-07-28 13:52:57 DAGMan has submitted the goatbrot jobs, but they haven't started running yet BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 4 1 - 5 222059.0 ... 222063.0 [===============--------------------------------------------------------------] Total: 5 jobs; 4 idle, 1 running Updated at 2024-07-28 13:53:53 They're running BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 5 - 5 222059.0 ... 222063.0 [=============================================================================] Total: 5 jobs; 5 running Updated at 2024-07-28 13:54:33 They finished, but DAGMan hasn't noticed yet. It only checks periodically: BATCH IDLE RUN DONE TOTAL JOB_IDS goatbrot.dag+222059 - 1 4 - 5 222059.0 ... 222063.0 [##############################################################===============] Total: 5 jobs; 4 completed, 1 running Updated at 2024-07-28 13:55:13 Eventually, you'll see the montage job submitted, then running, then leave the queue, and then DAGMan will leave the queue. Examine your results. For some reason, goatbrot prints everything to stderr, not stdout. username@learn $ cat goatbrot.err.0.0 Complex image: Center: -0.75 + 0.75i Width: 1.5 Height: 1.5 Upper Left: -1.5 + 1.5i Lower Right: 0 + 0i Output image: Filename: tile_0_0.ppm Width, Height: 500, 500 Theme: beej Antialiased: no Mandelbrot: Max Iterations: 100000 Continuous: no Goatbrot: Multithreading: not supported in this build Completed: 100.0% Examine your log files ( goatbrot.log and montage.log ) and DAGMan output file ( goatbrot.dag.dagman.out ). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file? As you did earlier, transfer the resulting mandel-from-dag.jpg to your computer so that you can view the image. Does the image look correct? Clean up your results by removing all of the goatbrot.dag.* files if you like. Be careful to not delete the goatbrot.dag file.","title":"Watch Your DAG"},{"location":"materials/workflows/part1-ex3-complex-dag/#bonus-challenge","text":"Re-run your DAG. When jobs are running, try condor_q -nobatch -dag . What does it do differently? Challenge, if you have time: Make a bigger DAG by making more tiles in the same area.","title":"Bonus Challenge"},{"location":"materials/workflows/part1-ex4-failed-dag/","text":"Workflows Exercise 1.4: Handling a DAG That Fails \u00b6 The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures. Background \u00b6 DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed. Breaking Things \u00b6 Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. 
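If you want to see this exit-code convention in action before breaking anything, you can check any command's exit status in the shell with $?. For example, using the standard true and false commands:
/bin/true
echo $?    # prints 0: DAGMan would count a node that exits like this as a success
/bin/false
echo $?    # prints 1: a non-zero status like this is what DAGMan counts as a failure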
Work in the same directory where you did the last DAG. Edit montage.sub to add a -h to the arguments. It will look like this, with the -h at the beginning of the highlighted line: executable = /usr/bin/montage arguments = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Submit the DAG again: username@ap40 $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 77. ----------------------------------------------------------------------- Use watch to watch the jobs until they finish. In a separate window, use tail --lines=500 -f goatbrot.dag.dagman.out to watch what DAGMan does. 06/22/24 17:57:41 Setting maximum accepts per cycle 8. 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP 06/22/24 17:57:41 ** /usr/bin/condor_dagman 06/22/24 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/22/24 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/22/24 17:57:41 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/22/24 17:57:41 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/22/24 17:57:41 ** PID = 26867 06/22/24 17:57:41 ** Log last touched time unavailable (No such file or directory) 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 Using config source: /etc/condor/condor_config 06/22/24 17:57:41 Using local config sources: 06/22/24 17:57:41 /etc/condor/config.d/00-chtc-global.conf 06/22/24 17:57:41 /etc/condor/config.d/01-chtc-submit.conf 06/22/24 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf 06/22/24 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf 06/22/24 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf 06/22/24 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf 06/22/24 17:57:41 /etc/condor/config.d/99-roy-extras.conf 06/22/24 17:57:41 /etc/condor/condor_config.local Below is where DAGMan realizes that the montage node failed: 06/22/24 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) 06/22/24 18:08:42 Node montage job proc (82.0.0) failed with status 1. 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Of 5 nodes total: 06/22/24 18:08:42 Done Pre Queued Post Ready Un-Ready Failed 06/22/24 18:08:42 === === === === === === === 06/22/24 18:08:42 4 0 0 0 0 0 1 06/22/24 18:08:42 0 job proc(s) currently held 06/22/24 18:08:42 Aborting DAG... 06/22/24 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 
06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of node category throttles 06/22/24 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/22/24 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/22/24 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. username@ap40 $ cat goatbrot.dag.rescue001 # Rescue DAG file, created after running # the goatbrot.dag DAG file # Created 6 /22/2024 23 :08:42 UTC # Rescue DAG version: 2 .0.1 ( partial ) # # Total number of Nodes: 5 # Nodes premarked DONE: 4 # Nodes that failed: 1 # montage, DONE g1 DONE g2 DONE g3 DONE g4 From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending -h argument. Change montage.sub to look like: executable = /usr/bin/montage arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. username@ap40 $ condor_submit_dag goatbrot.dag Running rescue DAG 1 ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 83. ----------------------------------------------------------------------- username@ap40 $ tail -f goatbrot.dag.dagman.out 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP 06/23/24 11:30:53 ** /usr/bin/condor_dagman 06/23/24 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/23/24 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/23/24 11:30:53 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/23/24 11:30:53 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/23/24 11:30:53 ** PID = 28576 06/23/24 11:30:53 ** Log last touched 6/22 18:08:42 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 Using config source: /etc/condor/condor_config ... Here is where DAGMAN notices that there is a rescue DAG 06/23/24 11:30:53 Parsing 1 dagfiles 06/23/24 11:30:53 Parsing goatbrot.dag ... 
06/23/24 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file 06/23/24 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 06/23/24 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 06/23/24 11:30:53 Dag contains 5 total jobs Shortly thereafter it sees that four jobs have already finished. 06/23/24 11:31:05 Bootstrapping... 06/23/24 11:31:05 Number of pre-completed nodes: 4 06/23/24 11:31:05 Registering condor_event_timer... 06/23/24 11:31:06 Sleeping for one second for log file consistency 06/23/24 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log Here is where DAGMan resubmits the montage job and waits for it to complete. 06/23/24 11:31:07 Submitting Condor Node montage job(s)... 06/23/24 11:31:07 submitting: condor_submit -a dag_node_name' '=' 'montage -a +DAGManJobId' '=' '83 -a DAGManJobId' '=' '83 -a submit_event_notes' '=' 'DAG' 'Node:' 'montage -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '\"g1,g2,g3,g4\" montage.sub 06/23/24 11:31:07 From submit: Submitting job(s). 06/23/24 11:31:07 From submit: 1 job(s) submitted to cluster 84. 06/23/24 11:31:07 assigned Condor ID (84.0.0) 06/23/24 11:31:07 Just submitted 1 job this cycle... 06/23/24 11:31:07 Currently monitoring 1 Condor log file(s) 06/23/24 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) 06/23/24 11:31:07 Number of idle job procs: 1 06/23/24 11:31:07 Of 5 nodes total: 06/23/24 11:31:07 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:31:07 === === === === === === === 06/23/24 11:31:07 4 0 1 0 0 0 0 06/23/24 11:31:07 0 job proc(s) currently held 06/23/24 11:40:22 Currently monitoring 1 Condor log file(s) 06/23/24 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) This is where the montage finished. 06/23/24 11:40:22 Node montage job proc (84.0.0) completed successfully. 06/23/24 11:40:22 Node montage job completed 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Of 5 nodes total: 06/23/24 11:40:22 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:40:22 === === === === === === === 06/23/24 11:40:22 5 0 0 0 0 0 0 06/23/24 11:40:22 0 job proc(s) currently held And here DAGMan decides that the work is all done. 06/23/24 11:40:22 All jobs Completed! 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of node category throttles 06/23/24 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/23/24 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/23/24 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 Success! Now go ahead and clean up. Bonus Challenge \u00b6 If you have time, add an extra node to the DAG. Copy our original \"simple\" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. This is reasonable--many real world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure. Write a POST script that checks the return value. 
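As a starting point, here is a minimal sketch of what that could look like; the node name extra and the script name extra-post.sh are made up for this example, and you should verify the SCRIPT POST syntax and the $RETURN macro against the manual. In the DAG file, the new node might get a line such as:
SCRIPT POST extra extra-post.sh $RETURN
And extra-post.sh could be:
#!/bin/bash
# DAGMan substitutes the node job's exit code for $RETURN on the SCRIPT POST line,
# so it arrives here as the first argument.
# Treat an exit code of 0 or 1 as success; anything else remains a failure.
if [ $1 -eq 0 ] || [ $1 -eq 1 ]; then
    exit 0
fi
exit 1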
Check the HTCondor manual to see how to describe your post script.","title":"1.4 - Handling jobs that fail with DAGMan"},{"location":"materials/workflows/part1-ex4-failed-dag/#workflows-exercise-14-handling-a-dag-that-fails","text":"The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures.","title":"Workflows Exercise 1.4: Handling a DAG That Fails"},{"location":"materials/workflows/part1-ex4-failed-dag/#background","text":"DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed.","title":"Background"},{"location":"materials/workflows/part1-ex4-failed-dag/#breaking-things","text":"Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. Work in the same directory where you did the last DAG. Edit montage.sub to add a -h to the arguments. It will look like this, with the -h at the beginning of the highlighted line: executable = /usr/bin/montage arguments = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Submit the DAG again: username@ap40 $ condor_submit_dag goatbrot.dag ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 77. ----------------------------------------------------------------------- Use watch to watch the jobs until they finish. In a separate window, use tail --lines=500 -f goatbrot.dag.dagman.out to watch what DAGMan does. 06/22/24 17:57:41 Setting maximum accepts per cycle 8. 
06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP 06/22/24 17:57:41 ** /usr/bin/condor_dagman 06/22/24 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/22/24 17:57:41 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/22/24 17:57:41 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/22/24 17:57:41 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/22/24 17:57:41 ** PID = 26867 06/22/24 17:57:41 ** Log last touched time unavailable (No such file or directory) 06/22/24 17:57:41 ****************************************************** 06/22/24 17:57:41 Using config source: /etc/condor/condor_config 06/22/24 17:57:41 Using local config sources: 06/22/24 17:57:41 /etc/condor/config.d/00-chtc-global.conf 06/22/24 17:57:41 /etc/condor/config.d/01-chtc-submit.conf 06/22/24 17:57:41 /etc/condor/config.d/02-chtc-flocking.conf 06/22/24 17:57:41 /etc/condor/config.d/03-chtc-jobrouter.conf 06/22/24 17:57:41 /etc/condor/config.d/04-chtc-blacklist.conf 06/22/24 17:57:41 /etc/condor/config.d/99-osg-ss-group.conf 06/22/24 17:57:41 /etc/condor/config.d/99-roy-extras.conf 06/22/24 17:57:41 /etc/condor/condor_config.local Below is where DAGMan realizes that the montage node failed: 06/22/24 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0) 06/22/24 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0) 06/22/24 18:08:42 Node montage job proc (82.0.0) failed with status 1. 06/22/24 18:08:42 Number of idle job procs: 0 06/22/24 18:08:42 Of 5 nodes total: 06/22/24 18:08:42 Done Pre Queued Post Ready Un-Ready Failed 06/22/24 18:08:42 === === === === === === === 06/22/24 18:08:42 4 0 0 0 0 0 1 06/22/24 18:08:42 0 job proc(s) currently held 06/22/24 18:08:42 Aborting DAG... 06/22/24 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001... 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/22/24 18:08:42 Note: 0 total job deferrals because of node category throttles 06/22/24 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/22/24 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/22/24 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1 DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the rescue DAG? Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed. username@ap40 $ cat goatbrot.dag.rescue001 # Rescue DAG file, created after running # the goatbrot.dag DAG file # Created 6 /22/2024 23 :08:42 UTC # Rescue DAG version: 2 .0.1 ( partial ) # # Total number of Nodes: 5 # Nodes premarked DONE: 4 # Nodes that failed: 1 # montage, DONE g1 DONE g2 DONE g3 DONE g4 From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending -h argument. 
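You can make this change in a text editor, or with a one-line sed command; this is just a convenience sketch that assumes montage.sub is in your current directory and that its arguments line still begins exactly as shown above:
# Strip the leading -h from the arguments line; the original file is kept as montage.sub.bak
sed -i.bak 's/^arguments = -h /arguments = /' montage.sub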
Change montage.sub to look like: executable = /usr/bin/montage arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm output = montage.out error = montage.err log = montage.log request_memory = 1GB request_disk = 1GB request_cpus = 1 queue Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG. username@ap40 $ condor_submit_dag goatbrot.dag Running rescue DAG 1 ----------------------------------------------------------------------- File for submitting this DAG to Condor : goatbrot.dag.condor.sub Log of DAGMan debugging messages : goatbrot.dag.dagman.out Log of Condor library output : goatbrot.dag.lib.out Log of Condor library error messages : goatbrot.dag.lib.err Log of the life of condor_dagman itself : goatbrot.dag.dagman.log Submitting job(s). 1 job(s) submitted to cluster 83. ----------------------------------------------------------------------- username@ap40 $ tail -f goatbrot.dag.dagman.out 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP 06/23/24 11:30:53 ** /usr/bin/condor_dagman 06/23/24 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1) 06/23/24 11:30:53 ** Configuration: subsystem:DAGMAN local: class:DAEMON 06/23/24 11:30:53 ** $CondorVersion: 23.9.0 2024-07-02 BuildID: 742617 PackageID: 23.9.0-0.742617 GitSHA: 5acb07ea RC $ 06/23/24 11:30:53 ** $CondorPlatform: x86_64_AlmaLinux9 $ 06/23/24 11:30:53 ** PID = 28576 06/23/24 11:30:53 ** Log last touched 6/22 18:08:42 06/23/24 11:30:53 ****************************************************** 06/23/24 11:30:53 Using config source: /etc/condor/condor_config ... Here is where DAGMAN notices that there is a rescue DAG 06/23/24 11:30:53 Parsing 1 dagfiles 06/23/24 11:30:53 Parsing goatbrot.dag ... 06/23/24 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file 06/23/24 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 06/23/24 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001 06/23/24 11:30:53 Dag contains 5 total jobs Shortly thereafter it sees that four jobs have already finished. 06/23/24 11:31:05 Bootstrapping... 06/23/24 11:31:05 Number of pre-completed nodes: 4 06/23/24 11:31:05 Registering condor_event_timer... 06/23/24 11:31:06 Sleeping for one second for log file consistency 06/23/24 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log Here is where DAGMan resubmits the montage job and waits for it to complete. 06/23/24 11:31:07 Submitting Condor Node montage job(s)... 06/23/24 11:31:07 submitting: condor_submit -a dag_node_name' '=' 'montage -a +DAGManJobId' '=' '83 -a DAGManJobId' '=' '83 -a submit_event_notes' '=' 'DAG' 'Node:' 'montage -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '\"g1,g2,g3,g4\" montage.sub 06/23/24 11:31:07 From submit: Submitting job(s). 06/23/24 11:31:07 From submit: 1 job(s) submitted to cluster 84. 06/23/24 11:31:07 assigned Condor ID (84.0.0) 06/23/24 11:31:07 Just submitted 1 job this cycle... 
06/23/24 11:31:07 Currently monitoring 1 Condor log file(s) 06/23/24 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0) 06/23/24 11:31:07 Number of idle job procs: 1 06/23/24 11:31:07 Of 5 nodes total: 06/23/24 11:31:07 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:31:07 === === === === === === === 06/23/24 11:31:07 4 0 1 0 0 0 0 06/23/24 11:31:07 0 job proc(s) currently held 06/23/24 11:40:22 Currently monitoring 1 Condor log file(s) 06/23/24 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0) 06/23/24 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0) This is where the montage finished. 06/23/24 11:40:22 Node montage job proc (84.0.0) completed successfully. 06/23/24 11:40:22 Node montage job completed 06/23/24 11:40:22 Number of idle job procs: 0 06/23/24 11:40:22 Of 5 nodes total: 06/23/24 11:40:22 Done Pre Queued Post Ready Un-Ready Failed 06/23/24 11:40:22 === === === === === === === 06/23/24 11:40:22 5 0 0 0 0 0 0 06/23/24 11:40:22 0 job proc(s) currently held And here DAGMan decides that the work is all done. 06/23/24 11:40:22 All jobs Completed! 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0) 06/23/24 11:40:22 Note: 0 total job deferrals because of node category throttles 06/23/24 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0) 06/23/24 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0) 06/23/24 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0 Success! Now go ahead and clean up.","title":"Breaking Things"},{"location":"materials/workflows/part1-ex4-failed-dag/#bonus-challenge","text":"If you have time, add an extra node to the DAG. Copy our original \"simple\" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. This is reasonable--many real world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure. Write a POST script that checks the return value. Check the HTCondor manual to see how to describe your post script.","title":"Bonus Challenge"},{"location":"materials/workflows/part1-ex5-challenges/","text":"pre em { font-style: normal; background-color: yellow; } pre strong { font-style: normal; font-weight: bold; color: \\#008; } Bonus Workflows Exercise 1.5: YOUR Jobs and More on Workflows \u00b6 The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job. Challenge 1 \u00b6 Do you have any extra computation that needs to be done? Real work, from your life outside this summer school? If so, try it out on our HTCondor pool. Can't think of something? How about one of the existing distributed computing programs like distributed.net , SETI@home , Einstien@Home or others that you know. We prefer that you do your own work rather than one of these projects, but they are options. Challenge 2 \u00b6 Try to generate other Mandelbrot images. 
Some possible locations to look at with goatbroat: goatbrot -i 1000 -o ex1.ppm -c 0.0016437219722,-0.8224676332988 -w 2e-11 -s 1000,1000 goatbrot -i 1000 -o ex2.ppm -c 0.3958608398437499,-0.13431445312500012 -w 0.0002197265625 -s 1000,1000 goatbrot -i 1000 -o ex3.ppm -c 0.3965859374999999,-0.13378125000000013 -w 0.003515625 -s 1000,1000 You can convert ppm files with convert , like so: convert ex1.ppm ex1.jpg Now make a movie! Make a series of images where you zoom into a point in the Mandelbrot set gradually. (Those points above may work well.) Assemble these images with the \"convert\" tool which will let you convert a set of JPEG files into an MPEG movie. Challenge 3 \u00b6 Try out Pegasus. Pegasus is a workflow manager that uses DAGMan and can work in a grid environment and/or run across different types of clusters (with other queueing software). It will create the DAGs from abstract DAG descriptions and ensure they are appropriate for the location of the data and computation. Links to more information: Pegasus Website Pegasus Documentation Pegasus on OSG If you have any questions or problems, please feel free to contact the Pegasus team by emailing pegasus-support@isi.edu","title":"1.5 - Workflow Challenges"},{"location":"materials/workflows/part1-ex5-challenges/#bonus-workflows-exercise-15-your-jobs-and-more-on-workflows","text":"The objective of this exercise is to learn the very basics of running a set of jobs, where our set is just one job.","title":"Bonus Workflows Exercise 1.5: YOUR Jobs and More on Workflows"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-1","text":"Do you have any extra computation that needs to be done? Real work, from your life outside this summer school? If so, try it out on our HTCondor pool. Can't think of something? How about one of the existing distributed computing programs like distributed.net , SETI@home , Einstien@Home or others that you know. We prefer that you do your own work rather than one of these projects, but they are options.","title":"Challenge 1"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-2","text":"Try to generate other Mandelbrot images. Some possible locations to look at with goatbroat: goatbrot -i 1000 -o ex1.ppm -c 0.0016437219722,-0.8224676332988 -w 2e-11 -s 1000,1000 goatbrot -i 1000 -o ex2.ppm -c 0.3958608398437499,-0.13431445312500012 -w 0.0002197265625 -s 1000,1000 goatbrot -i 1000 -o ex3.ppm -c 0.3965859374999999,-0.13378125000000013 -w 0.003515625 -s 1000,1000 You can convert ppm files with convert , like so: convert ex1.ppm ex1.jpg Now make a movie! Make a series of images where you zoom into a point in the Mandelbrot set gradually. (Those points above may work well.) Assemble these images with the \"convert\" tool which will let you convert a set of JPEG files into an MPEG movie.","title":"Challenge 2"},{"location":"materials/workflows/part1-ex5-challenges/#challenge-3","text":"Try out Pegasus. Pegasus is a workflow manager that uses DAGMan and can work in a grid environment and/or run across different types of clusters (with other queueing software). It will create the DAGs from abstract DAG descriptions and ensure they are appropriate for the location of the data and computation. 
Links to more information: Pegasus Website Pegasus Documentation Pegasus on OSG If you have any questions or problems, please feel free to contact the Pegasus team by emailing pegasus-support@isi.edu","title":"Challenge 3"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index b1b3f01a0dfac18872edee2b2b8d4d0cd2a79d30..33858bec26696e99bac745553d4417686d4088ab 100644 GIT binary patch delta 15 WcmdnMwt