# Run Pegasus Workflow across Multiple HTCondor Pools in Flocking Mode
The workflow submitter consists of an HTCondor master and a submitter running in flocking mode, with Pegasus installed on the submitter. The workflow submitter flocks workflow jobs to other HTCondor pools that have available resources.
To install the workflow submitter, follow the instructions on the main page to install the HTCondor master and submitter using Docker. To enable flocking:

- Set `TCP_FORWARDING_HOST` in `condor_config.local.submitter` to the public/external IP address of the host (see the sketch after this list).
- Set `FLOCK_TO` to a comma-separated list of addresses of the HTCondor pools to which workflow jobs will flock.
- Provide proper credentials, certificates, and keys for flocking to secured HTCondor pools.
- Run `docker_run_htcondor.sh -f` and wait about 60 seconds for the submitter to be ready.
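For illustration, the flocking-related lines of `condor_config.local.submitter` might look like the sketch below; the IP address and pool hostnames are placeholders, not values from this repository:

```
# condor_config.local.submitter (illustrative values)
# Public/external IP of this host, so remote pools can reach the submitter
TCP_FORWARDING_HOST = 203.0.113.10
# Pools that workflow jobs may flock to (hypothetical hostnames)
FLOCK_TO = pool-a.example.org, pool-b.example.org
```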
The HTCondor pool offers resources for running flocked jobs. As of this writing, it can be installed on a single node running Ubuntu 14.04 using `condor_setup/ubuntu_install.sh <flock-from-list>`, where `<flock-from-list>` is a comma-separated list of addresses of the hosts from which workflow jobs will flock.
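For example, to let the pool accept flocked jobs from two submitters (hypothetical hostnames):

```
# Run on the pool node (Ubuntu 14.04); hostnames are placeholders
./condor_setup/ubuntu_install.sh submitter-a.example.org,submitter-b.example.org
```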
Note that the default HTCondor deployment uses a low security level (`CLAIMTOBE`) and requires no authentication for flocking jobs. This default security configuration should only be used for development and testing; for a more sophisticated security configuration, please refer to this page.
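To make the warning concrete: in HTCondor configuration terms, the permissive default corresponds roughly to the sketch below (an assumption about the shipped defaults, not a line copied from this repository):

```
# Low security (sketch): peers are simply trusted to be who they claim to be
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
```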
There is an example Pegasus workflow under the home directory of `condor_pool` on the HTCondor submitter. The workflow consists of a configurable number of parallel jobs, each of which sleeps for 120 seconds and then writes its finishing time and hostname as output. To submit the workflow:
- Log in to the HTCondor submitter: `docker exec -it condor-submitter /bin/bash`
- Switch to user `condor_pool` and enter the workflow directory: `su - condor_pool; cd ~/example`
- Submit the workflow: `./submit <level-of-parallelism>` (an example session follows this list)
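Put together, a session might look like this; `48` is an arbitrary value for the level of parallelism, chosen here to match the 48 pool slots shown below:

```
docker exec -it condor-submitter /bin/bash
su - condor_pool
cd ~/example
./submit 48    # example: run 48 sleep jobs in parallel
```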
Once the workflow is submitted, you can observe that slots on the destination HTCondor pool become claimed and busy:
```
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot2@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot3@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot4@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot5@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot6@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot7@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot8@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot9@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot10@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot11@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot12@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot13@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot14@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot15@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot16@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot17@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:03
slot18@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot19@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot20@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot21@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot22@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot23@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot24@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot25@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot26@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot27@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot28@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot29@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot30@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot31@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot32@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot33@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot34@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot35@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot36@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot37@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot38@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot39@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
slot40@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot41@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot42@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot43@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot44@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot45@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot46@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot47@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:01
slot48@flock LINUX X86_64 Claimed Busy 0.000 2681 0+00:00:02
              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
 X86_64/LINUX    48     0      48         0       0          0        0     0
        Total    48     0      48         0       0          0        0     0
```
On the submitter, `pegasus-status` shows the jobs running in parallel:
```
[condor_pool@condor-submitter example]$ pegasus-status -l /home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000
STAT IN_STATE JOB
Run 03:44 sleep-0 ( /home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000 )
Run 01:59 ┣━sleep_ID0000042
Run 01:59 ┣━sleep_ID0000041
Run 01:59 ┣━sleep_ID0000040
Run 01:59 ┣━sleep_ID0000002
Run 01:59 ┣━sleep_ID0000046
Run 01:59 ┣━sleep_ID0000001
Run 01:59 ┣━sleep_ID0000045
Run 01:59 ┣━sleep_ID0000044
Run 01:59 ┣━sleep_ID0000043
Run 01:59 ┣━sleep_ID0000017
Run 01:59 ┣━sleep_ID0000016
Run 01:59 ┣━sleep_ID0000015
Run 01:59 ┣━sleep_ID0000059
Run 01:59 ┣━sleep_ID0000014
Run 01:59 ┣━sleep_ID0000058
Run 01:59 ┣━sleep_ID0000019
Run 01:59 ┣━sleep_ID0000018
Run 01:59 ┣━sleep_ID0000053
Run 01:59 ┣━sleep_ID0000052
Run 01:59 ┣━sleep_ID0000051
Run 01:59 ┣━sleep_ID0000050
Run 01:59 ┣━sleep_ID0000013
Run 01:59 ┣━sleep_ID0000057
Run 01:59 ┣━sleep_ID0000012
Run 01:59 ┣━sleep_ID0000056
Run 01:59 ┣━sleep_ID0000011
Run 01:59 ┣━sleep_ID0000055
Run 01:59 ┣━sleep_ID0000010
Run 01:59 ┣━sleep_ID0000054
Run 01:59 ┣━sleep_ID0000028
Run 01:59 ┣━sleep_ID0000027
Run 01:59 ┣━sleep_ID0000026
Run 01:59 ┣━sleep_ID0000025
Run 01:59 ┣━sleep_ID0000029
Run 01:59 ┣━sleep_ID0000020
Run 01:59 ┣━sleep_ID0000024
Run 01:59 ┣━sleep_ID0000023
Run 01:59 ┣━sleep_ID0000022
Run 00:54 ┣━sleep_ID0000021
Run 00:52 ┣━sleep_ID0000060
Run 00:52 ┣━sleep_ID0000039
Run 00:49 ┣━sleep_ID0000038
Run 00:46 ┣━sleep_ID0000037
Run 00:43 ┣━sleep_ID0000036
Run 00:40 ┣━sleep_ID0000031
Run 00:38 ┣━sleep_ID0000030
Run 00:37 ┣━sleep_ID0000035
Run 00:35 ┣━sleep_ID0000034
Idle 02:09 ┣━sleep_ID0000033
Idle 02:09 ┗━sleep_ID0000032
Summary: 51 Condor jobs total (I:2 R:49)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
10 0 0 50 0 12 0 16.7 Running */home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000/sleep-0.dag
Summary: 1 DAG total (Running:1)
```