
Run Pegasus Workflow across Multiple HTCondor Pools in Flocking mode

dcvan24 edited this page Sep 13, 2017 · 3 revisions

Install workflow submitter

The workflow submitter consists of an HTCondor master and an HTCondor submitter that run in flocking mode. Pegasus is installed on the HTCondor submitter. The workflow submitter flocks workflow jobs to other HTCondor pools that have available resources.

To install the workflow submitter, follow the instructions on the main page to install the HTCondor master and submitter using Docker. To enable flocking:

  1. Set TCP_FORWARDING_HOST in condor_config.local.submitter to the public/external IP address of the host
  2. Set FLOCK_TO to a comma-separated list of the HTCondor pool addresses that the workflow jobs will flock to
  3. Provide proper credentials, certificates and keys for flocking to secured HTCondor pools
  4. Run docker_run_htcondor.sh -f and wait for 60 seconds for the submitter to be ready
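
As an illustration of steps 1 and 2, the relevant lines in condor_config.local.submitter might look like the fragment below. This is only a sketch: the IP address and pool host names are placeholders, not values from this repository, and the rest of the file is left as shipped.

```
# condor_config.local.submitter (fragment; all values are placeholders)

# Public/external IP address of the host running the submitter container
TCP_FORWARDING_HOST = 203.0.113.10

# HTCondor pools that workflow jobs may flock to, comma-separated
FLOCK_TO = pool-a.example.org, pool-b.example.org
```

After editing the file, run docker_run_htcondor.sh -f as described in step 4 so that the submitter starts with the new settings.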

Deploy an HTCondor pool

The HTCondor pool offers resources for running flocked jobs. As of this writing, it can be installed on a single node running Ubuntu 14.04 using condor_setup/ubuntu_install.sh <flock-from-list>, where <flock-from-list> is a comma-separated list of the host addresses from which the workflow jobs will flock.
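
A minimal sketch of the install command with two submitters allowed to flock to this pool. The host names are hypothetical, and the sketch assumes the command is run from the directory that contains condor_setup/.

```shell
# Hypothetical submitter addresses; replace with the hosts that will flock jobs here.
flock_from="submitter-a.example.org,submitter-b.example.org"

# Run the installer only if it is present (assumes the current directory
# is the one that contains condor_setup/).
if [ -x condor_setup/ubuntu_install.sh ]; then
  condor_setup/ubuntu_install.sh "$flock_from"
else
  echo "condor_setup/ubuntu_install.sh not found; change to the directory that contains condor_setup/" >&2
fi
```

The list is passed as a single argument, so it must not contain spaces between entries unless it is quoted as above.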

Note that this HTCondor deployment defaults to a low security level (CLAIMTOBE) and requires no authentication for flocking jobs. Use the default security configuration only for development and testing. For a more sophisticated security configuration, please refer to this page.

Submit a workflow

There is an example Pegasus workflow under the home directory of condor_pool in the HTCondor submitter. The workflow consists of a configurable number of parallel jobs, each of which sleeps for 120 seconds and writes its finishing time and hostname as output. To submit the workflow:

  1. Log in to the HTCondor submitter: docker exec -it condor-submitter /bin/bash
  2. Switch to user condor_pool and enter the workflow directory: su - condor_pool; cd ~/example
  3. Submit the workflow: ./submit <level-of-parallelism>

On the destination HTCondor pool, condor_status shows the slots as claimed and busy:

$ condor_status
Name         OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot2@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot3@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot4@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot5@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot6@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot7@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot8@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot9@flock  LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot10@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot11@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot12@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot13@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot14@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot15@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot16@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot17@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:03
slot18@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot19@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot20@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot21@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot22@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot23@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot24@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot25@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot26@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot27@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot28@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot29@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot30@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot31@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot32@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot33@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot34@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot35@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot36@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot37@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot38@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot39@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02
slot40@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot41@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot42@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot43@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot44@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot45@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot46@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot47@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:01
slot48@flock LINUX      X86_64 Claimed   Busy      0.000 2681  0+00:00:02

                     Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

        X86_64/LINUX    48     0      48         0       0          0        0      0

               Total    48     0      48         0       0          0        0      0
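
To check this without reading the whole table, the claimed slots can be counted from the condor_status output. The helper below is a rough sketch: count_claimed is a hypothetical name, and it assumes the column layout shown above (State in the fourth column, Activity in the fifth), which may differ in other HTCondor versions.

```shell
# count_claimed: read condor_status output on stdin and print the number
# of slots whose State is "Claimed" and whose Activity is "Busy".
# Header and summary lines are skipped because their fourth and fifth
# columns do not match these values.
count_claimed() {
  awk '$4 == "Claimed" && $5 == "Busy" { n++ } END { print n + 0 }'
}

# Usage on the pool node (requires condor_status on PATH):
#   condor_status | count_claimed
```

For the output above, this should agree with the Claimed column of the Total line.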

On the submitter, pegasus-status shows the jobs running in parallel:

[condor_pool@condor-submitter example]$ pegasus-status -l /home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000
STAT  IN_STATE  JOB                                                                                      
Run      03:44  sleep-0 ( /home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000 )
Run      01:59   ┣━sleep_ID0000042                                                                       
Run      01:59   ┣━sleep_ID0000041                                                                       
Run      01:59   ┣━sleep_ID0000040                                                                       
Run      01:59   ┣━sleep_ID0000002                                                                       
Run      01:59   ┣━sleep_ID0000046                                                                       
Run      01:59   ┣━sleep_ID0000001                                                                       
Run      01:59   ┣━sleep_ID0000045                                                                       
Run      01:59   ┣━sleep_ID0000044                                                                       
Run      01:59   ┣━sleep_ID0000043                                                                       
Run      01:59   ┣━sleep_ID0000017                                                                       
Run      01:59   ┣━sleep_ID0000016                                                                       
Run      01:59   ┣━sleep_ID0000015                                                                       
Run      01:59   ┣━sleep_ID0000059                                                                       
Run      01:59   ┣━sleep_ID0000014                                                                       
Run      01:59   ┣━sleep_ID0000058                                                                       
Run      01:59   ┣━sleep_ID0000019                                                                       
Run      01:59   ┣━sleep_ID0000018                                                                       
Run      01:59   ┣━sleep_ID0000053                                                                       
Run      01:59   ┣━sleep_ID0000052                                                                       
Run      01:59   ┣━sleep_ID0000051                                                                       
Run      01:59   ┣━sleep_ID0000050                                                                       
Run      01:59   ┣━sleep_ID0000013                                                                       
Run      01:59   ┣━sleep_ID0000057                                                                       
Run      01:59   ┣━sleep_ID0000012                                                                       
Run      01:59   ┣━sleep_ID0000056                                                                       
Run      01:59   ┣━sleep_ID0000011                                                                       
Run      01:59   ┣━sleep_ID0000055                                                                       
Run      01:59   ┣━sleep_ID0000010                                                                       
Run      01:59   ┣━sleep_ID0000054                                                                       
Run      01:59   ┣━sleep_ID0000028                                                                       
Run      01:59   ┣━sleep_ID0000027                                                                       
Run      01:59   ┣━sleep_ID0000026                                                                       
Run      01:59   ┣━sleep_ID0000025                                                                       
Run      01:59   ┣━sleep_ID0000029                                                                       
Run      01:59   ┣━sleep_ID0000020                                                                       
Run      01:59   ┣━sleep_ID0000024                                                                       
Run      01:59   ┣━sleep_ID0000023                                                                       
Run      01:59   ┣━sleep_ID0000022                                                                       
Run      00:54   ┣━sleep_ID0000021                                                                       
Run      00:52   ┣━sleep_ID0000060                                                                       
Run      00:52   ┣━sleep_ID0000039                                                                       
Run      00:49   ┣━sleep_ID0000038                                                                       
Run      00:46   ┣━sleep_ID0000037                                                                       
Run      00:43   ┣━sleep_ID0000036                                                                       
Run      00:40   ┣━sleep_ID0000031                                                                       
Run      00:38   ┣━sleep_ID0000030                                                                       
Run      00:37   ┣━sleep_ID0000035                                                                       
Run      00:35   ┣━sleep_ID0000034                                                                       
Idle     02:09   ┣━sleep_ID0000033                                                                       
Idle     02:09   ┗━sleep_ID0000032                                                                       
Summary: 51 Condor jobs total (I:2 R:49)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME                                                                                   
   10     0     0    50     0    12     0  16.7 Running */home/condor_pool/example/dags/condor_pool/pegasus/sleep/20170913T190452+0000/sleep-0.dag
Summary: 1 DAG total (Running:1)