
Scale up submission


How to make Harvester submit more workers

When one finds that Harvester does not submit enough workers (in a cycle) to fill a PQ, go through the following steps.

Acronyms:

  • PQ = PanDA Queue

Check Queue Configuration Parameters

First of all, check the queue configuration of the PQ as recognized by Harvester with the harvester-admin command:

$ harvester-admin qconf dump <your_PQ>

All parameters of the PQ will be displayed.

E.g.

# /opt/harvester/local/bin/harvester-admin qconf dump CERN-PROD_UCORE_2

CERN-PROD_UCORE_2
-----------------
 allowJobMixture = False
 configID = 57706
 ddmEndpointIn = None
 getJobCriteria = None
 initEventsMultipler = 2
 jobType = ANY
 mapType = NoJob
 maxNewWorkersPerCycle = 10
 maxSubmissionAttempts = 3
 maxWorkers = 250
 nNewWorkers = 0
 nQueueLimitJob = None
 nQueueLimitJobMax = None
 nQueueLimitJobMin = None
 nQueueLimitJobRatio = None
 nQueueLimitWorker = 20
 noHeartbeat = running,transferring,finished,failed
 pandaQueueName = CERN-PROD_UCORE_2
 prefetchEvents = False
 prodSourceLabel = managed
 queueName = CERN-PROD_UCORE_2
 queueStatus = online
 resourceType = ANY
 runMode = self
 ...
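
If the exact PQ name known to Harvester is unclear, the qconf list subcommand of harvester-admin prints all queues currently configured (availability may depend on the harvester-admin version installed):

$ harvester-admin qconf list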

Workflow Mode of the PQ

Confirm which workflow mode the PQ is in by checking the mapType and runMode parameters in the queue config.

Common workflow modes:

  • Pull mode:
    • queue config parameters:
      mapType = NoJob
      runMode = self
      
  • Pull-UPS (Pull & Unified Pilot Streaming) mode:
    • queue config parameters:
      mapType = NoJob
      runMode = slave
      
    • Note that in this mode, worker submission for the PQ is triggered by the PanDA server. That is, if the PanDA server decides that the PQ does not need to run more jobs, it will not tell Harvester to submit! In this case, check the job status in PanDA and the CRIC setup (there may not be enough activated PanDA jobs in the PQ).
  • Push mode:
    • queue config parameters:
      mapType = OneToOne
      runMode = self
      
      mapType can also be OneToMany, ManyToOne, ManyToMany

Make sure one knows which workflow mode the PQ is in.

If the workflow is not the required one, one should modify the queue configuration of the PQ.

See more details about the workflows supported in Harvester.
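
For illustration, if the PQ is defined in a local JSON queue-config file (rather than purely via CRIC; the file name and template setup depend on the installation), a minimal sketch of an entry switching the example PQ above to Pull-UPS mode could look like this:

{
  "CERN-PROD_UCORE_2": {
    "queueStatus": "online",
    "prodSourceLabel": "managed",
    "mapType": "NoJob",
    "runMode": "slave",
    "maxWorkers": 250,
    "maxNewWorkersPerCycle": 10,
    "nQueueLimitWorker": 20
  }
}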

Parameters about Workers and Jobs of the PQ

The following queue config parameters control the number of workers (queuing and total) of the PQ on Harvester:

Overall Worker Caps

  • maxWorkers: Maximum number of unfinished workers allowed in the PQ. If the number of all unfinished workers (submitted + idle + running) exceeds maxWorkers, Harvester will not submit more workers.
  • maxNewWorkersPerCycle: Maximum number of workers to submit to the PQ in each submission cycle. (The submitter cycle frequency and other settings are configured in the [submitter] section of panda_harvester.cfg.)

Queuing Worker Caps

These parameters apply to both pull and push modes. The number of queuing (submitted) workers is limited by one of the following groups of parameters.

Either static:

  • nQueueLimitWorker: Maximum number of queuing workers allowed in the PQ. If the number of queuing (submitted) workers exceeds nQueueLimitWorker, Harvester will not submit more workers.

Or dynamic:

  • nQueueLimitWorkerMin: Minimum number of queuing workers to keep in the PQ. If the number of queuing (submitted) workers is less than nQueueLimitWorkerMin, Harvester will submit more workers until reaching nQueueLimitWorkerMin.
  • nQueueLimitWorkerMax: Maximum number of queuing workers allowed in the PQ. If the number of queuing (submitted) workers exceeds nQueueLimitWorkerMax, Harvester will not submit more workers.
  • nQueueLimitWorkerRatio: The target ratio (in percent) of queuing to running workers. When the number of queuing (submitted) workers is between nQueueLimitWorkerMin and nQueueLimitWorkerMax, Harvester will not submit more workers if number_of_queuing_workers / number_of_running_workers exceeds nQueueLimitWorkerRatio percent (see the sketch after this list).
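
The interplay of these three parameters can be summarized with the following simplified sketch. It is an illustration of the rules described above, not Harvester's actual implementation, and it assumes nQueueLimitWorkerRatio is given in percent (e.g. 50 means 50%):

def allow_more_workers(n_queuing, n_running,
                       n_queue_limit_worker_min,
                       n_queue_limit_worker_max,
                       n_queue_limit_worker_ratio):
    # Simplified sketch of the dynamic queuing-worker caps described above.
    # Not Harvester's actual code; n_queue_limit_worker_ratio is assumed to
    # be a percentage (e.g. 50 means 50%).
    if n_queuing >= n_queue_limit_worker_max:
        # hard cap on queuing workers
        return False
    if n_queuing < n_queue_limit_worker_min:
        # always keep at least the minimum number of queuing workers
        return True
    if n_running > 0 and n_queuing / n_running > n_queue_limit_worker_ratio / 100.0:
        # queuing-to-running ratio already above the target percentage
        return False
    return True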

Queuing Job Caps

For PQs in Push mode, one of the following groups of parameters controls the number of queuing jobs (i.e. jobs prefetched to Harvester); these parameters only work in Push mode. Note that for Push mode, the caps on queuing jobs also limit the number of queuing workers due to the worker-job mapping.

Either static:

  • nQueueLimitJob: Maximum number of queuing jobs in Harvester allowed for the PQ. If the number of queuing jobs exceeds nQueueLimitJob, Harvester will not submit more workers.

Or dynamic:

  • nQueueLimitJobMin: Minimum number of queuing jobs to keep in the PQ. If the number of queuing jobs is less than nQueueLimitJobMin, Harvester will submit more workers until the number of queuing jobs reaches nQueueLimitJobMin.
  • nQueueLimitJobMax: Maximum number of queuing jobs allowed in the PQ. If the number of queuing jobs exceeds nQueueLimitJobMax, Harvester will not submit more workers.
  • nQueueLimitJobRatio: The target ratio (in percent) of queuing to running jobs. When the number of queuing jobs is between nQueueLimitJobMin and nQueueLimitJobMax, Harvester will not submit more workers if number_of_queuing_jobs / number_of_running_jobs exceeds nQueueLimitJobRatio percent.
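
As a purely hypothetical worked example: with nQueueLimitJobMin = 50, nQueueLimitJobMax = 500, nQueueLimitJobRatio = 20 and 1000 running jobs, Harvester would keep fetching jobs while fewer than 50 are queuing, stop once more than 20% of 1000 = 200 jobs are queuing, and never allow more than 500 queuing jobs regardless of the running count.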

One should modify those parameters in the queue config to reach the required scale.

N.B. For Harvester PQs with dynamic queue config (i.e. parameters fetched from CRIC), one can configure those parameters under Associate Parameters on the CRIC PQ page.
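
After changing the values in CRIC (or in a local queue config file), one can verify that Harvester has picked them up, once the queue config has been refreshed (see below), by dumping it again:

$ harvester-admin qconf dump <your_PQ>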

Restart Service

After modifying parameters in the queue config, restart the Harvester service so that the new setup takes effect.

N.B. Harvester refreshes the queue config every 10 minutes, so a change in the queue config will eventually take effect without restarting the service. However, restarting the service guarantees that all agents in Harvester restart and run with the new parameter values at once, which is usually preferable for manual changes. If the change only tweaks the number of workers/jobs (say, nQueueLimitWorker) and one can wait, restarting the service is not necessary.
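
How exactly the service is restarted depends on the deployment. For example, on a systemd-based installation the command would be something like the following (the unit name panda_harvester is only an assumption here; use whatever service unit or init script your deployment defines):

# systemctl restart panda_harvester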
