Scale up submission
When one finds that Harvester does not submit enough workers (in a cycle) to fill a PQ, go through the following steps.
Acronyms:
- PQ = PanDA Queue
First of all, check the queue configuration of the PQ as recognized by Harvester with the harvester-admin command:
```
$ harvester-admin qconf dump <your_PQ>
```
All parameters of the PQ will be displayed, e.g.:
```
# /opt/harvester/local/bin/harvester-admin qconf dump CERN-PROD_UCORE_2
CERN-PROD_UCORE_2
-----------------
allowJobMixture = False
configID = 57706
ddmEndpointIn = None
getJobCriteria = None
initEventsMultipler = 2
jobType = ANY
mapType = NoJob
maxNewWorkersPerCycle = 10
maxSubmissionAttempts = 3
maxWorkers = 250
nNewWorkers = 0
nQueueLimitJob = None
nQueueLimitJobMax = None
nQueueLimitJobMin = None
nQueueLimitJobRatio = None
nQueueLimitWorker = 20
noHeartbeat = running,transferring,finished,failed
pandaQueueName = CERN-PROD_UCORE_2
prefetchEvents = False
prodSourceLabel = managed
queueName = CERN-PROD_UCORE_2
queueStatus = online
resourceType = ANY
runMode = self
...
```
Confirm which workflow mode the PQ is in by checking the `mapType` and `runMode` parameters in the queue config (a quick way to do this is sketched after the list below).
Common workflow modes:
- Pull mode:
  - queue config parameters: `mapType = NoJob`, `runMode = self`
- Pull-UPS (Pull & Unified Pilot Streaming) mode:
  - queue config parameters: `mapType = NoJob`, `runMode = slave`
  - Note that in this mode, submission of workers for the PQ is triggered by the PanDA server. I.e. if the PanDA server thinks that the PQ does not need to run more jobs, it will not tell Harvester to submit! In this case, check the job status in PanDA and the CRIC setup (maybe there are not enough activated PanDA jobs in the PQ).
- Push mode:
  - queue config parameters: `mapType = OneToOne`, `runMode = self`
  - `mapType` can also be `OneToMany`, `ManyToOne`, or `ManyToMany`
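For convenience, here is a minimal sketch of that check, assuming `harvester-admin` is on `PATH` (the example output above shows it under `/opt/harvester/local/bin/`) and reusing the example queue name; it simply parses the `qconf dump` output and maps `mapType`/`runMode` to one of the modes listed. This is an illustration, not an official Harvester tool.

```python
#!/usr/bin/env python
"""Sketch: identify the workflow mode of a PQ from 'harvester-admin qconf dump' output."""
import re
import subprocess
import sys

# Queue name is an example; pass your own PQ as the first argument.
queue = sys.argv[1] if len(sys.argv) > 1 else 'CERN-PROD_UCORE_2'

# Run the dump command shown above and capture its text output.
dump = subprocess.run(['harvester-admin', 'qconf', 'dump', queue],
                      capture_output=True, text=True, check=True).stdout

# Pick out the two parameters that identify the workflow mode.
params = dict(re.findall(r'^(mapType|runMode)\s*=\s*(\S+)', dump, re.M))
map_type, run_mode = params.get('mapType'), params.get('runMode')

if map_type == 'NoJob' and run_mode == 'self':
    mode = 'Pull'
elif map_type == 'NoJob' and run_mode == 'slave':
    mode = 'Pull-UPS'
elif run_mode == 'self':
    mode = 'Push'
else:
    mode = 'unknown'
print('mapType=%s runMode=%s -> %s mode' % (map_type, run_mode, mode))
```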
Make sure one knows which workflow mode the PQ is in.
If the workflow is not the one required, modify the queue configuration of the PQ accordingly.
See more details about the workflows supported in Harvester.
The following queue config parameters control the number of workers (queuing & total) of the PQ on Harvester:
- `maxWorkers`: Maximum unfinished workers allowed in the PQ. If the number of all unfinished workers (submitted + idle + running) exceeds `maxWorkers`, Harvester will not submit more workers.
- `maxNewWorkersPerCycle`: Maximum workers to submit to the PQ in every submission cycle. (The submitter cycle frequency and other settings are configured in the `[submitter]` section of `panda_harvester.cfg`.)
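As a rough illustration of how these two caps combine, here is a small sketch of the rules stated above (this is not Harvester's actual implementation, and all numbers are made up):

```python
def new_workers_this_cycle(n_running, n_queuing,
                           max_workers=250, max_new_workers_per_cycle=10):
    """Sketch: cap new submissions by maxWorkers and maxNewWorkersPerCycle."""
    n_unfinished = n_running + n_queuing           # submitted + idle + running
    headroom = max(0, max_workers - n_unfinished)  # stay below maxWorkers
    return min(headroom, max_new_workers_per_cycle)

# 240 unfinished workers leave headroom for 10 more before hitting maxWorkers=250;
# with a larger headroom, maxNewWorkersPerCycle=10 would still cap the cycle at 10.
print(new_workers_this_cycle(n_running=200, n_queuing=40))  # -> 10
```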
These parameters apply to both pull and push modes. The number of queuing (submitted) workers is limited by one of the following groups of parameters.
Either static:
- `nQueueLimitWorker`: Maximum queuing workers allowed in the PQ. If the number of queuing (submitted) workers exceeds `nQueueLimitWorker`, Harvester will not submit more workers.

Or dynamic:
- `nQueueLimitWorkerMin`: Minimum queuing workers to keep in the PQ. If the number of queuing (submitted) workers is less than `nQueueLimitWorkerMin`, Harvester will submit more workers until reaching `nQueueLimitWorkerMin`.
- `nQueueLimitWorkerMax`: Maximum queuing workers allowed in the PQ. If the number of queuing (submitted) workers exceeds `nQueueLimitWorkerMax`, Harvester will not submit more workers.
- `nQueueLimitWorkerRatio`: The target ratio, as a percentage, of queuing to running workers. When the number of queuing (submitted) workers is between `nQueueLimitWorkerMin` and `nQueueLimitWorkerMax`, if number_of_queuing_workers / number_of_running_workers exceeds `nQueueLimitWorkerRatio` percent, Harvester will not submit more workers.
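The dynamic group can be pictured as a cap that follows the number of running workers at roughly `nQueueLimitWorkerRatio` percent, clamped between the Min and Max. A minimal sketch of that rule, with made-up parameter values (again, not Harvester's actual code):

```python
def dynamic_queue_limit_worker(n_running,
                               n_queue_limit_worker_min=20,
                               n_queue_limit_worker_max=200,
                               n_queue_limit_worker_ratio=50):
    """Sketch: effective cap on queuing workers under the dynamic parameters."""
    target = n_running * n_queue_limit_worker_ratio / 100.0  # ratio percent of running
    return int(min(max(target, n_queue_limit_worker_min), n_queue_limit_worker_max))

# Few running workers -> the Min keeps a floor; many running workers -> the Max caps it.
for n_running in (0, 100, 1000):
    print(n_running, dynamic_queue_limit_worker(n_running))  # 0 -> 20, 100 -> 50, 1000 -> 200
```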
For PQs in Push mode, one of the following groups of parameters (which only work for Push mode) controls the number of queuing jobs (and of jobs prefetched to Harvester). Note that in Push mode, caps on queuing jobs also limit the number of queuing workers due to worker-job mapping (see the sketch after the list below).
Either static:
- `nQueueLimitJob`: Maximum queuing jobs in Harvester allowed for the PQ. If the number of queuing jobs exceeds `nQueueLimitJob`, Harvester will not submit more workers.

Or dynamic:
- `nQueueLimitJobMin`: Minimum queuing jobs to keep in the PQ. If the number of queuing jobs is less than `nQueueLimitJobMin`, Harvester will submit more workers until the number of queuing jobs reaches `nQueueLimitJobMin`.
- `nQueueLimitJobMax`: Maximum queuing jobs allowed in the PQ. If the number of queuing jobs exceeds `nQueueLimitJobMax`, Harvester will not submit more workers.
- `nQueueLimitJobRatio`: The target ratio, as a percentage, of queuing to running jobs. When the number of queuing jobs is between `nQueueLimitJobMin` and `nQueueLimitJobMax`, if number_of_queuing_jobs / number_of_running_jobs exceeds `nQueueLimitJobRatio` percent, Harvester will not submit more workers.
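To see why the job caps also bound queuing workers in Push mode, consider `mapType = OneToOne`, where each queuing worker carries exactly one job. A short sketch with made-up numbers:

```python
# In Push mode with mapType = OneToOne, one queuing job maps to one queuing worker,
# so the effective cap on queuing workers is the tighter of the two caps
# (all values below are made up for illustration).
n_queue_limit_worker = 20   # cap on queuing workers
n_queue_limit_job = 50      # cap on queuing jobs (Push mode only)
effective_cap = min(n_queue_limit_worker, n_queue_limit_job)
print(effective_cap)  # -> 20: here the worker cap is the binding limit
```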
One should modify those parameters in the queue config to meet the required scale.
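For an instance with a local (static) queue config, the change can be as simple as editing the JSON queue config file. A minimal sketch follows; the file path and the flat per-queue parameter layout are assumptions, so check where your own `panda_harvester.cfg` points for the queue config:

```python
import json

# Assumed path to the local queue config; adjust to your installation
# (this path is an assumption, not taken from the text above).
path = '/opt/harvester/etc/panda/panda_queueconfig.json'

with open(path) as f:
    qconf = json.load(f)

# Raise the caps for one PQ; in this sketch each queue entry is a flat
# dict of the parameters shown by "harvester-admin qconf dump".
qconf.setdefault('CERN-PROD_UCORE_2', {}).update({
    'maxWorkers': 500,
    'maxNewWorkersPerCycle': 20,
    'nQueueLimitWorker': 50,
})

with open(path, 'w') as f:
    json.dump(qconf, f, indent=2, sort_keys=True)
```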
N.B. For Harvester PQs with dynamic queue config (i.e. parameters fetched from CRIC), one can configure these parameters under Associate Parameters on the CRIC PQ page.
After modifying parameters in the queue config, restart the Harvester service so that the new setup takes effect.
N.B. Harvester refreshes the queue config every 10 minutes, so a change in the queue config will eventually take effect without restarting the service. However, restarting the service guarantees that all agents in Harvester restart and run with the new parameter values at once, which is usually better for manual changes. If the change only tweaks the number of workers/jobs (say, `nQueueLimitWorker`) and one can wait, then there is no need to restart the service.