
Load balancing across GPUs #1

Open
wants to merge 216 commits into master
Conversation

josephjohnjj

The manager of a GPU device identifies starvation in other GPUs within the node and migrates tasks to the starving GPU.
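As a rough illustration of the idea (all types, names, and the starvation threshold below are hypothetical, not the actual PaRSEC API):

```c
/* Sketch of the core idea: each GPU has a manager that watches the other
 * GPUs of the node and flags the ones that are starving, so that tasks
 * can be migrated to them.  All names here are illustrative only. */
#include <stdio.h>

typedef struct {
    int device_index;
    int pending_tasks;   /* tasks currently queued on this GPU */
} gpu_device_t;

#define STARVATION_THRESHOLD 1   /* assumed tunable */

int is_starving(const gpu_device_t *dev)
{
    return dev->pending_tasks < STARVATION_THRESHOLD;
}

int main(void)
{
    gpu_device_t node_gpus[3] = { {0, 12}, {1, 0}, {2, 5} };
    for (int i = 0; i < 3; i++)
        if (is_starving(&node_gpus[i]))
            printf("GPU %d is starving: candidate to receive migrated tasks\n",
                   node_gpus[i].device_index);
    return 0;
}
```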

josephjohnjj and others added 30 commits April 2, 2022 08:50
Profiling is done for each task execution, and the required information about the executed task is written to the trace. The execution time is not explicitly calculated, as it can be derived from the begin and end of each event in the trace.
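Recovering the execution time from the trace amounts to subtracting the begin timestamp from the end timestamp of an event; a minimal sketch (the event layout below is made up, not PaRSEC's trace format):

```c
/* Recover a task's execution time from the begin/end timestamps of its
 * trace event.  The event struct is illustrative only. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t begin_ns;  /* timestamp recorded when the task started  */
    uint64_t end_ns;    /* timestamp recorded when the task finished */
    int      task_class;
} trace_event_t;

uint64_t event_duration_ns(const trace_event_t *ev)
{
    return ev->end_ns - ev->begin_ns;
}

int main(void)
{
    trace_event_t ev = { .begin_ns = 1000, .end_ns = 4500, .task_class = 7 };
    printf("task class %d ran for %llu ns\n",
           ev.task_class, (unsigned long long)event_duration_ns(&ev));
    return 0;
}
```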
same-name cache variable and breaks findHWLOC from PaRSECConfig.cmake

Signed-off-by: Aurelien Bouteiller <[email protected]>
find_data_size() was using the number of parameters to find the total data the task was operating on. This was corrected; we now use the number of flows to calculate the total data.
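The change is from counting declared parameters to summing over the task's data flows; a hedged sketch with made-up structures (not the real PaRSEC task descriptors):

```c
/* Compute the total data a task operates on by iterating over its
 * flows rather than its parameter count.  Structures are illustrative. */
#include <stddef.h>

typedef struct {
    size_t data_size;   /* bytes moved by this flow */
} flow_t;

typedef struct {
    int     nb_flows;
    flow_t *flows;
} task_t;

size_t find_data_size(const task_t *task)
{
    size_t total = 0;
    for (int i = 0; i < task->nb_flows; i++)
        total += task->flows[i].data_size;  /* per-flow size, not per-parameter */
    return total;
}
```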
The first arena is queried for the task data size. If the arena is NULL, the original is queried for the data size.
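A small sketch of that fallback, with assumed field names:

```c
/* Per-flow data size lookup: prefer the first arena, fall back to the
 * original data copy when the arena is NULL.  Field names are assumed. */
#include <stddef.h>

typedef struct { size_t size; } arena_t;
typedef struct { size_t size; } original_t;

typedef struct {
    arena_t    *arena;     /* may be NULL */
    original_t *original;  /* always present */
} task_data_t;

size_t task_data_size(const task_data_t *d)
{
    if (d->arena != NULL)
        return d->arena->size;      /* first choice: the arena   */
    return d->original->size;       /* fallback: the original    */
}
```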
…ar device queue.

A device queue (migrated_task_list) is created for each GPU device to hold migrated tasks.
The parsec_cuda_kernel_schedule() function schedules a migrated task to the correct queue.
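A minimal sketch of a per-device migrated-task queue and a schedule routine modelled on the description above; the list and locking below are simplified stand-ins for PaRSEC's internal list types, not the actual implementation:

```c
/* One migrated-task queue per GPU device; the schedule routine pushes a
 * migrated task onto the queue of its destination device. */
#include <pthread.h>

typedef struct task_s {
    struct task_s *next;
    int payload;
} task_t;

typedef struct {
    int             device_index;
    task_t         *migrated_task_list;  /* head of the migrated-task queue */
    pthread_mutex_t lock;
} gpu_device_t;

/* Push a migrated task onto the queue of the chosen device. */
void cuda_kernel_schedule(gpu_device_t *dev, task_t *task)
{
    pthread_mutex_lock(&dev->lock);
    task->next = dev->migrated_task_list;
    dev->migrated_task_list = task;
    pthread_mutex_unlock(&dev->lock);
}
```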
…. parsec_cuda_kernel_dequeue() schedules a migrated task to the GPU device.

This function will be called in __parsec_context_wait() just before parsec_current_scheduler->module.select().
This ensures that migrated tasks get priority over new tasks. When a compute thread calls this function,
it is forced to try to become the manager of the device. If the device already has a manager, the compute thread passes
control of the task to the manager; if not, the compute thread becomes the manager.
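The "first thread in becomes the manager" step can be pictured as an atomic compare-and-swap on a per-device manager slot; a sketch under assumed names and layout, not the actual PaRSEC code:

```c
/* Manager election: a compute thread reaching a device either becomes
 * its manager (the slot was empty) or hands the task off to the
 * existing manager.  All names are assumptions. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int payload; } task_t;

typedef struct {
    _Atomic(int) has_manager;   /* 0: no manager, 1: a thread is managing */
} gpu_device_t;

/* Returns true if the calling thread is now the device's manager. */
bool try_become_manager(gpu_device_t *dev)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&dev->has_manager, &expected, 1);
}

void offload_task(gpu_device_t *dev, task_t *task)
{
    if (try_become_manager(dev)) {
        /* this thread now drains the migrated-task queue before new tasks */
    } else {
        /* a manager already exists: pass the task to it instead           */
    }
    (void)task;
}
```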
…e a thread to be a manager thread, if there are any tasks migrated to a particular device. This also ensures that a migrated task gets priority in execution compared to a new task.

Using migrate_if_starving(), the manager checks whether there are starving devices. This check is done before
a new task is selected for execution. If there are starving devices and tasks available to migrate, the manager
migrates tasks to the starving device.
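The ordering could look roughly like the sketch below; the starvation criterion and helpers are assumptions, not the real migrate_if_starving() internals:

```c
/* The manager checks for starving peers before selecting a new task
 * for its own device.  Everything here is illustrative. */
#include <stddef.h>

typedef struct { int device_index; int pending_tasks; } gpu_device_t;
typedef struct { int payload; } task_t;

/* Give one spare task per starving peer; returns how many were moved. */
int migrate_if_starving(gpu_device_t *dealer, gpu_device_t **peers, int npeers)
{
    int moved = 0;
    for (int i = 0; i < npeers; i++) {
        if (peers[i] == dealer || peers[i]->pending_tasks > 0) continue;
        if (dealer->pending_tasks <= 1) break;    /* nothing spare to give */
        dealer->pending_tasks--;
        peers[i]->pending_tasks++;
        moved++;
    }
    return moved;
}

/* Placeholder for the scheduler's task selection. */
task_t *select_new_task(gpu_device_t *dev) { (void)dev; return NULL; }

task_t *manager_next_task(gpu_device_t *dealer, gpu_device_t **peers, int npeers)
{
    migrate_if_starving(dealer, peers, npeers);  /* 1. help starving devices  */
    return select_new_task(dealer);              /* 2. then pick our own task */
}
```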
Decrement the GPU task count when a task is executed by the GPU.
Decrement the task count at the dealer GPU when a task is migrated.
Increment the task count at the starving GPU when the migrated task is received.
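A sketch of this bookkeeping with per-GPU counters updated atomically; the field names are assumptions:

```c
/* Per-GPU task counters updated on execution, on migration out
 * (dealer side), and on migration in (starving side). */
#include <stdatomic.h>

typedef struct {
    _Atomic(int) task_count;   /* tasks currently owned by this GPU */
} gpu_device_t;

void on_task_executed(gpu_device_t *gpu)
{
    atomic_fetch_sub(&gpu->task_count, 1);        /* task left the GPU       */
}

void on_task_migrated(gpu_device_t *dealer, gpu_device_t *starving)
{
    atomic_fetch_sub(&dealer->task_count, 1);     /* dealer gives a task up  */
    atomic_fetch_add(&starving->task_count, 1);   /* starving GPU receives it */
}
```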
If CUDA is available, the CUDA NVML library will be available
along with it.
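For illustration of that point: NVML ships alongside the CUDA driver/toolkit and can report per-GPU utilization. Whether the PR uses it exactly this way is not shown in this excerpt; the probe below only demonstrates the library that becomes available (build with `-lnvidia-ml`):

```c
/* Query per-GPU utilization via NVML, which is available whenever the
 * CUDA driver stack is installed. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
            nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("GPU %u: %u%% busy, %u%% memory\n", i, util.gpu, util.memory);
    }
    nvmlShutdown();
    return 0;
}
```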
…ex, while the correct index was dealer_device->super.device_index-2
This policy makes sure that only tasks with an affinity to
the starving device are migrated.
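One way to picture such an affinity check, with illustrative task and data descriptors (not PaRSEC's):

```c
/* Only migrate a task if some of the data it touches already lives on
 * the starving device.  Descriptors are illustrative. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int owner_device;            /* device currently holding this data copy */
} data_copy_t;

typedef struct {
    int          nb_flows;       /* up to 8 flows in this sketch */
    data_copy_t *flow_data[8];
} task_t;

bool has_affinity(const task_t *task, int starving_device)
{
    for (int i = 0; i < task->nb_flows; i++)
        if (task->flow_data[i] != NULL &&
            task->flow_data[i]->owner_device == starving_device)
            return true;         /* some input already resides on that device */
    return false;
}
```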
should be carried out by the GPU manager thread or some other
worker thread.
The second thread that offloads a task to the device
is transitioned to a manager that handles only task migration.
Only to be used for testing purposes.
…eams in the device.

2. Same starvation condition used on both the dealer and the starving device.
3. Functions and variables renamed.
4. More statistics added.
There is no need to track task count at every stage.
before the parsec_complete() was called. This was corrected.
mapped for iterative applications.
parsec_cuda_iterative = 1 maps only migrated tasks, while
parsec_cuda_iterative = 2 maps all tasks.
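A small sketch of the stated semantics of this knob; the surrounding machinery and helper names are hypothetical:

```c
/* parsec_cuda_iterative = 1: remember the device mapping of migrated
 * tasks only; = 2: remember it for all tasks; = 0: feature off.
 * Only the knob's name and values come from the PR text. */
#include <stdbool.h>

int parsec_cuda_iterative = 1;   /* 0, 1 or 2 */

typedef struct {
    bool was_migrated;
    int  device_index;
} task_t;

/* Decide whether this task's device mapping should be remembered
 * for later iterations (hypothetical helper). */
bool record_mapping(const task_t *task)
{
    if (parsec_cuda_iterative == 2) return true;               /* all tasks     */
    if (parsec_cuda_iterative == 1) return task->was_migrated; /* migrated only */
    return false;                                              /* feature off   */
}
```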
2. Add zone_free data to gpu_mem_lru when migrating.
3. Code cleanup.
4. Documentation updated.
Memory leak problem addressed by using a temporary list.