Allocate host copy on demand once data is moved to the host #711
base: master
Conversation
force-pushed from 2e69d6e to 240cb52
We could replace the special handling of arenas in https://github.com/ICLDisco/parsec/blob/master/parsec/data.c#L49 with this mechanism.
Provide callbacks on the parsec_data_copy_t to allocate and release the host memory. If a data has been evicted and there is only one device using it we can safely release the memory back to the application. Otherwise the memory will remain allocated.
Signed-off-by: Joseph Schuchart <[email protected]>
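For illustration, a minimal sketch of what installing such callbacks on a copy could look like. Only the alloc_cb field and device_private appear in this PR's diff; the release_cb field name and both callback signatures are assumptions.

/* Hedged sketch, not the PR's actual API. */
#include <stdlib.h>
#include "parsec/data.h"   /* header path may differ across PaRSEC versions */

/* Allocate the host backing store for a copy on demand,
 * e.g. from an application-managed pool. */
static void *my_host_alloc(parsec_data_copy_t *copy, size_t size) {
    (void)copy;
    return malloc(size);
}

/* Hand the memory back to the application once no device needs it. */
static void my_host_release(parsec_data_copy_t *copy, void *ptr) {
    (void)copy;
    free(ptr);
}

static void setup_lazy_host_copy(parsec_data_copy_t *host_copy) {
    host_copy->device_private = NULL;         /* nothing allocated yet */
    host_copy->alloc_cb   = my_host_alloc;    /* field shown in this PR's diff */
    host_copy->release_cb = my_host_release;  /* assumed counterpart field */
}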
force-pushed from 240cb52 to 04f24df
If I understand this PR correctly, you are proposing a storage mechanism outside of device memory for data copies (while the data copy remains valid). In this case I would not use the main memory as the backend storage but a new specialized storage device. If such a device exists, then long-lasting copies shall be stowed there; otherwise they will go in main memory.
@@ -1177,6 +1177,7 @@ parsec_default_gpu_stage_in(parsec_gpu_task_t *gtask,
     for(int i = 0; i < task->task_class->nb_flows; i++) {
         if( !(flow_mask & (1U << i)) ) continue;
         source = gtask->sources[i];
+        assert(source->device_private != NULL);
Why is this needed? A predecessor can provide a NULL data on an otherwise valid dependency. How is that case handled?
We cannot stage in data that does not exist. If we get here (because of RW or RO), the application requested data to be pushed onto the device. If the source has no memory then we cannot make up data. I consider this an error.
@@ -1237,6 +1238,12 @@ parsec_default_gpu_stage_out(parsec_gpu_task_t *gtask,
         dir = parsec_device_gpu_transfer_direction_d2d;
     } else {
         dir = parsec_device_gpu_transfer_direction_d2h;
+        if (dest->device_private == NULL && dest->alloc_cb != NULL) {
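For context, a hedged sketch of how the body of that branch might proceed; only the condition itself is from the diff, the callback invocation, the size argument, and the error path are assumptions:

/* Sketch, not the verbatim PR code: materialize the host buffer via the
 * application callback before issuing the D2H transfer. */
if (dest->device_private == NULL && dest->alloc_cb != NULL) {
    dest->device_private = dest->alloc_cb(dest, nb_bytes /* assumed size */);
    if (NULL == dest->device_private) {
        return PARSEC_ERROR;  /* no host memory: stage-out cannot proceed */
    }
}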
So the copy in main memory somehow exists but it points to non-existing private data (i.e., it has never been used before). Why does it exist then? And why do you want to allocate it via a mechanism different from the traditional arena (which is more or less similar to this, except that instead of having a specialized alloc/free per copy we have it per arena)?
Yes. We always have a host copy, which may not have its device_private memory allocated. That is the way we currently provide inputs for flows (instead of data). Unless we overhaul PaRSEC we need to pass a data copy, and I don't want to allocate device_private memory for all of them.
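A hedged sketch of the situation described here: a host copy attached as the flow input, carrying no backing memory until a callback materializes it. PARSEC_OBJ_NEW and parsec_data_copy_attach are existing PaRSEC idioms; treating device 0 as the host is an assumption of this sketch.

#include "parsec/data.h"

/* Create a host-side copy that carries no memory yet; the runtime only
 * materializes it (via alloc_cb) if the data ever lands on the host. */
static parsec_data_copy_t *lazy_host_input(parsec_data_t *data) {
    parsec_data_copy_t *copy = PARSEC_OBJ_NEW(parsec_data_copy_t);
    copy->device_private = NULL;             /* no host buffer yet */
    copy->alloc_cb = my_host_alloc;          /* from the sketch above */
    parsec_data_copy_attach(data, copy, 0);  /* 0 = host device (assumed) */
    return copy;
}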
The only reason we currently pass the CPU copy between tasks is because PaRSEC lacks a proper mechanism to acquire the accelerator data from the CPU (other than the pushout mechanism at the end of the execution of the GPU task). But with the PR that allows data to be sent and received directly from/to GPU memory, there is no reason to ever need the data on the CPU, except for the final step of the computation before returning it to the user (in which case we could force a pushout).
So then an application creates GPU copies and allocates memory on randomly selected devices for all submitted tasks? That defeats the automatic device selection and creates massive pressure on the zone allocators, for no good reason. The clean way is to pass parsec_data_t* instead of data copies as inputs for device tasks and have the runtime figure out which data copies are current and where to allocate any new copies. But that requires a significant rewrite for which no one has time.
If we pass the parsec_data_t* between tasks, then we can only have one single version of the data at any moment, because only the most up-to-date version can be found via the data_t.
What would be such a device? Evicting to other devices is an option, but they might suffer the same memory pressure as the device from which we are evicting, so I'm not sure that is useful... The host memory is the memory that is least used when everything is offloaded to devices, so evicting there makes the most sense.
It would be a specialized storage device, such as disk or NVMe, or something an NVLink away from the accelerator, or at worst, the main memory. No computational support, but a larger memory zone that can be used to relieve pressure from the other, compute-capable but memory-limited, devices.
This is now used in TTG for the MRA implementation to cut out almost all host allocations and allocate host memory on-demand for evicted data or data that is pushed out. There the host memory is allocated from a pinned memory pool.
Changes needed to make it work are in https://github.com/devreal/ttg/compare/serialize-buffer-query...devreal:ttg:serialize-buffer-query-and-copy-alloc?expand=1.
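As an illustration of the TTG usage described above, a hedged sketch of an alloc callback backed by pinned host memory. cudaMallocHost/cudaFreeHost are standard CUDA runtime calls; the callback signature is an assumption, and a real pool would recycle buffers instead of allocating on every call.

#include <cuda_runtime.h>
#include "parsec/data.h"

/* Allocate pinned (page-locked) host memory on demand for an evicted or
 * pushed-out copy; pinned memory keeps D2H transfers asynchronous. */
static void *pinned_host_alloc(parsec_data_copy_t *copy, size_t size) {
    void *ptr = NULL;
    (void)copy;
    if (cudaSuccess != cudaMallocHost(&ptr, size)) return NULL;
    return ptr;
}

static void pinned_host_release(parsec_data_copy_t *copy, void *ptr) {
    (void)copy;
    cudaFreeHost(ptr);
}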