Allocate host copy on demand once data is moved to the host #711

Open · devreal wants to merge 1 commit into master from data_copy_alloc_callbacks
Conversation

@devreal (Contributor) commented Dec 5, 2024

Provide callbacks on the parsec_data_copy_t to allocate and release the host memory. If a data has been evicted and there is only one device using it, we can safely release the memory back to the application. Otherwise the memory will remain allocated.
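A minimal sketch of what the proposed hooks might look like. Only the field name alloc_cb is visible in the stage_out diff further down; the release callback name and both signatures here are assumptions for illustration, using a stand-in sketch type rather than the real parsec_data_copy_t:

#include <stddef.h>

/* Hypothetical sketch of the proposed per-copy host-memory callbacks. */
typedef struct parsec_data_copy_sketch_s parsec_data_copy_sketch_t;

typedef void *(*copy_alloc_cb_t)(parsec_data_copy_sketch_t *copy, size_t size);
typedef void  (*copy_release_cb_t)(parsec_data_copy_sketch_t *copy);

struct parsec_data_copy_sketch_s {
    void             *device_private; /* host pointer; NULL until allocated on demand */
    copy_alloc_cb_t   alloc_cb;       /* allocates host memory when the copy is materialized */
    copy_release_cb_t release_cb;     /* hands the host memory back to the application */
};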

@devreal devreal requested a review from a team as a code owner December 5, 2024 19:33
@devreal force-pushed the data_copy_alloc_callbacks branch from 2e69d6e to 240cb52 on December 5, 2024 21:05
@devreal (Contributor, Author) commented Dec 5, 2024

We could replace the special handling of arenas in https://github.com/ICLDisco/parsec/blob/master/parsec/data.c#L49 with this mechanism.
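As a hypothetical illustration (not code from this PR), the arena path could install default callbacks on each copy; malloc/free stand in for the real arena allocator, and the stand-in sketch type from above is reused:

#include <stdlib.h>

/* Stand-in default callbacks; a real version would defer to the
 * copy's arena instead of malloc/free. */
static void *default_host_alloc_cb(parsec_data_copy_sketch_t *copy, size_t size) {
    (void)copy;
    return malloc(size);
}

static void default_host_release_cb(parsec_data_copy_sketch_t *copy) {
    free(copy->device_private);
    copy->device_private = NULL;
}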

@devreal force-pushed the data_copy_alloc_callbacks branch from 240cb52 to 04f24df on December 5, 2024 21:55
@bosilca (Contributor) left a comment


If I understand this PR correctly, you are proposing a storage mechanism outside of device memory for data copies (through which the data copy remains valid). In this case I would not use the main memory as the backend storage but a new specialized device for storage. If such a device exists, then long-lasting copies shall be stowed there; otherwise they will go to main memory.

@@ -1177,6 +1177,7 @@ parsec_default_gpu_stage_in(parsec_gpu_task_t *gtask,
for(int i = 0; i < task->task_class->nb_flows; i++) {
if( !(flow_mask & (1U << i)) ) continue;
source = gtask->sources[i];
assert(source->device_private != NULL);
bosilca (Contributor):

Why is this needed? A predecessor can provide a NULL data on an otherwise valid dependency. How is that case handled?

devreal (Contributor, Author):

We cannot stage in data that does not exist. If we get here (because of RW or RO), the application requested data to be pushed onto the device. If the source has no memory then we cannot make up data. I consider this an error.

@@ -1237,6 +1238,12 @@ parsec_default_gpu_stage_out(parsec_gpu_task_t *gtask,
dir = parsec_device_gpu_transfer_direction_d2d;
} else {
dir = parsec_device_gpu_transfer_direction_d2h;
if (dest->device_private == NULL && dest->alloc_cb != NULL) {
bosilca (Contributor):

So the copy in main memory somehow exists, but it points to non-existing private data (i.e., it has never been used before). Why does it exist then? And why do you want to allocate it via a mechanism different from the traditional arena (which is more or less similar to this, but instead of having a specialized alloc/free per copy we have it per arena)?

devreal (Contributor, Author):

Yes. We always have a host copy, which may not have its device_private memory allocated. That is the way we currently provide inputs for flows (instead of data). Unless we overhaul PaRSEC we need to pass a data copy, and I don't want to allocate device_private memory for all of them.

bosilca (Contributor):

The only reason we currently pass the CPU copy between tasks is because PaRSEC lacks a proper mechanism to acquire the accelerator data from the CPU (other than the pushout mechanism at the end of the execution of the GPU task). But with the PR that allows data to be sent and received directly from/to GPU memory, there is no reason to ever need the data on the CPU, except for the final step of the computation before returning it to the user (in which case we could force a pushout).

devreal (Contributor, Author):

So then an application creates GPU copies and allocates memory on randomly selected devices for all tasks submitted? That defeats the automatic device selection and creates massive pressure on the zone allocators, for no good reason. The clean way is to pass parsec_data_t* instead of data copies as inputs for device tasks and have the runtime figure out which data copies are current and where to allocate any copies. But that requires a significant rewrite for which no one has time.

bosilca (Contributor):

If we pass the parsec_data_t* between tasks, then we can only have a single version of the data at any moment, because only the most up-to-date version can be found via the data_t.
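To make the stage-out condition from the diff above concrete, here is a minimal sketch of the on-demand host allocation, reusing the stand-in sketch type from above; the size parameter and the transfer itself are placeholders, not PaRSEC's actual code:

/* Sketch: allocate the host copy lazily when a d2h transfer needs it. */
static int stage_out_sketch(parsec_data_copy_sketch_t *dest, size_t size) {
    if (dest->device_private == NULL && dest->alloc_cb != NULL) {
        dest->device_private = dest->alloc_cb(dest, size);
        if (NULL == dest->device_private)
            return -1; /* allocation failed, cannot stage out */
    }
    /* ... enqueue the device-to-host transfer into dest->device_private ... */
    return 0;
}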

@devreal (Contributor, Author) commented Dec 6, 2024

> If I understand this PR correctly, you are proposing a storage mechanism outside of device memory for data copies (through which the data copy remains valid). In this case I would not use the main memory as the backend storage but a new specialized device for storage. If such a device exists, then long-lasting copies shall be stowed there; otherwise they will go to main memory.

What would such a device be? Evicting to other devices is an option, but they might suffer the same memory pressure as the device from which we are evicting, so I'm not sure that is useful... The host memory is the memory that is least used when everything is offloaded to devices, so evicting there makes the most sense.

@bosilca (Contributor) commented Dec 6, 2024

It would be a specialized storage device, such as disk or NVMe, or something an NVLINK away from the accelerator, or, at worst, the main memory. No computational support, but a larger memory zone that can be used to relieve pressure from the other, compute-capable but memory-limited, devices.

@devreal (Contributor, Author) left a comment

This is now used in TTG for the MRA implementation to cut out almost all host allocations and allocate host memory on demand for evicted data or data that is pushed out. There, the host memory is allocated from a pinned memory pool.

Changes needed to make it work are in https://github.com/devreal/ttg/compare/serialize-buffer-query...devreal:ttg:serialize-buffer-query-and-copy-alloc?expand=1.
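As a hedged illustration of that usage (not TTG's actual code), callbacks like these could provide pinned host memory via the CUDA runtime, again reusing the stand-in sketch type from above; a real implementation would recycle buffers through a pool rather than call cudaMallocHost each time:

#include <cuda_runtime.h>

/* Hypothetical pinned-memory callbacks for a data copy. */
static void *pinned_alloc_cb(parsec_data_copy_sketch_t *copy, size_t size) {
    void *ptr = NULL;
    (void)copy;
    if (cudaSuccess != cudaMallocHost(&ptr, size))
        return NULL; /* caller must handle allocation failure */
    return ptr;
}

static void pinned_release_cb(parsec_data_copy_sketch_t *copy) {
    cudaFreeHost(copy->device_private);
    copy->device_private = NULL;
}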

@devreal changed the title from "[WIP] Allocate host copy on demand once data is moved to the host" to "Allocate host copy on demand once data is moved to the host" on Dec 10, 2024