Allocate host copy on demand once data is moved to the host #711
base: master
Conversation
force-pushed from 2e69d6e to 240cb52
We could replace the special handling of arenas in https://github.com/ICLDisco/parsec/blob/master/parsec/data.c#L49 with this mechanism.
Provide callbacks on the parsec_data_copy_t to allocate and release the host memory. If a data has been evicted and there is only one device using it we can safely release the memory back to the application. Otherwise the memory will remain allocated.
Signed-off-by: Joseph Schuchart <[email protected]>
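For illustration, a minimal sketch of what installing such callbacks on a copy could look like. Only the alloc_cb field and device_private appear in this PR's diff; the release_cb field name and both callback signatures are assumptions.

/* Hedged sketch, not the PR's actual API. */
#include <stdlib.h>
#include "parsec/data.h"   /* header path may differ across PaRSEC versions */

/* Allocate the host backing store for a copy on demand,
 * e.g. from an application-managed pool. */
static void *my_host_alloc(parsec_data_copy_t *copy, size_t size) {
    (void)copy;
    return malloc(size);
}

/* Hand the memory back to the application once no device needs it. */
static void my_host_release(parsec_data_copy_t *copy, void *ptr) {
    (void)copy;
    free(ptr);
}

static void setup_lazy_host_copy(parsec_data_copy_t *host_copy) {
    host_copy->device_private = NULL;         /* nothing allocated yet */
    host_copy->alloc_cb   = my_host_alloc;    /* field shown in this PR's diff */
    host_copy->release_cb = my_host_release;  /* assumed counterpart field */
}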
force-pushed from 240cb52 to 04f24df
If I understand this PR correctly, you are proposing a storage mechanism outside of device memory for data copies (while the data copy remains valid). In this case I would not use the main memory as the backend storage but a new specialized storage device. If such a device exists, then long-lasting copies shall be stowed there; otherwise they will go in main memory.
@@ -1177,6 +1177,7 @@ parsec_default_gpu_stage_in(parsec_gpu_task_t *gtask,
     for(int i = 0; i < task->task_class->nb_flows; i++) {
         if( !(flow_mask & (1U << i)) ) continue;
         source = gtask->sources[i];
+        assert(source->device_private != NULL);
Why is this needed? A predecessor can provide a NULL data on an otherwise valid dependency. How is that case handled?
We cannot stage in data that does not exist. If we get here (because of RW or RO), the application requested data to be pushed onto the device. If the source has no memory then we cannot make up data. I consider this an error.
@@ -1237,6 +1238,12 @@ parsec_default_gpu_stage_out(parsec_gpu_task_t *gtask,
         dir = parsec_device_gpu_transfer_direction_d2d;
     } else {
         dir = parsec_device_gpu_transfer_direction_d2h;
+        if (dest->device_private == NULL && dest->alloc_cb != NULL) {
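For context, a hedged sketch of how the body of that branch might proceed; only the condition itself is from the diff, the callback invocation, the size argument, and the error path are assumptions:

/* Sketch, not the verbatim PR code: materialize the host buffer via the
 * application callback before issuing the D2H transfer. */
if (dest->device_private == NULL && dest->alloc_cb != NULL) {
    dest->device_private = dest->alloc_cb(dest, nb_bytes /* assumed size */);
    if (NULL == dest->device_private) {
        return PARSEC_ERROR;  /* no host memory: stage-out cannot proceed */
    }
}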
So the copy in main memory somehow exists but it points to non-existing private data (i.e., it has never been used before). Why does it exist then? And why do you want to allocate it via a mechanism different from the traditional arena (which is more or less similar to this, except that instead of having a specialized alloc/free per copy we have it per arena)?
Yes. We always have a host copy, which may not have its device_private memory allocated. That is the way we currently provide inputs for flows (instead of data). Unless we overhaul PaRSEC we need to pass a data copy, and I don't want to allocate device_private memory for all of them.
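A hedged sketch of the situation described here: a host copy attached as the flow input, carrying no backing memory until a callback materializes it. PARSEC_OBJ_NEW and parsec_data_copy_attach are existing PaRSEC idioms; treating device 0 as the host is an assumption of this sketch.

#include "parsec/data.h"

/* Create a host-side copy that carries no memory yet; the runtime only
 * materializes it (via alloc_cb) if the data ever lands on the host. */
static parsec_data_copy_t *lazy_host_input(parsec_data_t *data) {
    parsec_data_copy_t *copy = PARSEC_OBJ_NEW(parsec_data_copy_t);
    copy->device_private = NULL;             /* no host buffer yet */
    copy->alloc_cb = my_host_alloc;          /* from the sketch above */
    parsec_data_copy_attach(data, copy, 0);  /* 0 = host device (assumed) */
    return copy;
}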
The only reason we currently pass the CPU copy between tasks is because PaRSEC lacks a proper mechanism to acquire the accelerator data from the CPU (other than the pushout mechanism at the end of the execution of the GPU task). But with the PR that allows data to be sent and received directly from/to GPU memory, there is no reason to ever need the data on the CPU, except for the final step of the computation before returning it to the user (in which case we could force a pushout).
So then an application creates GPU copies and allocates memory on randomly selected devices for all submitted tasks? That defeats the automatic device selection and creates massive pressure on the zone allocators, for no good reason. The clean way is to pass parsec_data_t* instead of data copies as inputs for device tasks and have the runtime figure out which data copies are current and where to allocate any new copies. But that requires a significant rewrite for which no one has time.
If we pass the parsec_data_t* between tasks, then we can only have one single version of the data at any moment, because only the most up-to-date version can be found via the data_t.
What would be such a device? Evicting to other devices is an option, but they might suffer the same memory pressure as the device from which we are evicting, so I'm not sure that is useful... The host memory is the memory that is least used when everything is offloaded to devices, so evicting there makes the most sense.
It would be a specialized storage device, such as disk or NVMe, or something an NVLink away from the accelerator, or at worst, the main memory. No computational support, but a larger memory zone that can be used to relieve pressure from the other, compute-capable but memory-limited, devices.
This is now used in TTG for the MRA implementation to cut out almost all host allocations and allocate host memory on-demand for evicted data or data that is pushed out. There the host memory is allocated from a pinned memory pool.
Changes needed to make it work are in https://github.com/devreal/ttg/compare/serialize-buffer-query...devreal:ttg:serialize-buffer-query-and-copy-alloc?expand=1.
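As an illustration of the TTG usage described above, a hedged sketch of an alloc callback backed by pinned host memory. cudaMallocHost/cudaFreeHost are standard CUDA runtime calls; the callback signature is an assumption, and a real pool would recycle buffers instead of allocating on every call.

#include <cuda_runtime.h>
#include "parsec/data.h"

/* Allocate pinned (page-locked) host memory on demand for an evicted or
 * pushed-out copy; pinned memory keeps D2H transfers asynchronous. */
static void *pinned_host_alloc(parsec_data_copy_t *copy, size_t size) {
    void *ptr = NULL;
    (void)copy;
    if (cudaSuccess != cudaMallocHost(&ptr, size)) return NULL;
    return ptr;
}

static void pinned_host_release(parsec_data_copy_t *copy, void *ptr) {
    (void)copy;
    cudaFreeHost(ptr);
}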